LLMpedia: the first transparent, open encyclopedia generated by LLMs

UCX

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Horovod (Hop 5)
Expansion Funnel: Raw 59 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 59
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
UCX
Name: UCX
Developer: Lawrence Berkeley National Laboratory; Sandia National Laboratories; Intel Corporation; NVIDIA Corporation
Initial release: 2014
Platform: Linux; IBM POWER; ARM architecture
License: BSD license

UCX

UCX is an open-source communication framework designed for high-performance computing and data-intensive applications. It provides a unified set of primitives and transport-agnostic APIs to enable low-latency, high-bandwidth messaging and memory interoperability across heterogeneous hardware and software stacks. UCX aims to integrate with existing libraries and runtimes used in large-scale scientific computing and enterprise datacenters.

Overview

UCX originated from collaborations among Lawrence Berkeley National Laboratory, Sandia National Laboratories, Intel Corporation, and NVIDIA Corporation to address communication bottlenecks observed in exascale computing initiatives such as the Exascale Computing Project. It targets interoperability with MPI implementations, OpenSHMEM, PGAS runtimes, and storage frameworks. UCX abstracts transport layers including InfiniBand, Ethernet, RoCE, and vendor interconnects such as Mellanox Technologies devices, exposing consistent semantics for peer-to-peer transfers, remote memory access, and active message delivery. Major supercomputing centers and vendors in the TOP500 ecosystem use UCX to optimize interconnect utilization and reduce software overhead in HPC stacks.

Architecture and Components

The UCX architecture is modular, composed of layered components: a core abstraction layer, transport adapters, protocol engines, and user-facing APIs. The core manages worker contexts, endpoints, and memory domains to represent resources such as OFED devices, CUDA-enabled accelerators, and host NICs. Transport adapters implement low-level drivers for fabrics like InfiniBand, RoCEv2, Ethernet with DPDK, and shared-memory mechanisms on multicore x86_64 and ARM nodes. Protocol engines provide rendezvous, eager, and zero-copy transfers, coordinating with memory registration services that interact with Linux kernel subsystems and device-specific RDMA capabilities. The design permits integration with asynchronous event models used by libevent, PMIx, and container orchestration platforms like Kubernetes.

Communication Protocols and APIs

UCX exposes APIs for connection management, tag-matching, stream semantics, active messages, and Remote Memory Access (RMA). The API set supports one-sided operations similar to SHMEM semantics and two-sided semantics useful for MPI implementations such as Open MPI and MPICH. Underlying protocols include eager short-message, rendezvous for large messages, and atomic operations mapped to hardware atomics provided by vendors like Intel and AMD. UCX also implements callbacks and progress function hooks compatible with runtime schedulers used in Slurm and Flux job managers. Interoperability layers allow UCX to serve as a transport backend for applications and frameworks such as GROMACS, LAMMPS, TensorFlow, and PyTorch by adapting their communication patterns to UCX primitives.

Performance and Scalability

UCX emphasizes low-latency paths and minimal CPU overhead to achieve near-hardware performance. Benchmarks on systems such as the Summit and Fugaku supercomputers, and on cloud instances with Mellanox SR-IOV, show improvements in bandwidth and reduced message latency compared to traditional socket-based stacks. Techniques like zero-copy RDMA, GPU-direct transfers for NVIDIA GPUs, and kernel-bypass via DPDK and OFED contribute to throughput gains. Scalability features include endpoint pooling, connectionless transports for large process counts, and collective offload support when integrated with network hardware that provides offload capabilities, as found in modern InfiniBand HCAs. UCX also provides tunable parameters for eager thresholds, rendezvous limits, and congestion control to adapt to wide-area and exascale networks.
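The tunable parameters mentioned above are typically set through UCX environment variables at launch time. A minimal sketch, assuming a verbs-capable Linux node: `ucx_info`, `UCX_TLS`, and `UCX_RNDV_THRESH` ship with UCX, while the chosen values and the application name are placeholders.

```shell
# Inspect what UCX detected on this node (devices and transports).
ucx_info -d

# Illustrative tuning; values depend on fabric and workload.
export UCX_TLS=rc,sm,self    # restrict transports: RC verbs, shared memory, loopback
export UCX_RNDV_THRESH=8192  # switch to the rendezvous protocol above 8 KiB

./my_hpc_app                 # placeholder for a UCX-backed application
```

Lowering the rendezvous threshold trades extra handshakes for zero-copy transfers sooner; the right value is usually found empirically per fabric.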

Implementations and Integrations

UCX serves as the foundation for several higher-level projects and vendor stacks. It is the transport layer for MPI deployments such as Open MPI's UCX PML and the Cray ecosystem. UCX backends power storage and I/O middleware, including Ceph and Lustre integrations, and act as a communication substrate for machine-learning frameworks through projects such as NCCL adapters and the Horovod library. Cloud and container ecosystems incorporate UCX via Docker images and orchestration with Kubernetes using device plugins for RDMA. Vendor collaborations produce optimized builds for Intel Xeon processors, IBM Power servers, and NVIDIA DGX platforms.
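As an illustration of the Open MPI integration, UCX is selected through MCA parameters at launch. This is a hedged sketch: `--mca pml ucx` and `--mca osc ucx` are standard Open MPI component selectors, while the process count and binary name are placeholders.

```shell
# Select UCX for point-to-point (pml) and one-sided (osc) communication
# in Open MPI; falls back to other components if UCX is unavailable
# unless selection is forced like this.
mpirun --mca pml ucx --mca osc ucx -np 4 ./my_mpi_app
```
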

Use Cases and Applications

UCX is applied across scientific simulation, data analytics, and AI training workloads. It accelerates molecular dynamics engines such as GROMACS and NAMD, climate modeling packages like WRF, and computational chemistry suites used at national laboratories. In machine learning, UCX reduces gradient exchange latency for large-scale training in PyTorch and TensorFlow clusters, enabling efficient synchronous SGD and model parallelism at the scale used by industry labs. Storage systems use UCX for metadata and data movement in high-throughput scenarios in research archives and enterprise HPC storage deployments. UCX also appears in middleware for federated computing projects and cross-datacenter replication employed by cloud providers.

Development and Community

UCX development is hosted in public repositories with contributions from national labs, commercial vendors, and academic groups. The community coordinates via mailing lists, issue trackers, and collaborative events aligned with conferences such as the SC Conference and ISC High Performance. Governance includes maintainers and a steering group with representatives from contributing organizations, and continuous-integration pipelines run on infrastructure such as GitLab CI and Jenkins. Documentation, tutorials, and performance guides are produced jointly by contributors from NVIDIA Corporation, Intel Corporation, Lawrence Berkeley National Laboratory, and supercomputing centers to support adoption across the HPC ecosystem.

Category:High-performance computing software