UCXX Operators#
Authors: Holoscan Team (NVIDIA)
Supported platforms: x86_64, aarch64
Language: C++
Last modified: March 2, 2026
Latest version: 1.0
Minimum Holoscan SDK version: 3.11.0
Tested Holoscan SDK versions: 3.11.0
Contribution metric: Level 2 - Trusted
Overview#
The UCXX operators provide high-performance, low-latency communication capabilities for Holoscan applications using the Unified Communication X (UCX) framework. These operators enable efficient data transfer between distributed Holoscan applications, making them ideal for multi-node deployments and distributed processing pipelines.
Components#
This group contains three components:
1. UcxxEndpoint (Resource)#
A Holoscan Resource that manages a UCXX endpoint for UCX communication. It handles connection establishment, either by listening for incoming connections or by connecting to a remote endpoint.
Parameters:
- hostname: The hostname or IP address for the connection
- port: The port number to listen on or connect to
- listen: Boolean flag indicating whether to listen for connections (server mode) or connect to a remote endpoint (client mode)
2. UcxxSenderOp (Operator)#
Sends tensor messages through a configured UcxxEndpoint using UCXX/UCX. The sender uses a two-phase protocol:
1) send a small CPU header containing tensor metadata, then
2) send the tensor payload from the tensor’s underlying pointer (CPU or GPU).
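The header phase can be pictured with a minimal sketch. The `TensorHeader` layout and the `pack_header`/`unpack_header` helpers below are hypothetical illustrations, not the operator's actual wire format; the source states only that the header is a small CPU buffer containing tensor metadata.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical metadata header (an assumption for illustration only).
struct TensorHeader {
  uint32_t rank = 0;               // number of dimensions in use
  uint32_t bytes_per_element = 0;  // element size in bytes
  uint64_t shape[8] = {};          // extents; only the first `rank` are meaningful
};

// Phase 1 on the sender: copy the metadata into a small host buffer that
// would be sent before the payload.
inline std::vector<unsigned char> pack_header(const TensorHeader& header) {
  std::vector<unsigned char> bytes(sizeof(TensorHeader));
  std::memcpy(bytes.data(), &header, sizeof(TensorHeader));
  return bytes;
}

// Phase 1 on the receiver: recover the metadata so the payload that follows
// can be interpreted as a tensor.
inline TensorHeader unpack_header(const std::vector<unsigned char>& bytes) {
  assert(bytes.size() == sizeof(TensorHeader));
  TensorHeader header;
  std::memcpy(&header, bytes.data(), sizeof(TensorHeader));
  return header;
}
```

Keeping the header small and on the host means the receiver can inspect shape and element size without touching the (possibly GPU-resident) payload.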
Parameters:
- endpoint: Shared pointer to a UcxxEndpoint resource
- tag: Base message tag for identifying message types (uint64_t). Note: this operator consumes two tags: tag (header) and tag+1 (payload).
- blocking: If true, the operator does not execute until the endpoint is connected. If false (default), it drains inputs and drops sends while disconnected.
- max_in_flight: Maximum number of in-flight async send requests to retain (default: 1). When exceeded, new inputs are dropped to bound memory retention if the network/receiver stalls.
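Because each operator consumes both tag and tag+1, base tags for independent streams on the same endpoint should differ by at least 2. The helpers below (`TagPair`, `tags_for`, `tags_collide`) are hypothetical names used only to illustrate that arithmetic; they are not part of the operator API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// The two tags one operator occupies: `tag` (header) and `tag + 1` (payload).
struct TagPair {
  uint64_t header;
  uint64_t payload;
};

inline TagPair tags_for(uint64_t base_tag) {
  // base_tag + 1 must not wrap around.
  assert(base_tag < std::numeric_limits<uint64_t>::max());
  return TagPair{base_tag, base_tag + 1};
}

// True when two base tags would share a header or payload tag,
// i.e. when they are equal or adjacent.
inline bool tags_collide(uint64_t base_a, uint64_t base_b) {
  const uint64_t diff = base_a > base_b ? base_a - base_b : base_b - base_a;
  return diff <= 1;
}
```

For example, base tags 1 and 2 collide (the payload tag of the first equals the header tag of the second), while 1 and 3 do not.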
Async send lifetime and backpressure behavior:
- Sends are asynchronous. Any buffers passed to UCX must remain valid until the corresponding UCX request completes.
- The sender retains a keepalive handle to the input entity (and any temporary tensor wrapper) until both header and payload requests complete, preventing pooled buffers from being recycled while UCX is still reading them.
- On disconnect, the sender requests cancellation of any in-flight sends but retains keepalive state until UCX reports completion. While disconnected, new inputs are dropped when blocking is false.
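The keepalive and max_in_flight behavior described above can be modeled with a small bookkeeping sketch. `InFlightSend` and `InFlightTracker` are hypothetical types, and the completion flags stand in for UCX request completion; this is not the operator's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

// One in-flight message: a keepalive handle plus completion flags for the
// header and payload requests.
struct InFlightSend {
  std::shared_ptr<void> keepalive;  // keeps the input buffer alive
  bool header_done = false;
  bool payload_done = false;
  bool completed() const { return header_done && payload_done; }
};

class InFlightTracker {
 public:
  explicit InFlightTracker(std::size_t max_in_flight) : max_in_flight_(max_in_flight) {
    assert(max_in_flight_ > 0);
  }

  // Returns nullptr when the limit is reached, meaning the input is dropped.
  InFlightSend* try_admit(std::shared_ptr<void> keepalive) {
    reap();
    if (sends_.size() >= max_in_flight_) return nullptr;
    sends_.emplace_back();
    sends_.back().keepalive = std::move(keepalive);
    return &sends_.back();  // std::list keeps element addresses stable
  }

  // Drop keepalive handles only for sends whose header AND payload completed.
  void reap() {
    sends_.remove_if([](const InFlightSend& s) { return s.completed(); });
  }

  std::size_t size() const { return sends_.size(); }

 private:
  std::size_t max_in_flight_;
  std::list<InFlightSend> sends_;
};
```

A cancelled send would be treated the same way in this model: its entry, and therefore its buffer, stays alive until both flags are set.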
Zero-copy and transport selection (UCX-managed):
- The operator itself does not copy the payload into a staging buffer; it hands UCX the original CPU/GPU pointer. Whether the transfer is truly “zero-copy” end-to-end depends on UCX’s selected protocol and transports.
- UCX may choose eager vs rendezvous and may use GPU-aware transports when available (for example, same-node CUDA IPC, or GPUDirect RDMA on capable systems), but it may also internally stage/copy depending on configuration, message size, and transport support.
3. UcxxReceiverOp (Operator)#
Receives messages through a configured UcxxEndpoint. This operator listens for incoming messages, deserializes them, and outputs them to downstream operators.
Parameters:
- endpoint: Shared pointer to a UcxxEndpoint resource
- tag: Base message tag for filtering received messages (uint64_t). Note: this operator consumes two tags: tag (header) and tag+1 (payload).
- buffer_size: Tensor payload buffer size in bytes (required)
- receive_on_device: Allocate the payload buffer on device (GPU) if true, host (CPU) if false (default: true)
- allocator: Allocator used for the receive buffer allocation
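When choosing buffer_size, it helps to compute the payload size of the largest tensor you expect. The sketch below assumes dense tensors and assumes that the receive buffer must be at least as large as the incoming payload; `tensor_bytes` and `fits_in_buffer` are hypothetical helpers, not part of the operator API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bytes needed for a dense tensor: product of extents times element size.
// Overflow is not checked in this sketch.
inline uint64_t tensor_bytes(const std::vector<uint64_t>& shape, uint64_t bytes_per_element) {
  assert(bytes_per_element > 0);
  uint64_t total = bytes_per_element;
  for (uint64_t extent : shape) total *= extent;
  return total;
}

// True when a tensor of the given shape fits in a receive buffer of
// `buffer_size` bytes.
inline bool fits_in_buffer(const std::vector<uint64_t>& shape, uint64_t bytes_per_element,
                           uint64_t buffer_size) {
  return tensor_bytes(shape, bytes_per_element) <= buffer_size;
}
```

For instance, a 1080 x 1920 x 3 tensor of 8-bit values needs 6,220,800 bytes, which is larger than a 1 MB (1,048,576-byte) buffer.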
Async receive behavior:
- The receiver posts two receives in parallel: one for the CPU header (tensor metadata) and one for the tensor payload.
- The receiver allocates a payload buffer (GPU or CPU) and receives into it; it then wraps that buffer into an output tensor and releases it when downstream is done.
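The "release when downstream is done" behavior can be illustrated with shared ownership and a custom deleter. This is a host-only sketch under stated assumptions: `wrap_payload` and the `on_release` callback are hypothetical, and the real operator works with Holoscan tensors and its configured allocator rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Wrap a received payload so that it is released (here: reported to
// `on_release` and freed) only when the last downstream holder drops it.
inline std::shared_ptr<Bytes> wrap_payload(Bytes payload,
                                           std::function<void(std::size_t)> on_release) {
  assert(on_release);
  Bytes* raw = new Bytes(std::move(payload));
  return std::shared_ptr<Bytes>(raw, [on_release = std::move(on_release)](Bytes* p) {
    on_release(p->size());  // stand-in for returning the buffer to an allocator
    delete p;
  });
}
```

The receiver can drop its own handle immediately after emitting; the buffer stays valid for as long as any downstream operator still holds a copy of the handle.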
Key Features#
- High Performance: Leverages UCX for optimized network communication, including RDMA-capable transports such as InfiniBand where available
- Low Latency: Efficient zero-copy message transfers where possible
- Flexible Topology: Supports both client/server and peer-to-peer communication patterns
- Message Serialization: Serializes tensors using an approach based on NVIDIA GXF/Holoscan serialization
- Asynchronous Operations: Non-blocking send and receive operations for better pipeline performance
- Cross-Platform: Supports both x86_64 and aarch64 architectures
Use Cases#
- Distributing Holoscan pipelines across multiple nodes
- Separating sensor acquisition from processing workloads
- Building multi-GPU processing pipelines with inter-node communication
- Creating scalable, distributed AI inference pipelines
Requirements#
- Holoscan SDK: Version 3.11.0 or higher
- UCXX Library: UCX C++ bindings
- Platforms: x86_64, aarch64
- Dependencies: UCX (Unified Communication X) framework
Example Configuration#
// Typically placed inside your application's compose() method.

// Create endpoint resource (server mode: listen on all interfaces)
auto endpoint = make_resource<UcxxEndpoint>(
    "ucxx_endpoint",
    Arg("hostname", "0.0.0.0"),
    Arg("port", 12345),
    Arg("listen", true)
);

// Sender operator: uses tag 1 for the header and tag 2 for the payload
auto sender = make_operator<UcxxSenderOp>(
    "sender",
    Arg("endpoint", endpoint),
    Arg("tag", 1UL)
);

// Receiver operator: base tag must match the remote sender's base tag
auto receiver = make_operator<UcxxReceiverOp>(
    "receiver",
    Arg("endpoint", endpoint),
    Arg("tag", 1UL),
    Arg("buffer_size", 1024 * 1024)  // 1 MB payload buffer
);
Notes#
- Ensure that the UCX library is properly installed and configured on your system
- Network connectivity must be established between nodes before communication can occur
- Message tags must match between sender and receiver pairs
- The endpoint should be initialized before the sender and receiver operators
- Consider firewall rules and network security when deploying distributed applications