UCXX Operators#

Authors: Holoscan Team (NVIDIA)
Supported platforms: x86_64, aarch64
Language: C++
Last modified: March 2, 2026
Latest version: 1.0
Minimum Holoscan SDK version: 3.11.0
Tested Holoscan SDK versions: 3.11.0
Contribution metric: Level 2 - Trusted

Overview#

The UCXX operators provide high-performance, low-latency communication capabilities for Holoscan applications using the Unified Communication X (UCX) framework. These operators enable efficient data transfer between distributed Holoscan applications, making them ideal for multi-node deployments and distributed processing pipelines.

Components#

This component group includes three key components:

1. UcxxEndpoint (Resource)#

A Holoscan Resource that manages a UCXX endpoint for UCX communication. It handles connection establishment, either by listening for incoming connections or by connecting to a remote endpoint.

Parameters:

  • hostname: The hostname or IP address for the connection
  • port: The port number to listen on or connect to
  • listen: If true, listen for incoming connections (server mode); if false, connect to a remote endpoint (client mode)
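For reference, a client-mode counterpart to the server-mode example later in this document might look like the following; the hostname here is a placeholder for the listening peer's address:

// Client mode: connect to a remote listener instead of listening locally.
auto endpoint = make_resource<UcxxEndpoint>(
  "ucxx_endpoint",
  Arg("hostname", "192.0.2.10"),  // placeholder: address of the listening peer
  Arg("port", 12345),
  Arg("listen", false)            // client mode: initiate the connection
);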

2. UcxxSenderOp (Operator)#

Sends tensor messages through a configured UcxxEndpoint using UCXX/UCX. The sender uses a two-phase protocol: 1) send a small CPU header containing tensor metadata, then 2) send the tensor payload from the tensor’s underlying pointer (CPU or GPU).

Parameters:

  • endpoint: Shared pointer to a UcxxEndpoint resource
  • tag: Base message tag for identifying message types (uint64_t). Note: this operator consumes two tags: tag (header) and tag + 1 (payload).
  • blocking: If true, the operator does not execute until the endpoint is connected. If false (default), it drains inputs and drops sends while disconnected.
  • max_in_flight: Maximum number of in-flight async send requests to retain (default: 1). When exceeded, new inputs are dropped to bound memory retention if the network or receiver stalls.

Async send lifetime and backpressure behavior:

  • Sends are asynchronous. Any buffers passed to UCX must remain valid until the corresponding UCX request completes.
  • The sender retains a keepalive handle to the input entity (and any temporary tensor wrapper) until both the header and payload requests complete, preventing pooled buffers from being recycled while UCX is still reading them.
  • On disconnect, the sender requests cancellation of any in-flight sends but retains keepalive state until UCX reports completion. While disconnected, new inputs are dropped when blocking is false.

Zero-copy and transport selection (UCX-managed):

  • The operator itself does not copy the payload into a staging buffer; it hands UCX the original CPU/GPU pointer. Whether the transfer is truly “zero-copy” end-to-end depends on UCX’s selected protocol and transports.
  • UCX may choose eager vs. rendezvous protocols and may use GPU-aware transports when available (for example, same-node CUDA IPC, or GPUDirect RDMA on capable systems), but it may also stage or copy internally depending on configuration, message size, and transport support.

3. UcxxReceiverOp (Operator)#

Receives messages through a configured UcxxEndpoint. This operator listens for incoming messages, deserializes them, and outputs them to downstream operators.

Parameters:

  • endpoint: Shared pointer to a UcxxEndpoint resource
  • tag: Base message tag for filtering received messages (uint64_t). Note: this operator consumes two tags: tag (header) and tag + 1 (payload).
  • buffer_size: Tensor payload buffer size in bytes (required)
  • receive_on_device: Allocate the payload buffer on the device (GPU) if true, on the host (CPU) if false (default: true)
  • allocator: Allocator used for the receive buffer allocation

Async receive behavior:

  • The receiver posts two receives in parallel: one for the CPU header (tensor metadata) and one for the tensor payload.
  • The receiver allocates a payload buffer (GPU or CPU) and receives into it; it then wraps that buffer into an output tensor and releases it when downstream processing is done.

Key Features#

  • High Performance: Leverages UCX for optimized network communication, including RDMA transports such as InfiniBand and RoCE
  • Low Latency: Efficient zero-copy message transfers where possible
  • Flexible Topology: Supports both client/server and peer-to-peer communication patterns
  • Message Serialization: Uses tensor serialization based on the NVIDIA GXF/Holoscan approach for efficient message encoding
  • Asynchronous Operations: Non-blocking send and receive operations for better pipeline performance
  • Cross-Platform: Supports both x86_64 and aarch64 architectures

Use Cases#

  • Distributing Holoscan pipelines across multiple nodes
  • Separating sensor acquisition from processing workloads
  • Building multi-GPU processing pipelines with inter-node communication
  • Creating scalable, distributed AI inference pipelines

Requirements#

  • Holoscan SDK: Version 3.11.0 or higher
  • UCXX Library: UCX C++ bindings
  • Platforms: x86_64, aarch64
  • Dependencies: UCX (Unified Communication X) framework

Example Configuration#

// Create endpoint resource (server mode)
auto endpoint = make_resource<UcxxEndpoint>(
  "ucxx_endpoint",
  Arg("hostname", "0.0.0.0"),
  Arg("port", 12345),
  Arg("listen", true)
);

// Sender operator
auto sender = make_operator<UcxxSenderOp>(
  "sender",
  Arg("endpoint", endpoint),
  Arg("tag", 1UL)
);

// Receiver operator
auto receiver = make_operator<UcxxReceiverOp>(
  "receiver",
  Arg("endpoint", endpoint),
  Arg("tag", 1UL),
  Arg("buffer_size", 1024 * 1024)  // 1MB buffer
);
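In a typical deployment, the sender runs in one application (or fragment) and the receiver in another, each wired to local operators with add_flow. The sketch below shows how the pieces above might be connected; the source/sink operators and the port names ("out", "in") are illustrative assumptions, not part of the component definitions:

// Sketch only: wiring inside a compose() method. "source" and "sink" are
// hypothetical upstream/downstream operators; port names are assumptions.
// Application A (server node): produce tensors and send them.
add_flow(source, sender, {{"out", "in"}});
// Application B (client node): receive tensors and consume them.
add_flow(receiver, sink, {{"out", "in"}});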

Notes#

  • Ensure that the UCX library is properly installed and configured on your system
  • Network connectivity must be established between nodes before communication can occur
  • Message tags must match between sender and receiver pairs
  • The endpoint should be initialized before the sender and receiver operators
  • Consider firewall rules and network security when deploying distributed applications