CUDA MPS Tutorial for Holoscan Applications¶
Authors: Holoscan Team (NVIDIA)
Supported platforms: x86_64, aarch64
Last modified: March 18, 2025
Latest version: 0.1.0
Minimum Holoscan SDK version: 0.6.0
Tested Holoscan SDK versions: 0.6.0
Contribution metric: Level 1 - Highly Reliable
CUDA MPS is NVIDIA's Multi-Process Service for CUDA applications. It allows multiple CUDA applications to share a single GPU, which is useful for running more than one Holoscan application on a single machine with one or more GPUs. This tutorial describes the steps to enable CUDA MPS and demonstrates some of the performance benefits of using it.
Steps to enable CUDA MPS¶
Before enabling CUDA MPS, please check whether your system supports CUDA MPS.
CUDA MPS can be enabled by running the `nvidia-cuda-mps-control -d` command and stopped by running the `echo quit | nvidia-cuda-mps-control` command. More control commands are described here.
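For reference, a minimal start/verify/stop sequence might look like the sketch below. The `get_server_list` query is a standard MPS control command; a server only appears in its output once a client application has connected.

```bash
# Start the MPS control daemon in the background.
nvidia-cuda-mps-control -d

# Confirm that the control daemon process is running.
pgrep -af nvidia-cuda-mps-control

# List MPS servers (a server is spawned lazily when the first client connects).
echo get_server_list | nvidia-cuda-mps-control

# Stop MPS when finished.
echo quit | nvidia-cuda-mps-control
```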
CUDA MPS does not require any changes to an existing Holoscan application; even an already compiled application binary works as-is, with no changes to its source code or binary. However, a machine learning model such as a TensorRT engine file may need to be regenerated the first time the application runs after enabling CUDA MPS.
We have included a helper script, `start_mps_daemon.sh`, in this tutorial to enable CUDA MPS with the necessary environment variables:

```bash
./start_mps_daemon.sh
```
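For reference, a script of this kind usually only needs to export the MPS pipe and log directories and start the daemon. The sketch below is an assumption about its general shape, not the actual contents of `start_mps_daemon.sh`:

```bash
#!/bin/bash
# Hypothetical sketch of an MPS start-up script; see start_mps_daemon.sh for the real one.

# Directories used by MPS clients to reach the control daemon, and for its logs.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Optional first argument: system-wide active thread percentage for MPS clients.
if [ -n "$1" ]; then
    export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE="$1"
fi

# Launch the MPS control daemon in the background.
nvidia-cuda-mps-control -d
```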
Customization¶
CUDA MPS provides many options to customize resource allocation for MPS clients. For example, it has an option to limit the maximum number of GPU threads that can be used by every MPS client. The `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable can be used to control this limit system-wide. This limit can also be configured by communicating the active thread percentage to the control daemon:

```bash
echo "set_default_active_thread_percentage <Thread Percentage>" | nvidia-cuda-mps-control
```

Our `start_mps_daemon.sh` script also takes this percentage as its first argument:

```bash
./start_mps_daemon.sh <Active Thread Percentage>
```
For different applications, one may want to set different limits on the number of GPU threads available to each of them. This can be done by setting the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable separately for each application, as elaborated in more detail here.
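As a minimal sketch (the application binaries `my_holoscan_app_a` and `my_holoscan_app_b` are placeholders), two concurrently launched applications could be given different limits like this:

```bash
# Allow the first application up to 40% of the GPU's threads.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40 ./my_holoscan_app_a &

# Allow the second application up to 20%.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 ./my_holoscan_app_b &

# Wait for both applications to finish.
wait
```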
There are other customizations available in CUDA MPS as well; please refer to the CUDA MPS documentation to learn more about them. Please note that concurrently running Holoscan applications may increase the GPU device memory footprint. Therefore, one needs to be careful not to exceed the available GPU memory and to watch for potential delays due to page faults.
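If you want to keep an eye on the device memory footprint while multiple applications run, one option (a general suggestion, not a step required by this tutorial) is to poll `nvidia-smi`:

```bash
# Report used and total GPU device memory once per second.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1
```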
CUDA MPS improves the performance of concurrently running Holoscan applications. Since multiple applications can execute more than one CUDA compute task simultaneously with CUDA MPS, it can also improve overall GPU utilization.
Performance Benefits on x86 System¶
Note: Endoscopy Tool Tracking does not work with CUDA MPS after holohub/Holoscan-SDK-v2.6.0 because the CUDA dynamic parallelism introduced in this PR is not supported by CUDA MPS. If endoscopy tool tracking needs to be tested with CUDA MPS, please use the `holoscan-sdk-v2.6.0` tag or earlier.
Suppose we want to run the endoscopy tool tracking and ultrasound segmentation applications concurrently on an x86 workstation with an RTX A6000 GPU. The table below shows the maximum end-to-end latency without and with CUDA MPS, where the active thread percentage is set to 40% for each application. It demonstrates an 18% and 50% improvement in the maximum end-to-end latency for the endoscopy tool tracking and ultrasound segmentation applications, respectively.
Application | Maximum E2E Latency without MPS (ms) | Maximum E2E Latency with MPS (ms) |
---|---|---|
Endoscopy Tool Tracking | 115.38 | 94.20 |
Ultrasound Segmentation | 121.48 | 60.94 |
In another set of experiments, we concurrently run multiple instances of the endoscopy tool tracking application in different processes, setting the active thread percentage to 20% for each MPS client. The graph below shows the maximum end-to-end latency with and without CUDA MPS. The experiment demonstrates up to a 36% improvement with CUDA MPS.
Such experiments can easily be conducted with Holoscan Flow Benchmarking to retrieve various end-to-end latency performance metrics.
IGX Orin¶
CUDA MPS is available on IGX Orin since CUDA 12.5. Please check your CUDA version and upgrade to CUDA 12.5+ to test CUDA MPS. We evaluate the benefits of MPS on IGX Orin with both discrete and integrated GPUs. Please follow the steps outlined in Steps to enable CUDA MPS to start the MPS server on IGX Orin.
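A quick way to confirm the installed CUDA toolkit version (assuming `nvcc` is on your `PATH`) is:

```bash
# Print the CUDA toolkit version; it should report 12.5 or newer.
nvcc --version
```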
We use the model benchmarking application to demonstrate the benefits of CUDA MPS. In general, MPS improves performance by enabling multiple concurrent processes to share a CUDA context and scheduling resources. We show the benefits of using CUDA MPS along two dimensions: (a) increasing the workload per application instance (varying the number of parallel inferences for the same model) and (b) increasing the total number of instances.
Model Benchmarking Application Setup¶
Please follow the steps outlined in model benchmarking to ensure that the application builds and runs properly.
Note that you need to run the video using v4l2loopback in a separate terminal while running the model benchmarking application. Make sure to change the device path in the `model_benchmarking/python/model_benchmarking.yaml` file to match the values you provided in the `modprobe` command when following the v4l2loopback instructions.
Performance Benchmark Setup¶
To gather performance metrics for the model benchmarking application, follow the steps outlined in Holoscan Flow Benchmarking.
If you are running within a container, please complete Step-3 before launching the container.
We use the following steps:
1. Patch Application:

```bash
./benchmarks/holoscan_flow_benchmarking/patch_application.sh model_benchmarking
```

2. Build Application for Benchmarking:

```bash
./run build model_benchmarking python --configure-args -DCMAKE_CXX_FLAGS=-I$PWD/benchmarks/holoscan_flow_benchmarking
```
3. Set Up V4l2Loopback Devices:
i. Install `v4l2loopback` and `ffmpeg`:

```bash
sudo apt-get install v4l2loopback-dkms ffmpeg
```
ii. Determine the number of instances you would like to benchmark and set that as the value of `devices`. Then, load the `v4l2loopback` kernel module on virtual devices `/dev/video[*]`. This enables each instance to get its input from a separate virtual device.

Example: For 3 instances, the `v4l2loopback` kernel module can be loaded on `/dev/video1`, `/dev/video2` and `/dev/video3`:

```bash
sudo modprobe v4l2loopback devices=3 video_nr=1 max_buffers=4
```
Now open 3 separate terminals (a scripted alternative is sketched after these steps).

In terminal-1, run:

```bash
ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video1
```

In terminal-2, run:

```bash
ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video2
```

In terminal-3, run:

```bash
ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi -pix_fmt yuyv422 -f v4l2 /dev/video3
```
4. Benchmark Application:

```bash
python benchmarks/holoscan_flow_benchmarking/benchmark.py --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l <number of parallel inferences> -i" --language python -i <number of instances> -r <number of runs> -m <number of messages> --sched greedy -d <outputs folder> -u
```
The command executes `<number of runs>` runs of `<number of instances>` instances of the model benchmarking application with `<number of messages>` messages. Each instance runs `<number of parallel inferences>` parallel model benchmarking inferences with no post-processing or visualization (`-i`).
Please refer to Model benchmarking options and Holoscan flow benchmarking options for more information on the various command options.
Example: After Step-3, to benchmark 3 instances for 10 runs with 1000 messages, run:

```bash
python benchmarks/holoscan_flow_benchmarking/benchmark.py --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" --language python -i 3 -r 10 -m 1000 --sched greedy -d myoutputs -u
```
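As mentioned in Step-3, the three terminals can be replaced by a single script. The sketch below is one possible way to do it, reusing the exact `ffmpeg` command from above; adjust the device numbers to match your `modprobe` call.

```bash
#!/bin/bash
# Check that the loopback devices created by modprobe exist.
ls /dev/video*

# Feed each virtual device with the looping ultrasound video in the background.
for n in 1 2 3; do
    ffmpeg -stream_loop -1 -re -i /data/ultrasound_segmentation/ultrasound_256x256.avi \
        -pix_fmt yuyv422 -f v4l2 /dev/video${n} > /dev/null 2>&1 &
done

# When done, stop all background ffmpeg streams, e.g.:
# pkill -f "ffmpeg -stream_loop"
```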
Performance Benefits on IGX Orin w/ Discrete GPU¶
We look at the performance benefits of MPS by varying the number of instances and the number of inferences. We use the RTX A6000 GPU for our experiments. From our experiments, we observe that enabling MPS results in up to a 12% improvement in maximum latency compared to the default setting.
Varying Number of Instances¶
We fix the number of parallel inferences to 7, the number of runs to 10, and the number of messages to 1000, and vary the number of instances from 3 to 7 using the `-i` parameter. Please refer to Performance Benchmark Setup for benchmarking commands.
The graph below shows the maximum end-to-end latency of the model benchmarking application with and without CUDA MPS, where the active thread percentage was set to `80/(number of instances)`. For example, for 5 instances, we set the active thread percentage to 80/5 = 16. By provisioning resources this way, we leave some resources idle in case a client needs to use them. Please refer to CUDA MPS Resource Provisioning for more details.
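A sweep over instance counts could be scripted as sketched below; the output directory names are placeholders, and restarting the MPS daemon for each configuration is one simple way to apply the new thread percentage (an assumption, not a step prescribed by this tutorial).

```bash
#!/bin/bash
# Sweep the number of instances, provisioning ~80% of the GPU threads across them.
for instances in 3 4 5 6 7; do
    pct=$((80 / instances))

    # Restart MPS with the per-client thread limit for this configuration.
    echo quit | nvidia-cuda-mps-control 2>/dev/null || true
    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=${pct} nvidia-cuda-mps-control -d

    # Same benchmark command as above, with a per-configuration output folder.
    python benchmarks/holoscan_flow_benchmarking/benchmark.py \
        --run-command="python applications/model_benchmarking/python/model_benchmarking.py -l 7 -i" \
        --language python -i ${instances} -r 10 -m 1000 --sched greedy \
        -d outputs_mps_${instances} -u
done
```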
The graph is missing a bar for the case of 7 instances and 7 parallel inferences because we were unable to get the baseline to execute. However, we were able to run this configuration when MPS was enabled, highlighting the advantage of using MPS for large workloads. We see that the maximum end-to-end latency improves when MPS is enabled, and the improvement is more pronounced as the number of instances increases. This is because, as the number of concurrent processes increases, MPS confines each CUDA workload to a predefined set of SMs. MPS merges the CUDA contexts from multiple processes into one while running them concurrently, which reduces the number of context switches and the related interference, resulting in improved GPU utilization.
*Figure: Maximum end-to-end latency*
We also notice minor improvements in the 99.9th percentile latency and similar improvements in the 99th percentile latency.
*Figures: 99.9th percentile latency and 99th percentile latency*
Varying Number of Parallel Inferences¶
We vary the number of parallel inferences to show that MPS may not be beneficial if the workload is insufficient to offset the overhead of running the MPS server. The graph below shows the result of increasing the number of parallel inferences from 3 to 7 while the number of instances is held constant at 5. As the number of parallel inferences increases, so does the workload, and the benefit of MPS becomes more evident. However, when the workload is low, CUDA MPS may not be beneficial.
*Figure: Maximum latency for 5 instances*
IGX Orin w/ Integrated GPU¶
MPS Setup on IGX-iGPU¶
Note that we run all commands as `root`.
1. Please add CUDA 12.5+ to `$PATH` and `$LD_LIBRARY_PATH`. If you have multiple CUDA installations, check them in the `/usr/local/` directory.

```bash
echo $PATH
/usr/local/cuda-12.6/compat:/usr/local/cuda-12.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

echo $LD_LIBRARY_PATH
/usr/local/cuda-12.6/compat/lib:/usr/local/cuda-12.6/compat:/usr/local/cuda-12.6/lib64:
```
2. Be sure to pass `-v /tmp/nvidia-mps:/tmp/nvidia-mps -v /tmp/nvidia-log:/tmp/nvidia-log -v /usr/local/cuda-12.6:/usr/local/cuda-12.6` to the `./dev_container launch` command to ensure that the container is connected to the MPS control daemon and server.

Example:

```bash
./dev_container launch --img holohub:v2.1 --docker_opts "-v /tmp/nvidia-mps:/tmp/nvidia-mps -v /tmp/nvidia-log:/tmp/nvidia-log -v /usr/local/cuda-12.6:/usr/local/cuda-12.6"
```
3. Inside the container, be sure to set the following environment variables:

```bash
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
export PATH=/usr/local/cuda-12.6/bin:$PATH
export PATH=/usr/local/cuda-12.6/compat:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/compat:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/compat/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/nvidia:$LD_LIBRARY_PATH
```

The resulting `$PATH` and `$LD_LIBRARY_PATH` values inside the container are:

```bash
echo $PATH
/usr/local/cuda-12.6/bin:/opt/tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/nvidia/holoscan/bin

echo $LD_LIBRARY_PATH
/usr/local/cuda-12.6/compat/lib:/usr/local/cuda-12.6/compat:/usr/local/cuda-12.6/lib64:/usr/local/cuda/compat/lib.real:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/nvidia/holoscan/lib
```
4. Start the MPS control daemon and server (a quick verification sketch follows these steps):

```bash
sudo -i
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20
nvidia-cuda-mps-control -d
```

5. After steps 1-4, follow the benchmark instructions to benchmark the application.
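As an optional sanity check referenced in step 4 (not a required step), you can query the control daemon to confirm that it is running with the intended limit:

```bash
# Should report the 20% default limit set above.
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# Lists the PIDs of any MPS servers spawned for connected clients.
echo get_server_list | nvidia-cuda-mps-control
```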
Performance Benefits on IGX Orin w/ Integrated GPU¶
We look at the performance benefits of MPS by varying the number of application instances. We run the model benchmarking application in a mode where the inputs are always available, being read from disk with the video replayer operator. Every instance of the application runs 1 inference (`-l 1`) since the iGPU is a smaller GPU. In this experiment, we also oversubscribe the GPU to give the instances more opportunity to utilize the available SMs (the IGX Orin iGPU has 16 SMs). From our experiments, we observe that enabling MPS results in a 22-50%, 13-49% and 6-37% improvement in maximum latency, 99.9th percentile latency and average latency, respectively. The graphs below capture the results. On the X-axis, the number of instances increases from 2 to 7, and the number in parentheses shows the number of SMs per instance enabled by `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE`.
*Figure: Maximum end-to-end latency*
*Figures: 99.9th percentile latency and average latency*
We use different numbers of SMs for different instances to ensure that the total number of SMs requested by all the instances exceeds the number of available SMs.