📷🤖 Holoscan VILA Live
Authors: Holoscan Team (NVIDIA)
Supported platforms: x86_64, aarch64
Last modified: March 18, 2025
Language: Python
Latest version: 1.0.0
Minimum Holoscan SDK version: 2.0.0
Tested Holoscan SDK versions: 2.0.0
Contribution metric: Level 1 - Highly Reliable
This application demonstrates how to run VILA 1.5 models on a live video feed, with the ability to change the prompt in real time.
VILA 1.5 is a family of Vision Language Models (VLMs) created by NVIDIA and MIT. It uses SigLIP to encode images into tokens that are fed into an LLM along with an accompanying prompt. This application collects video frames from the V4L2 operator and feeds them to an AWQ-quantized VILA 1.5 model for inference using the TinyChat library, letting users interact with a generative AI model that is "watching" a chosen video stream in real time.
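For orientation, the sketch below shows how such a pipeline can be wired with the Holoscan Python API: a V4L2VideoCaptureOp emits frames on its "signal" port into a custom operator that would encode them and query the model. The VlmInferenceOp class here is an illustrative placeholder, not the application's actual operator; the real app also performs format conversion, TinyChat/AWQ inference, and serves a web UI.

```python
# Minimal sketch of a Holoscan pipeline in the spirit of this app; the VLM
# operator body is a placeholder, not the application's real implementation.
from holoscan.core import Application, Operator, OperatorSpec
from holoscan.operators import V4L2VideoCaptureOp
from holoscan.resources import UnboundedAllocator


class VlmInferenceOp(Operator):
    """Placeholder for the operator that sends frames and the prompt to VILA."""

    def setup(self, spec: OperatorSpec):
        spec.input("in")

    def compute(self, op_input, op_output, context):
        frame = op_input.receive("in")  # tensor map produced by the V4L2 source
        # ... encode the frame with SigLIP and query the AWQ-quantized LLM here ...


class VilaLiveSketch(Application):
    def compose(self):
        source = V4L2VideoCaptureOp(
            self,
            name="v4l2_source",
            allocator=UnboundedAllocator(self, name="pool"),
            device="/dev/video0",
        )
        vlm = VlmInferenceOp(self, name="vlm")
        # V4L2VideoCaptureOp publishes frames on its "signal" output port
        self.add_flow(source, vlm, {("signal", "in")})


if __name__ == "__main__":
    VilaLiveSketch().run()
```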
Note: This demo currently uses Llama-3-VILA1.5-8b-AWQ, but any of the following AWQ-quantized models from the VILA 1.5 family should work as long as the file names are changed in the Dockerfile and run_vila_live.sh:
- VILA1.5-3b-AWQ
- VILA1.5-3b-s2-AWQ
- Llama-3-VILA1.5-8b-AWQ
- VILA1.5-13b-AWQ
- VILA1.5-40b-AWQ
⚙️ Setup Instructions
The app defaults to using the video device at /dev/video0.
Note: You can use a USB webcam as the video source, or an MP4 video by following the instructions for the V4L2_Camera example app.
To check that this is the correct device, install v4l2-ctl (part of v4l-utils) and list the available devices:
sudo apt-get install v4l-utils
v4l2-ctl --list-devices
Example output:
NVIDIA Tegra Video Input Device (platform:tegra-camrtc-ca):
        /dev/media0

vi-output, lt6911uxc 2-0056 (platform:tegra-capture-vi:0):
        /dev/video0

Dummy video device (0x0000) (platform:v4l2loopback-000):
        /dev/video3
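If you prefer not to install v4l-utils, a rough Python equivalent of the listing above (purely illustrative, not part of this application) can read the device names the kernel exposes under /sys/class/video4linux:

```python
# List V4L2 device nodes and their kernel-reported names without v4l2-ctl.
from pathlib import Path

for node in sorted(Path("/dev").glob("video*")):
    name_file = Path("/sys/class/video4linux") / node.name / "name"
    name = name_file.read_text().strip() if name_file.exists() else "unknown"
    print(f"{node}: {name}")
```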
🚀 Build and Run Instructions
From the HoloHub main directory, run one of the following commands (use the second to replay the pre-recorded video instead of a live V4L2 device):
./dev_container build_and_run vila_live
./dev_container build_and_run vila_live --run_args "--source replayer"
Once the main LMM-based app is running, you will see a link for the app at http://127.0.0.1:8050.
To receive the video stream, please also ensure port 49000 is open.
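As a quick local sanity check (not part of the application), you can confirm that both endpoints accept connections; adjust the host if you connect from another machine:

```python
# Check that the web UI (8050) and video stream (49000) ports accept connections.
import socket

HOST = "127.0.0.1"  # replace with the app host if checking from another machine
for port in (8050, 49000):
    with socket.socket() as sock:
        sock.settimeout(2)
        status = "open" if sock.connect_ex((HOST, port)) == 0 else "closed"
        print(f"{HOST}:{port} is {status}")
```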
💻 Supported Hardware
- IGX w/ dGPU
- x86 w/ dGPU
- IGX w/ iGPU and Jetson AGX supported with workaround
There is a known issue running this application on IGX w/ iGPU and on Jetson AGX (see #500). The workaround is to update the libv4l2 symlink on the device so that it no longer picks up the libnvv4l2.so library:
cd /usr/lib/aarch64-linux-gnu/
ls -l libv4l2.so.0.0.999999
sudo rm libv4l2.so.0.0.999999
sudo ln -s libv4l2.so.0.0.0.0 libv4l2.so.0.0.999999
📷⚙️ Video Options
There are three options to ingest video data.
- Use a physical device or capture card, such as a V4L2 device, as described in the Setup Instructions. Make sure vila_live.yaml contains the v4l2_source group and specifies the device correctly (pixel_format may be tuned accordingly, e.g. pixel_format: "auto").
- Convert a video file to a GXF-compatible format using the convert_video_to_gxf_entities.py script; see the yolo_model_deployment application for a detailed example. When using the replayer, configure the replayer_source group in the YAML file and launch the application with the command below (see also the source-selection sketch after this list):
  ./run_vila_live.sh --source "replayer"
  Note: This application downloads a pre-recorded video from Pexels when the application is built. Please review the license terms from Pexels.
- Create a virtual video device that mounts a video file and replays it, as detailed in the v4l2_camera examples in the holoscan-sdk repository. This approach may require signing the v4l2loopback kernel module when using a system with secure boot enabled. Make sure vila_live.yaml contains the v4l2_source group and specifies the virtual device correctly, then replay the video using, for example:
  ffmpeg -stream_loop -1 -re -i <your_video_path> -pix_fmt yuyv422 -f v4l2 /dev/video3
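The v4l2_source and replayer_source options above map onto two different Holoscan source operators. The following sketch shows how a --source flag typically selects between them; the data directory, basename, and parameter values are assumptions, and this is not the application's actual code:

```python
# Hedged sketch: choose between a live V4L2 source and the GXF video replayer.
from argparse import ArgumentParser

from holoscan.core import Application
from holoscan.operators import V4L2VideoCaptureOp, VideoStreamReplayerOp
from holoscan.resources import UnboundedAllocator


class SourceSelectionSketch(Application):
    def __init__(self, source="v4l2"):
        super().__init__()
        self.source = source

    def compose(self):
        if self.source == "replayer":
            src = VideoStreamReplayerOp(
                self,
                name="replayer_source",
                directory="/workspace/holohub/data/vila_live",  # assumed location of the GXF entities
                basename="video",  # assumed basename of the converted video
                frame_rate=30,
                repeat=True,
            )
        else:
            src = V4L2VideoCaptureOp(
                self,
                name="v4l2_source",
                allocator=UnboundedAllocator(self, name="pool"),
                device="/dev/video0",
                pixel_format="auto",
            )
        # Downstream operators (format conversion, VLM inference, web UI) omitted.
        self.add_operator(src)


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--source", default="v4l2", choices=["v4l2", "replayer"])
    args = parser.parse_args()
    SourceSelectionSketch(source=args.source).run()
```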
🙌 Acknowledgements
- Jetson AI Lab, Live LLaVA: for the inspiration to create this app
- Jetson-Containers repo: for the Flask web app with WebSockets
- LLM-AWQ repo: for the example code to create AWQ-powered LLM servers