HoloChat #

Authors: Nigel Nelson (NVIDIA)
Supported platforms: x86_64, aarch64
Language: Python
Last modified: May 13, 2025
Latest version: 0.2.0
Minimum Holoscan SDK version: 2.0.0
Tested Holoscan SDK versions: 2.0.0
Contribution metric: Level 4 - Experimental

Table of Contents#

Hardware Requirements
Run Instructions
Intended Use
Known Limitations
Best Practices

HoloChat is an AI-driven chatbot, built on top of a locally hosted Code-Llama model OR a remote NIM API for Llama-3-70b, which acts as developer's copilot in Holoscan development. The LLM leverages a vector database comprised of the Holoscan SDK repository and user guide, enabling HoloChat to answer general questions about Holoscan, as well act as a Holoscan SDK coding assistant.

Hardware Requirements: 👉💻#

Processor: x86/Arm64

If running local LLM: - GPU: NVIDIA dGPU w/ >= 28 GB VRAM - Memory: >= 28 GB of available disk memory - Needed to download fine-tuned Code Llama 34B and BGE-Large embedding model

*Tested using NVIDIA IGX Orin w/ RTX A6000 and Dell Precision 5820 Workstation w/ RTX A6000

Running HoloChat: 🏃💨#

When running HoloChat, you have two LLM options: - Local: Uses Phind-CodeLlama-34B-v2 running on your local machine using Llama.cpp - Remote: Uses Llama-3-70b-Instruct using the NVIDIA NIM API

You can also run HoloChat in MCP mode: - MCP: Runs as a Model Context Protocol server that provides Holoscan documentation and code context to upstream LLMs like Claude

TLDR; 🥱#

To run locally:

./dev_container build_and_run holochat --run_args --local

To run using the NVIDIA NIM API:

echo "NVIDIA_API_KEY=<api_key_here>" > ./applications/holochat/.env

./dev_container build_and_run holochat

To run as an MCP server:

./dev_container build_and_run holochat --run_args --mcp

See MCP_MODE.md for more details on using MCP mode.

Build Notes: ⚙️#

Build Time: - HoloChat uses a PyTorch container from NGC and may also download the ~23 GB Phind LLM from HuggingFace. As such, the first time building this application will likely take ~45 minutes depending on your internet speeds. However, this is a one-time set-up and subsequent runs of HoloChat should take seconds to launch.

Build Location:

If running locally: HoloChat downloads ~28 GB of model data to the holochat/models directory. As such, it is recommended to only run this application on a disk drive with ample storage (ex: the 500 GB SSD included with NVIDIA IGX Orin).

Running Instructions:#

If connecting to your machine via SSH, be sure to forward the appropriate ports: - For chatbot UI: 7860 - For local LLM: 8080 - For MCP server: 8090

ssh <user_name>@<IP address> -L 7860:localhost:7860 -L 8080:localhost:8080 -L 8090:localhost:8090

Running w/ Local LLM 💻#

To build and start the app:

./dev_container build_and_run holochat --run_args --local

Once the LLM is loaded on the GPU and the Gradio app is running, HoloChat should be available at http://127.0.0.1:7860/.

Running w/ NIM API ☁️#

To use the NIM API you must create a .env file at:

./applications/holochat/.env

This is where you should place your NVIDIA API key.

NVIDIA_API_KEY=<api_key_here>

To build and run the app:

./dev_container build_and_run holochat

Once the Gradio app is running, HoloChat should be available at http://127.0.0.1:7860/.

Usage Notes: 🗒️#

Intended use: 🎯#

HoloChat is developed to accelerate and assist Holoscan developers’ learning and development. HoloChat serves as an intuitive chat interface, enabling users to pose natural language queries related to the Holoscan SDK. Whether seeking general information about the SDK or specific coding insights, users can obtain immediate responses thanks to the underlying Large Language Model (LLM) and vector database.

HoloChat is given access to the Holoscan SDK repository, the HoloHub repository, and the Holoscan SDK user guide. This essentially allows users to engage in natural language conversations with these documents, gaining instant access to the information they need, thus sparing them the task of sifting through vast amounts of documentation themselves.

Known Limitations: ⚠️🚧#

Before diving into how to make the most of HoloChat, it's crucial to understand and acknowledge its known limitations. These limitations can guide you in adopting the best practices below, which will help you navigate and mitigate these issues effectively. * Hallucinations: Occasionally, HoloChat may provide responses that are not entirely accurate. It's advisable to approach answers with a healthy degree of skepticism. * Memory Loss: LLM's limited attention window may lead to the loss of previous conversation history. To mitigate this, consider restarting the application to clear the chat history when necessary. * Limited Support for Stack Traces: HoloChat's knowledge is based on the Holoscan repository and the user guide, which lack large collections of stack trace data. Consequently, HoloChat may face challenges when assisting with stack traces.

Best Practices: ✅👍#

While users should be aware of the above limitations, following the recommended tips will drastically minimize these possible shortcomings. In general, the more detailed and precise a question is, the better the results will be. Some best practices when asking questions are: * Be Verbose: If you want to create an application, specify which operators should be used if possible (HolovizOp, V4L2VideoCaptureOp, InferenceOp, etc.). * Be Specific: The less open-ended a question is the less likely the model will hallucinate. * Specify Programming Language: If asking for code, include the desired language (Python or C++). * Provide Code Snippets: If debugging errors include as much relevant information as possible. Copy and paste the code snippet that produces the error, the abbreviated stack trace, and describe any changes that may have introduced the error.

In order to demonstrate how to get the most out of HoloChat two example questions are posed below. These examples illustrate how a user can refine their questions and as a result, improve the responses they receive:

Worst👎: “Create an app that predicts the labels associated with a video”

Better👌: “Create a Python app that takes video input and sends it through a model for inference.”

Best🙌: “Create a Python Holoscan application that receives streaming video input, and passes that video input into a pytorch classification model for inference. Then, collect the model’s predicted class and use Holoviz to display the class label on each video frame.”

Worst👎: “What os can I use?”

Better👌: “What operating system can I use with Holoscan?”

Best🙌: “Can I use MacOS with the Holoscan SDK?”

Appendix:#

Meta Terms of Use:#

By using the Code-Llama model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta’s privacy policy.

Implementation Details:#

HoloChat operates by taking user input and comparing it to the text stored within the vector database, which is comprised of Holoscan SDK information. The most relevant text segments from SDK code and the user guide are then appended to the user's query. This approach allows the chosen LLM to answer questions about the Holoscan SDK, without being explicitly trained on SDK data.

However, there is a drawback to this method - the most relevant documentation is not always found within the vector database. Since the user's question serves as the search query, queries that are too simplistic or abbreviated may fail to extract the most relevant documents from the vector database. As a consequence, the LLM will then lack the necessary context, leading to poor and potentially inaccurate responses. This occurs because LLMs strive to provide the most probable response to a question, and without adequate context, they hallucinate to fill in these knowledge gaps.

HoloChat#