# LLM Deployment using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

## Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents

- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#tensorrt-llm-quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support)
- [Logits Processing](#logits-processing)
- [Performance Sweep](#performance-sweep)

## Feature Support Matrix

### Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](/dynamo/v-0-9-0/design-docs/disaggregated-serving) | ✅ | |
| [**Conditional Disaggregation**](/dynamo/v-0-9-0/design-docs/disaggregated-serving) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](/dynamo/v-0-9-0/components/router) | ✅ | |
| [**SLA-Based Planner**](/dynamo/v-0-9-0/components/planner/planner-guide) | ✅ | |
| [**Load Based Planner**](/dynamo/v-0-9-0/components/planner) | 🚧 | Planned |
| [**KVBM**](/dynamo/v-0-9-0/components/kvbm) | ✅ | |

### Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-------|
| **WideEP** | ✅ | |
| **GB200 Support** | ✅ | |

## TensorRT-LLM Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start Infrastructure Services (Local Development Only)

For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` for file-system-based discovery.
> - **NATS** is optional and only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag to use prediction-based routing.
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
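If you want to sanity-check these services before launching any workers, the commands below are one way to do it. They assume the default etcd client port (2379) and the default NATS monitoring port (8222); adjust them if your compose file maps different ports or does not expose NATS monitoring.

```bash
# List the services started by Docker Compose.
docker compose -f deploy/docker-compose.yml ps

# etcd health check (default client port 2379).
curl -s http://localhost:2379/health

# NATS health check (default monitoring port 8222, if exposed).
curl -s http://localhost:8222/healthz
```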
### Build container

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

# On an x86 machine:
./container/build.sh --framework trtllm

# On an ARM machine:
./container/build.sh --framework trtllm --platform linux/arm64

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
```

### Run container

```bash
./container/run.sh --framework trtllm -it
```

## Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each script simply runs `python3 -m dynamo.frontend` to start the ingress and `python3 -m dynamo.trtllm` to start the workers. You can easily take each command and run it in a separate terminal.

For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](/dynamo/v-0-9-0/components/router/router-guide).

### Aggregated

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

### Aggregated with KV Routing

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```

### Disaggregated

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```

### Disaggregated with KV Routing

> [!IMPORTANT]
> In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```

### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

```bash
cd $DYNAMO_HOME/examples/backends/trtllm

export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```

Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark (an example request is shown below).
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
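As mentioned in the note above, a warm-up request can be a small chat completion sent to the frontend. The sketch below assumes the frontend's default port of 8000 and the `SERVED_MODEL_NAME` exported above; adjust both to match your deployment.

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/DeepSeek-R1-FP4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
        "stream": false
      }'
```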
## Advanced Examples

Below we provide a selected list of advanced examples. Please open an issue if you'd like to see a specific example!

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](/dynamo/v-0-9-0/additional-resources/tensor-rt-llm-details/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files.

See the [Llama4+eagle](/dynamo/v-0-9-0/additional-resources/tensor-rt-llm-details/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on a single node.

### Speculative Decoding

- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](/dynamo/v-0-9-0/additional-resources/tensor-rt-llm-details/llama-4-eagle)**

### Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).

## KV Cache Transfer in Disaggregated Serving

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](/dynamo/v-0-9-0/additional-resources/tensor-rt-llm-details/kv-cache-transfer).

## Request Migration

You can enable [request migration](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3
```

> [!IMPORTANT]
> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.

See the [Request Migration Architecture](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-migration) documentation for details on how this works.

## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

For more details, see the [Request Cancellation Architecture](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-cancellation) documentation.

## Client

See the [client](/dynamo/v-0-9-0/components/backends/sg-lang#testing-the-deployment) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend`.

## Benchmarking

To benchmark your deployment with AIPerf, see this utility script, configuring the `model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)

## Multimodal Support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](/dynamo/v-0-9-0/user-guides/multimodality-support/tensor-rt-llm-multimodal).

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

### How it works

- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor`, which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
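To make the interface concrete, here is a minimal illustrative processor (not part of Dynamo) that masks a fixed set of banned token IDs at every step. The temperature-scaling walkthrough under "Bring your own processor" below shows how such processors are then wired into TRT-LLM.

```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class BanTokensProcessor(BaseLogitsProcessor):
    """Illustrative example: prevent specific token IDs from ever being sampled."""

    def __init__(self, banned_token_ids: Sequence[int]):
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor) -> None:
        # Modify logits in-place, as the interface requires; do not return a new tensor.
        if self.banned_token_ids:
            logits[..., self.banned_token_ids] = float("-inf")
```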
### Quick test: HelloWorld processor

You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```

Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- The expected chat response contains "Hello world".

### Bring your own processor

Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:

```python
from typing import Sequence

import torch

from dynamo.logits_processing import BaseLogitsProcessor


class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)
```

Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:

```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```

### Current limitations

- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

## Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance-sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.

## Dynamo KV Block Manager Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests. For setup instructions, see [Running KVBM in TensorRT-LLM](/dynamo/v-0-9-0/components/kvbm/kvbm-guide#run-kvbm-in-dynamo-with-tensorrt-llm).
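To illustrate the multi-turn pattern mentioned above, the sketch below sends two chat requests where the second resends the first exchange plus a follow-up question, so most of its prompt is a prefix whose KV blocks can be reused. The port, model name, and message contents are placeholders; substitute values from your own deployment.

```bash
# Turn 1: establish a long shared context (placeholder model name and port).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/DeepSeek-R1-FP4",
       "messages": [{"role": "user", "content": "Summarize the following document: ..."}]}'

# Turn 2: resend the same history plus a follow-up question.
# The shared prefix can be served from reused KV blocks, lowering TTFT.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/DeepSeek-R1-FP4",
       "messages": [{"role": "user", "content": "Summarize the following document: ..."},
                    {"role": "assistant", "content": "<previous reply>"},
                    {"role": "user", "content": "Now list three key takeaways."}]}'
```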