TensorRT-LLM

View as Markdown

LLM Deployment using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

GitHub Release

You can find the latest release here and check out the corresponding branch with:

$git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Table of Contents

Feature Support Matrix

Core Dynamo Features

FeatureTensorRT-LLMNotes
Disaggregated Serving
Conditional Disaggregation🚧Not supported yet
KV-Aware Routing
SLA-Based Planner
Load Based Planner🚧Planned
KVBM

Large Scale P/D and WideEP Features

FeatureTensorRT-LLMNotes
WideEP
DP Rank Routing
GB200 Support

TensorRT-LLM Quick Start

Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

Start Infrastructure Services (Local Development Only)

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

$docker compose -f deploy/docker-compose.yml up -d
  • etcd is optional but is the default local discovery backend. You can also use --discovery-backend file to use file system based discovery.
  • NATS is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use --no-router-kv-events on the frontend for prediction-based routing without events
  • On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD)

Build container

$# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
$apt-get update && apt-get -y install git git-lfs
$
$# On an x86 machine:
$python container/render.py --framework=trtllm --target=runtime --output-short-filename --cuda-version=13.1
$docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
$
$# On an ARM machine:
$python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename --cuda-version=13.1
$docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .

Run container

$./container/run.sh --framework trtllm -it

Single Node Examples

Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the python3 -m dynamo.frontend <args> to start up the ingress and using python3 -m dynamo.trtllm <args> to start up the workers. You can easily take each command and run them in separate terminals.

For detailed information about the architecture and how KV-aware routing works, see the Router Guide.

Aggregated

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/agg.sh

Aggregated with KV Routing

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/agg_router.sh

Disaggregated

$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg.sh

Disaggregated with KV Routing

In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
$cd $DYNAMO_HOME/examples/backends/trtllm
$./launch/disagg_router.sh

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

$cd $DYNAMO_HOME/examples/backends/trtllm
$
$export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
$export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
$# nvidia/DeepSeek-R1-FP4 is a large model
$export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
$./launch/agg.sh

Notes:

  • There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
  • MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you’d like to see a specific example!

Multinode Deployment

For comprehensive instructions on multinode serving, see the multinode-examples.md guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see Llama4+eagle guide to learn how to use these scripts when a single worker fits on the single node.

Speculative Decoding

Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see TensorRT-LLM Kubernetes Deployment Guide.

Client

See client section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

KV Cache Transfer in Disaggregated Serving

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV cache transfer guide.

Request Migration

Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.

Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

Cancellation Support Matrix

PrefillDecode
Aggregated
Disaggregated

For more details, see the Request Cancellation Architecture documentation.

Client

See client section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

Multimodal support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the TensorRT-LLM Multimodal Guide.

Video Diffusion Support (Experimental)

Dynamo supports video generation using diffusion models through the --modality video_diffusion flag.

Requirements

  • visual_gen: Part of TensorRT-LLM, located at tensorrt_llm/visual_gen/. Currently available only on the feat/visual_gen branch (not yet merged to main or any release). Install from source:
    $git clone https://github.com/NVIDIA/TensorRT-LLM.git
    $cd TensorRT-LLM && git checkout feat/visual_gen
    $cd tensorrt_llm/visual_gen && pip install -e .
  • dynamo-runtime with video API: The Dynamo runtime must include ModelType.Videos support. Ensure you’re using a compatible version.

Supported Models

Diffusers PipelineDescriptionExample Model
WanPipelineWan 2.1/2.2 Text-to-VideoWan-AI/Wan2.1-T2V-1.3B-Diffusers

The pipeline type is auto-detected from the model’s model_index.json — no --model-type flag is needed.

Quick Start

$python -m dynamo.trtllm \
> --modality video_diffusion \
> --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
> --media-output-fs-url file:///tmp/dynamo_media

API Endpoint

Video generation uses the /v1/videos endpoint:

$curl -X POST http://localhost:8000/v1/videos \
> -H "Content-Type: application/json" \
> -d '{
> "prompt": "A cat playing piano",
> "model": "wan_t2v",
> "seconds": 4,
> "size": "832x480",
> "nvext": {
> "fps": 24
> }
> }'

Configuration Options

FlagDescriptionDefault
--media-output-fs-urlFilesystem URL for storing generated mediafile:///tmp/dynamo_media
--default-heightDefault video height480
--default-widthDefault video width832
--default-num-framesDefault frame count81
--enable-teacacheEnable TeaCache optimizationFalse
--disable-torch-compileDisable torch.compileFalse

Limitations

  • Video diffusion is experimental and not recommended for production use
  • Only text-to-video is supported in this release (image-to-video planned)
  • Requires GPU with sufficient VRAM for the diffusion model

Logits Processing

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

How it works

  • Interface: Implement dynamo.logits_processing.BaseLogitsProcessor which defines __call__(input_ids, logits) and modifies logits in-place.
  • TRT-LLM adapter: Use dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...) to convert Dynamo processors into TRT-LLM-compatible processors and assign them to SamplingParams.logits_processor.
  • Examples: See example processors in lib/bindings/python/src/dynamo/logits_processing/examples/ (temperature, hello_world).

Quick test: HelloWorld processor

You can enable a test-only processor that forces the model to respond with “Hello world!”. This is useful to verify the wiring without modifying your model or engine code.

$cd $DYNAMO_HOME/examples/backends/trtllm
$export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
$./launch/agg.sh

Notes:

  • When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
  • Expected chat response contains “Hello world”.

Bring your own processor

Implement a processor by conforming to BaseLogitsProcessor and modify logits in-place. For example, temperature scaling:

1from typing import Sequence
2import torch
3from dynamo.logits_processing import BaseLogitsProcessor
4
5class TemperatureProcessor(BaseLogitsProcessor):
6 def __init__(self, temperature: float = 1.0):
7 if temperature <= 0:
8 raise ValueError("Temperature must be positive")
9 self.temperature = temperature
10
11 def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
12 if self.temperature == 1.0:
13 return
14 logits.div_(self.temperature)

Wire it into TRT-LLM by adapting and attaching to SamplingParams:

1from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
2from dynamo.logits_processing.examples import TemperatureProcessor
3
4processors = [TemperatureProcessor(temperature=0.7)]
5sampling_params.logits_processor = create_trtllm_adapters(processors)

Current limitations

  • Per-request processing only (batch size must be 1); beam width > 1 is not supported.
  • Processors must modify logits in-place and not return a new tensor.
  • If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

DP Rank Routing (Attention Data Parallelism)

TensorRT-LLM supports attention data parallelism (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.

Dynamo vs TRT-LLM Internal Routing

  • Dynamo DP Rank Routing: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (attention_dp_relax=False). Use this with --router-mode kv for cache-aware routing.
  • TRT-LLM Internal Routing: TRT-LLM’s scheduler assigns DP ranks internally. Use this with --router-mode round-robin or random when KV-aware routing isn’t needed.

Enabling DP Rank Routing

$# Worker with attention DP
$# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
$CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
> --model-path <MODEL_PATH> \
> --tensor-parallel-size 2 \
> --enable-attention-dp \
> --publish-events-and-metrics
$
$# Frontend with KV routing
$python3 -m dynamo.frontend --router-mode kv

The --enable-attention-dp flag sets attention_dp_size = tensor_parallel_size and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each (worker_id, dp_rank) combination.

Attention DP requires TRT-LLM’s PyTorch backend. AutoDeploy does not support attention DP.

Performance Sweep

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model. This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.

Dynamo KV Block Manager Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

Here is the instruction: Running KVBM in TensorRT-LLM .

Known Issues and Mitigations

KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)

Issue: In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.

Symptoms:

  • Workers function normally initially but hang after heavy load testing
  • Inference requests get stuck and eventually timeout
  • Logs show warnings: num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache
  • Error logs may contain: asyncio.exceptions.InvalidStateError: invalid state

Root Cause: When max_tokens_in_buffer in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.

Mitigation: Ensure max_tokens_in_buffer exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., prefill.yaml and decode.yaml):

1cache_transceiver_config:
2 backend: DEFAULT
3 max_tokens_in_buffer: 65536 # Must exceed max ISL

For example, see examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml.

Related Issue: #4327