Examples
For quick start instructions, see the TensorRT-LLM README. This document covers the deployment patterns for running TensorRT-LLM with Dynamo: single-node, multi-node, and Kubernetes.
Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
- etcd is optional but is the default local discovery backend. You can also pass `--discovery-backend file` to use file-system-based discovery.
- NATS is optional and only needed when using KV routing with events. Workers must be explicitly configured to publish events; use `--no-router-kv-events` on the frontend for prediction-based routing without events.
- On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (via the DynamoWorkerMetadata CRD).
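As a sketch, local startup might look like the following (the compose file path and service definitions are assumptions; adjust to wherever your deployment's `docker-compose.yml` defining etcd and NATS lives):

```shell
# Start etcd and NATS for local development.
# The compose file path below is an assumption, not a fixed repo location.
docker compose -f deploy/docker-compose.yml up -d

# Alternatively, skip etcd entirely and use file-system based discovery:
#   python3 -m dynamo.frontend --discovery-backend file
```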
Each launch script runs the frontend and worker(s) in a single terminal; for testing, you can run each command separately in different terminals. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers.
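For example, a minimal aggregated launch run manually in two terminals might look like this (the HTTP port and model path are illustrative assumptions; the launch scripts in this directory pass the exact arguments for each pattern):

```shell
# Terminal 1: start the frontend/ingress (port is an assumption)
python3 -m dynamo.frontend --http-port 8000

# Terminal 2: start a TensorRT-LLM worker (model path is hypothetical)
python3 -m dynamo.trtllm --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```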
For detailed information about the architecture and how KV-aware routing works, see the Router Guide.
Single Node Examples
Aggregated
Aggregated with KV Routing
Disaggregated
Disaggregated with KV Routing
In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
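As a rough sketch, a disaggregated launch runs separate prefill and decode workers behind a KV-aware frontend (the `--disaggregation-mode` flag, port, and model placeholder below are assumptions based on common Dynamo + TensorRT-LLM usage; the launch scripts in this directory set the exact arguments):

```shell
# Frontend with KV-aware routing (port is an assumption)
python3 -m dynamo.frontend --router-mode kv --http-port 8000 &

# Prefill worker (flag name is an assumption; substitute your model)
python3 -m dynamo.trtllm --model-path <model> --disaggregation-mode prefill &

# Decode worker
python3 -m dynamo.trtllm --model-path <model> --disaggregation-mode decode &
```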
Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
- There is noticeable latency on the first two inference requests; send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
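For instance, warm-up requests can be sent to the frontend's OpenAI-compatible endpoint before benchmarking (the host, port, and model name below are assumptions; substitute your deployment's values):

```shell
# Send a couple of warm-up requests so the first benchmark samples
# are not skewed by one-time initialization latency.
for i in 1 2; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1",
          "messages": [{"role": "user", "content": "Warm-up request"}],
          "max_tokens": 16
        }' > /dev/null
done
```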
Advanced Examples
Multinode Deployment
For comprehensive instructions on multinode serving, see the Multinode Examples guide, which provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. The walkthrough uses DeepSeek-R1 as the model, but you can adapt the process for any supported model by updating the relevant configuration files. See the Llama4 + Eagle guide to learn how to use these scripts when a single worker fits on a single node.
Speculative Decoding
Model-Specific Guides
- Gemma3 with Sliding Window Attention
- GPT-OSS-120b — Reasoning model with tool calling support
Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the TensorRT-LLM Kubernetes Deployment Guide.
Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model.
Client
See the client section to learn how to send requests to the deployment.
To send a request to a multi-node deployment, target the node that is running `python3 -m dynamo.frontend <args>`.
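For example, a single request to the frontend's OpenAI-compatible endpoint might look like this (host, port, and model name are assumptions; use the values from your deployment):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Explain KV cache reuse in one sentence."}],
        "max_tokens": 64
      }'
```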
Benchmarking
To benchmark your deployment with AIPerf, see the `perf.sh` utility script, configuring the model name and host to match your deployment.