---
title: Examples
---
For quick start instructions, see the [TensorRT-LLM README](/dynamo/dev/backends/tensor-rt-llm). This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments.
## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Client](#client)
- [Benchmarking](#benchmarking)
## Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
- **etcd** is optional but is the default local discovery backend. Alternatively, pass `--discovery-backend file` to use file-system-based discovery.
- **NATS** is optional and only needed when using KV routing with events. Workers must be explicitly configured to publish events; use `--no-router-kv-events` on the frontend for prediction-based routing without events.
- **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
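Putting those options together, a minimal local frontend that needs neither etcd nor NATS might be started as follows. This is a sketch using only the flags described above; check `python3 -m dynamo.frontend --help` for the full interface:

```shell
# File-system discovery instead of etcd; prediction-based KV routing instead of NATS events
python3 -m dynamo.frontend --discovery-backend file --no-router-kv-events
```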
Each launch script runs the frontend and worker(s) in a single terminal; for testing, you can instead run each command separately in different terminals. Each shell script simply runs `python3 -m dynamo.frontend` to start the ingress and `python3 -m dynamo.trtllm` to start the workers.
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](/dynamo/dev/components/router/router-guide).
## Single Node Examples
### Aggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
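Once the launch script reports the deployment is ready, you can check which models the frontend is serving. The port and the `/v1/models` endpoint are assumptions based on the frontend's OpenAI-compatible API; substitute the values your launch script uses:

```shell
# List the models registered with the frontend (port 8000 is an assumption)
curl localhost:8000/v1/models
```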
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```
### Disaggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```
### Disaggregated with KV Routing
In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```
- The first two inference requests have noticeably higher latency; send warm-up requests before starting the benchmark.
- MTP performance may vary with the acceptance rate of predicted tokens, which depends on the dataset or queries used during benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
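A simple warm-up loop before benchmarking might look like the following. The port, endpoint, and payload are illustrative; the model name matches the `SERVED_MODEL_NAME` exported above:

```shell
# Send a few warm-up requests to absorb first-request latency before benchmarking
for i in 1 2; do
  curl -s localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/DeepSeek-R1-FP4",
         "messages": [{"role": "user", "content": "warm-up"}],
         "max_tokens": 8}' > /dev/null
done
```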
## Advanced Examples
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [Multinode Examples](/dynamo/dev/additional-resources/tensor-rt-llm-details/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can adapt the process for any supported model by updating the relevant configuration files. See the [Llama4 + Eagle](/dynamo/dev/additional-resources/tensor-rt-llm-details/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on a single node.
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](/dynamo/dev/additional-resources/tensor-rt-llm-details/llama-4-eagle)**
### Model-Specific Guides
- **[Gemma3 with Sliding Window Attention](/dynamo/dev/additional-resources/tensor-rt-llm-details/gemma-3-sliding-window)**
- **[GPT-OSS-120b](/dynamo/dev/additional-resources/tensor-rt-llm-details/gpt-oss)** — Reasoning model with tool calling support
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md).
### Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md).
## Client
See the [client](/dynamo/dev/backends/sg-lang#testing-the-deployment) section to learn how to send requests to the deployment.
To send a request to a multi-node deployment, target the node running `python3 -m dynamo.frontend`.
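For example, a chat completion request against the frontend might look like this. The host, port, and model name are placeholders; use the served model name and address from your deployment:

```shell
# OpenAI-compatible chat completion request (host, port, and model are placeholders)
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```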
## Benchmarking
To benchmark your deployment with AIPerf, see the [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh) utility script, configuring the `model` name and `host` to match your deployment.