--- title: Examples --- For quick start instructions, see the [TensorRT-LLM README](/dynamo/dev/backends/tensor-rt-llm). This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments. ## Table of Contents - [Infrastructure Setup](#infrastructure-setup) - [Single Node Examples](#single-node-examples) - [Advanced Examples](#advanced-examples) - [Client](#client) - [Benchmarking](#benchmarking) ## Infrastructure Setup For local/bare-metal development, start etcd and optionally NATS using Docker Compose: ```bash docker compose -f deploy/docker-compose.yml up -d ``` - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery. - **NATS** is optional - only needed if using KV routing with events. Workers must be explicitly configured to publish events. Use `--no-router-kv-events` on the frontend for prediction-based routing without events. - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD). Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. Each shell script simply runs `python3 -m dynamo.frontend ` to start up the ingress and `python3 -m dynamo.trtllm ` to start up the workers. For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](/dynamo/dev/components/router/router-guide). ## Single Node Examples ### Aggregated ```bash cd $DYNAMO_HOME/examples/backends/trtllm ./launch/agg.sh ``` ### Aggregated with KV Routing ```bash cd $DYNAMO_HOME/examples/backends/trtllm ./launch/agg_router.sh ``` ### Disaggregated ```bash cd $DYNAMO_HOME/examples/backends/trtllm ./launch/disagg.sh ``` ### Disaggregated with KV Routing In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse. ```bash cd $DYNAMO_HOME/examples/backends/trtllm ./launch/disagg_router.sh ``` ### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1 ```bash cd $DYNAMO_HOME/examples/backends/trtllm export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" # nvidia/DeepSeek-R1-FP4 is a large model export MODEL_PATH="nvidia/DeepSeek-R1-FP4" ./launch/agg.sh ``` - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. - MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. ## Advanced Examples ### Multinode Deployment For comprehensive instructions on multinode serving, see the [Multinode Examples](/dynamo/dev/additional-resources/tensor-rt-llm-details/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the [Llama4 + Eagle](/dynamo/dev/additional-resources/tensor-rt-llm-details/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on a single node. ### Speculative Decoding - **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](/dynamo/dev/additional-resources/tensor-rt-llm-details/llama-4-eagle)** ### Model-Specific Guides - **[Gemma3 with Sliding Window Attention](/dynamo/dev/additional-resources/tensor-rt-llm-details/gemma-3-sliding-window)** - **[GPT-OSS-120b](/dynamo/dev/additional-resources/tensor-rt-llm-details/gpt-oss)** — Reasoning model with tool calling support ### Kubernetes Deployment For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md). ### Performance Sweep For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). ## Client See the [client](/dynamo/dev/backends/sg-lang#testing-the-deployment) section to learn how to send requests to the deployment. To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend `. ## Benchmarking To benchmark your deployment with AIPerf, see this utility script, configuring the `model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)