Reference Guide

Features, configuration, and operational details for the TensorRT-LLM backend


Building a Custom Container

To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the Building a Custom Container guide.

KV Cache Transfer

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV Cache Transfer Guide.

Request Migration

Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.

Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
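Conceptually, this mirrors cooperative task cancellation: when the client goes away, the in-flight generation is cancelled and its resources are released. The sketch below is illustrative only (it is not Dynamo's API); `stream_tokens` stands in for a worker generating tokens.

```python
import asyncio

async def stream_tokens(out: list) -> None:
    """Stand-in for a worker generating tokens (illustrative, not Dynamo's API)."""
    try:
        for i in range(1000):
            out.append(f"tok{i}")
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        # On cancellation the worker stops generating and frees its resources.
        out.append("<cancelled>")
        raise

async def main() -> list:
    out: list = []
    task = asyncio.create_task(stream_tokens(out))
    await asyncio.sleep(0.05)  # the user disconnects mid-generation
    task.cancel()              # cancellation propagates to the generation task
    try:
        await task
    except asyncio.CancelledError:
        pass
    return out

tokens = asyncio.run(main())
print(tokens[-1])  # "<cancelled>"
```

The key property, as in Dynamo's request cancellation, is that the generation stops early instead of running to completion for a client that is no longer listening.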

Cancellation Support Matrix

|               | Prefill | Decode |
|---------------|---------|--------|
| Aggregated    |         |        |
| Disaggregated |         |        |

For more details, see the Request Cancellation Architecture documentation.

Multimodal Support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the TensorRT-LLM Multimodal Guide.
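As a minimal sketch of what such a request can look like, the payload below uses the OpenAI-style chat-completions shape with an `image_url` content part. The model id and image URL are placeholders, not values from this guide; consult the TensorRT-LLM Multimodal Guide for supported models and the exact request format.

```python
import json

# Placeholder model id and image URL -- adjust to your deployment.
payload = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "http://images.example/cat.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
    "stream": False,
}

# This JSON body would be POSTed to the frontend's chat-completions endpoint.
body = json.dumps(payload)
print(body[:30])
```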

Video Diffusion Support (Experimental)

Dynamo supports video generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the Video Diffusion Guide.

Logits Processing

Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the Logits Processing Guide.
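To make the idea concrete, here is a toy processor that masks out one token id at every step. The class and call signature are illustrative assumptions, not Dynamo's actual interface; see the Logits Processing Guide for the real API.

```python
import math

class BanTokenProcessor:
    """Toy logits processor: forbid one token id at every decoding step.

    The shape of this class (constructor + __call__ taking the generated
    token ids and the logits) is a hypothetical sketch, not Dynamo's API.
    """

    def __init__(self, banned_id: int):
        self.banned_id = banned_id

    def __call__(self, token_ids: list[int], logits: list[float]) -> list[float]:
        out = list(logits)
        out[self.banned_id] = -math.inf  # the banned token can never be sampled
        return out

proc = BanTokenProcessor(banned_id=2)
logits = proc([1, 5], [0.1, 2.0, 3.5, 0.7])
best = max(range(len(logits)), key=logits.__getitem__)
print(best)  # 1 -- token 2 had the highest raw logit but is masked out
```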

DP Rank Routing (Attention Data Parallelism)

TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the DP Rank Routing Guide.

KVBM Integration

Dynamo with TensorRT-LLM supports integration with the Dynamo KV Block Manager (KVBM). This integration can significantly reduce time-to-first-token (TTFT), particularly for usage patterns such as multi-turn conversations and repeated long-context requests.
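The TTFT benefit comes from reusing cached KV blocks for a shared prompt prefix instead of recomputing prefill. The sketch below illustrates the idea with chained block hashes, where equal hashes imply an identical prefix; the block size and hashing scheme are toy values for illustration, not KVBM's actual design.

```python
import hashlib

BLOCK = 4  # tokens per KV block (toy value, not KVBM's actual block size)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block, chaining in the previous block's hash so that
    equal hashes imply an identical prefix (conceptual sketch only)."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256(prev + str(tokens[i:i + BLOCK]).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes

turn1 = list(range(10))                          # first turn of a conversation
turn2 = list(range(10)) + [99, 100, 101, 102]    # same prefix plus a new turn

h1, h2 = block_hashes(turn1), block_hashes(turn2)
reused = sum(a == b for a, b in zip(h1, h2))
print(reused, len(h2))  # 2 3 -- two prefix blocks can skip prefill entirely
```

In a multi-turn conversation, each follow-up request shares all earlier blocks with the previous request, so only the new suffix needs prefill, which is what drives the TTFT reduction.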

For setup instructions, see Running KVBM in TensorRT-LLM.

Observability

TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the Prometheus Metrics Guide.
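These metrics are served in the standard Prometheus text exposition format, so any Prometheus-compatible scraper can collect them. The snippet below parses a small sample of that format; the metric names shown are placeholders, not the backend's actual metric names, which are listed in the Prometheus Metrics Guide.

```python
# Sample Prometheus text exposition output. Metric names are placeholders.
SAMPLE = """\
# HELP example_requests_total Total requests served.
# TYPE example_requests_total counter
example_requests_total{model="demo"} 42
# HELP example_ttft_seconds Time to first token.
# TYPE example_ttft_seconds gauge
example_ttft_seconds{model="demo"} 0.135
"""

def parse(text: str) -> dict[str, float]:
    """Map each metric line (name plus labels) to its numeric value,
    skipping comment lines (# HELP / # TYPE)."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name_labels, value = line.rsplit(" ", 1)
        metrics[name_labels] = float(value)
    return metrics

metrics = parse(SAMPLE)
print(metrics['example_requests_total{model="demo"}'])  # 42.0
```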

Known Issues and Mitigations

For known issues, workarounds, and mitigations, see the Known Issues and Mitigations page.