--- title: Reference Guide subtitle: 'Features, configuration, and operational details for the TensorRT-LLM backend' --- ## Building a Custom Container To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](/dynamo/dev/additional-resources/tensor-rt-llm-details/building-a-custom-container) guide. ## KV Cache Transfer Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](/dynamo/dev/additional-resources/tensor-rt-llm-details/kv-cache-transfer). ## Request Migration Dynamo supports [request migration](/dynamo/dev/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](/dynamo/dev/user-guides/fault-tolerance/request-migration) documentation for configuration details. ## Request Cancellation When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests. ### Cancellation Support Matrix | | Prefill | Decode | |-|---------|--------| | **Aggregated** | ✅ | ✅ | | **Disaggregated** | ✅ | ✅ | For more details, see the [Request Cancellation Architecture](/dynamo/dev/user-guides/fault-tolerance/request-cancellation) documentation. ## Multimodal Support Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](/dynamo/dev/user-guides/multimodality-support/tensor-rt-llm-multimodal). ## Video Diffusion Support (Experimental) Dynamo supports video generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Video Diffusion Guide](/dynamo/dev/backends/tensor-rt-llm/video-diffusion-experimental). ## Logits Processing Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](/dynamo/dev/additional-resources/tensor-rt-llm-details/logits-processing). ## DP Rank Routing (Attention Data Parallelism) TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](/dynamo/dev/additional-resources/tensor-rt-llm-details/dp-rank-routing). ## KVBM Integration Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests. See the instructions here: [Running KVBM in TensorRT-LLM](/dynamo/dev/components/kvbm/kvbm-guide#run-kvbm-in-dynamo-with-tensorrt-llm). ## Observability TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Prometheus Metrics Guide](/dynamo/dev/backends/tensor-rt-llm/prometheus-metrics). ## Known Issues and Mitigations For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](/dynamo/dev/backends/tensor-rt-llm/known-issues-and-mitigations) page.