---
title: Multimodal Model Serving
subtitle: 'Deploy multimodal models with image, video, and audio support in Dynamo'
---

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.

## Key Features

```mermaid
---
title: Sample flow for an aggregated VLM serving scenario
---
flowchart TD
    A[Request] --> B{KV cache hit?}
    B -->|Yes| C[Use KV]
    B -->|No| D{Embedding cache hit?}
    D -->|Yes| E[Load embedding]
    D -->|No| F[Run encoder]
    F --> G[Save to cache]
    G --> H["PREFILL (image tokens + text tokens → KV cache)"]
    E --> H
    C --> I[DECODE]
    H --> I
    I --> J[Response]
```

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or separately depending on your workload characteristics:

| Feature | Description |
|---------|-------------|
| **[Embedding Cache](embedding-cache.md)** | CPU-side LRU cache that skips re-encoding repeated images |
| **[Encoder Disaggregation](encoder-disaggregation.md)** | Separate vision encoder worker for independent scaling |
| **[Multimodal KV Routing](multimodal-kv-routing.md)** | MM-aware KV cache routing for optimal worker selection |

## Support Matrix

| Stack | Image | Video | Audio |
|-------|-------|-------|-------|
| **[vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)** | ✅ | 🧪 | 🧪 |
| **[TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)** | ✅ | ❌ | ❌ |
| **[SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)** | ✅ | ❌ | ❌ |

**Status:** ✅ Supported | 🧪 Experimental | ❌ Not supported

### Input Format Support

| Format | SGLang | TRT-LLM | vLLM |
|--------|--------|---------|------|
| HTTP/HTTPS URL | ✅ | ✅ | ✅ |
| Data URL (Base64) | ❌ | ❌ | ✅ |
| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |

## Example Workflows

Reference implementations for deploying multimodal models:

- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
- [Experimental multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)

## Backend Documentation

Detailed deployment guides, configuration, and examples for each backend:

- **[vLLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)**
- **[TensorRT-LLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)**
- **[SGLang Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)**
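The embedding-cache path in the flowchart at the top of this page (hash the image, reuse a stored embedding on a hit, run the encoder and save on a miss) can be sketched as a small content-addressed LRU. This is an illustrative sketch only; the class, method names, and eviction policy below are assumptions for exposition, not Dynamo's actual embedding-cache implementation (see the [Embedding Cache](embedding-cache.md) guide for that).

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """Illustrative CPU-side LRU cache keyed by a hash of raw image bytes.

    Hypothetical sketch of the flow in the diagram; not Dynamo's API.
    """

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._entries = OrderedDict()  # image hash -> embedding

    @staticmethod
    def _key(image_bytes):
        # Content hash, so identical images hit regardless of source URL.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes, encoder):
        key = self._key(image_bytes)
        if key in self._entries:
            # Embedding cache hit: skip re-running the vision encoder.
            self._entries.move_to_end(key)
            return self._entries[key]
        embedding = encoder(image_bytes)   # cache miss: run encoder
        self._entries[key] = embedding     # save to cache
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return embedding
```

With such a cache in front of the encoder, a repeated image costs one dictionary lookup instead of a full vision-encoder forward pass, which is why the diagram checks the embedding cache before the encoder.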
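The Input Format Support table distinguishes two ways of attaching an image to an OpenAI-style chat request: referencing it by HTTP/HTTPS URL (all three backends) or inlining it as a base64 data URL (vLLM only). A minimal sketch of both payload shapes, assuming an OpenAI-compatible frontend; the endpoint, port, and model id here are placeholders to substitute with your deployment's values:

```python
import base64
import json

# Placeholders; replace with your deployment's endpoint and served model name.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "my-vlm-model"


def image_url_part(url):
    """Reference an image by HTTP/HTTPS URL (supported by all three backends)."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_data_url_part(image_bytes, mime="image/png"):
    """Inline an image as a base64 data URL (vLLM only, per the table above)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                image_url_part("https://example.com/cat.png"),
            ],
        }
    ],
}
print(json.dumps(payload, indent=2))
```

To POST the payload, any HTTP client works (e.g. `curl -d @payload.json $ENDPOINT`); per the security note above, the backend must have been started with multimodal processing explicitly enabled or such requests will be rejected.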