vLLM Multimodal
This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
Security Requirement: All multimodal workers require the --enable-multimodal flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if a multimodal worker mode is enabled without --enable-multimodal. This flag is analogous to --enable-mm-embeds in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
Support Matrix
Supported URL Formats
Deployment Patterns
The main multimodal vLLM launchers in this repo are:
Image/Video Serving
Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). Qwen/Qwen3-VL-2B-Instruct is a good example because the same model can handle both image_url and video_url requests through the standard OpenAI chat endpoint.
Aggregated Serving
Use the single-worker aggregated launcher for the simplest image/video setup:
Image request:
Video request:
E/PD Serving (Encode + PD)
Use disagg_multimodal_e_pd.sh when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments.
When a separate encode worker is deployed with the current vLLM path, only image_url inputs are routed to it. video_url inputs are still processed on the combined PD worker.
E/P/D Serving (Full Disaggregation)
Use disagg_multimodal_epd.sh when you want separate encode, prefill, and decode workers for multimodal workloads.
In the current vLLM implementation, the separate encode worker is only used for image_url inputs. video_url inputs are still processed on the prefill worker, not on the encode worker.
Audio Serving
Dynamo supports audio_url requests for audio-capable models. Audio is loaded by the backend worker via vLLM’s AudioMediaIO at native sample rate — vLLM’s model-specific processor handles resampling and feature extraction internally. Omni models can handle image_url, video_url, and audio_url in the same request.
Aggregated Serving
Use the same aggregated multimodal launcher with an audio-capable model:
Audio request:
Embedding Cache
Dynamo supports embedding cache in both aggregated and disaggregated settings:
Aggregated Worker
A single vLLM instance caches encoded embeddings on CPU so repeated images skip encoding entirely. Supported natively with vLLM 0.17+.
Launch with Dynamo:
dynamo.vllm automatically configures ec_both mode with the DynamoMultimodalEmbeddingCacheConnector when the capacity is > 0.
Launch with vllm serve (standalone, no Dynamo):
The multimodal_embedding_cache_capacity_gb parameter controls the CPU-side LRU cache size in GB (0 = disabled). Requires vLLM 0.17+.
Disaggregated Encoder (Embedding Cache in Prefill Worker)
In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (EmbeddingCacheManager). On each request P checks the cache first — on a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the vLLM Instance for prefill.
Launch:
Client: Use the same image_url request format shown in Aggregated Serving.
LoRA Adapters on Multimodal Workers
Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.
Loading a LoRA Adapter
Load an adapter on a running multimodal worker via the load_lora endpoint:
Sending Requests with a LoRA
Set the model field in the request to the LoRA adapter name:
Requests without a LoRA name (or with the base model name) will use the base model.
Unloading a LoRA Adapter
Listing Loaded Adapters
Disaggregated Mode
In disaggregated (prefill/decode) deployments, the same LoRA adapter must be loaded on both the prefill and decode workers. The LoRA identity (model field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.
If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
Supported Models
For a list of multimodal models supported by vLLM, see vLLM Supported Multimodal Models. Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.