vLLM Multimodal

This document is a guide to multimodal inference with the vLLM backend in Dynamo.

Security Requirement: All multimodal workers require the --enable-multimodal flag to be set explicitly at startup. This is a security measure that prevents unintended processing of multimodal data from untrusted sources: a worker fails at startup if a multimodal worker mode is enabled without --enable-multimodal. The flag is analogous to --enable-mm-embeds in vllm serve, but extends the requirement to all multimodal content (URLs, embeddings, and base64 data).

Support Matrix

| Modality | Aggregated | Disaggregated |
|----------|------------|---------------|
| Image    | Yes        | Yes           |
| Video    | Yes        | Yes           |
| Audio    | Yes        | No            |

Supported URL Formats

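| Format     | Example                            | Description                |
|------------|------------------------------------|----------------------------|
| HTTP/HTTPS | http://example.com/image.jpg       | Remote media files         |
| Data URL   | data:image/jpeg;base64,/9j/4AAQ... | Base64-encoded inline data |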
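
To make the data URL row concrete, here is a minimal Python sketch that turns raw media bytes into a base64 data URL. The helper name `to_data_url` and the fallback MIME type are assumptions of this sketch, not part of Dynamo or vLLM.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw media bytes as a base64 data URL.

    The MIME type is guessed from the file name; the fallback type below
    is an assumption of this sketch.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


# First six bytes of a JPEG file (SOI marker plus the start of a JFIF header).
jpeg_prefix = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])
print(to_data_url(jpeg_prefix, "image.jpg"))  # data:image/jpeg;base64,/9j/4AAQ
```

In practice you would pass the full file contents, for example `open(path, "rb").read()`, and use the result as the `url` value of a media content part.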

Deployment Patterns

The main multimodal vLLM launchers in this repo are:

| Pattern                     | Device | Launch Script              | Best For |
|-----------------------------|--------|----------------------------|----------|
| Aggregated                  | cuda   | agg_multimodal.sh          | Simplest image/video serving from a single multimodal worker on CUDA devices |
| Aggregated                  | xpu    | xpu/agg_multimodal_xpu.sh  | Simplest image/video serving from a single multimodal worker on XPU devices |
| E/PD (Encode + PD)          | cuda   | disagg_multimodal_e_pd.sh  | Simple example of separating the encoder; good for testing embedding-cache workflows |
| E/P/D (Full Disaggregation) | cuda   | disagg_multimodal_epd.sh   | Disaggregated image/video serving with separate encode, prefill, and decode workers on CUDA devices |

Image/Video Serving

Dynamo supports multimodal image and video requests for Vision Language Models (VLMs). Qwen/Qwen3-VL-2B-Instruct is a good example because the same model can handle both image_url and video_url requests through the standard OpenAI chat endpoint.

Aggregated Serving

Use the single-worker aggregated launcher for the simplest image/video setup:

$ cd $DYNAMO_HOME/examples/backends/vllm

# GPU deployment
$ bash launch/agg_multimodal.sh --model Qwen/Qwen3-VL-2B-Instruct

# XPU deployment
$ bash launch/xpu/agg_multimodal_xpu.sh --model Qwen/Qwen3-VL-2B-Instruct

Image request:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-VL-2B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "What is in this image?"
>           },
>           {
>             "type": "image_url",
>             "image_url": {
>               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 64,
>     "temperature": 0.0,
>     "stream": false
>   }'

Video request:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-VL-2B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "Describe the video in detail"
>           },
>           {
>             "type": "video_url",
>             "video_url": {
>               "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 64,
>     "stream": false
>   }' | jq
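
The image and video requests above differ only in the content part type and its matching key. The following Python sketch builds either request body programmatically; `media_part` and `chat_payload` are hypothetical helper names introduced here for illustration, and the result can be POSTed to /v1/chat/completions with any HTTP client.

```python
import json


def media_part(kind: str, url: str) -> dict:
    """Build one media content part; kind is "image_url" or "video_url"."""
    if kind not in ("image_url", "video_url"):
        raise ValueError(f"unsupported part type: {kind}")
    # The part type doubles as the key that holds the URL object.
    return {"type": kind, kind: {"url": url}}


def chat_payload(model: str, prompt: str, part: dict, max_tokens: int = 64) -> dict:
    """Assemble a chat completion body with one text part and one media part."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}, part]}
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }


payload = chat_payload(
    "Qwen/Qwen3-VL-2B-Instruct",
    "What is in this image?",
    media_part("image_url", "http://images.cocodataset.org/test2017/000000155781.jpg"),
)
print(json.dumps(payload, indent=2))
```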

E/PD Serving (Encode + PD)

Use disagg_multimodal_e_pd.sh when you want a separate encode worker and a combined prefill/decode worker. This path is primarily useful for image-centric workloads and embedding-cache experiments.

When a separate encode worker is deployed with the current vLLM path, only image_url inputs are routed to it. video_url inputs are still processed on the combined PD worker.

$ cd $DYNAMO_HOME/examples/backends/vllm

# Multi-GPU deployment
$ bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct

# Single-GPU (functional testing with small models)
$ bash launch/disagg_multimodal_e_pd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu

E/P/D Serving (Full Disaggregation)

Use disagg_multimodal_epd.sh when you want separate encode, prefill, and decode workers for multimodal workloads.

In the current vLLM implementation, the separate encode worker is only used for image_url inputs. video_url inputs are still processed on the prefill worker, not on the encode worker.

$ cd $DYNAMO_HOME/examples/backends/vllm

# Multi-GPU deployment
$ bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct

# Single-GPU (functional testing with small models)
$ bash launch/disagg_multimodal_epd.sh --model Qwen/Qwen3-VL-2B-Instruct --single-gpu
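
The routing rule stated for the E/PD and E/P/D patterns can be summarized as a small lookup. This is only an illustrative restatement of the prose in Python: the function `media_handler` and the pattern labels "e_pd" and "epd" are hypothetical names chosen here, not Dynamo APIs.

```python
def media_handler(part_type: str, pattern: str) -> str:
    """Return which worker processes a media part in a disaggregated setup.

    pattern is "e_pd" (encode + combined prefill/decode) or
    "epd" (separate encode, prefill, and decode workers).
    """
    if pattern not in ("e_pd", "epd"):
        raise ValueError(f"unknown pattern: {pattern}")
    if part_type == "image_url":
        # Only image_url inputs are routed to the separate encode worker.
        return "encode"
    if part_type == "video_url":
        # Video stays on the PD worker (E/PD) or the prefill worker (E/P/D).
        return "pd" if pattern == "e_pd" else "prefill"
    raise ValueError(f"no disaggregated handling described for: {part_type}")


for pattern in ("e_pd", "epd"):
    for part_type in ("image_url", "video_url"):
        print(pattern, part_type, "->", media_handler(part_type, pattern))
```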

Audio Serving

Dynamo supports audio_url requests for audio-capable models. The backend worker loads audio through vLLM's AudioMediaIO at the native sample rate; vLLM's model-specific processor then handles resampling and feature extraction internally. Omni models can handle image_url, video_url, and audio_url in the same request.

Aggregated Serving

Use the same aggregated multimodal launcher with an audio-capable model:

$ pip install 'vllm[audio]'  # installs librosa and other audio dependencies
$ cd $DYNAMO_HOME/examples/backends/vllm

# GPU deployment
$ bash launch/agg_multimodal.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct

# XPU deployment
$ DYN_CHAT_PROCESSOR=vllm \
>   bash launch/xpu/agg_multimodal_xpu.sh --model Qwen/Qwen3-Omni-30B-A3B-Instruct

Audio request:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "What sound is this?"
>           },
>           {
>             "type": "audio_url",
>             "audio_url": {
>               "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 100,
>     "stream": false
>   }' | jq
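
Because Omni models accept image_url, video_url, and audio_url parts in one request, a single message can carry several media items. The sketch below assembles such a content list in Python; `omni_content` is a hypothetical helper, the prompt text is made up, and the part layout simply mirrors the single-modality requests in this guide.

```python
ALLOWED_PARTS = ("image_url", "video_url", "audio_url")


def omni_content(prompt: str, media: list) -> list:
    """Build a content list: one text part followed by each (kind, url) pair."""
    content = [{"type": "text", "text": prompt}]
    for kind, url in media:
        if kind not in ALLOWED_PARTS:
            raise ValueError(f"unsupported part type: {kind}")
        content.append({"type": kind, kind: {"url": url}})
    return content


body = {
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": omni_content(
                "Describe the image and the sound.",
                [
                    ("image_url", "http://images.cocodataset.org/test2017/000000155781.jpg"),
                    ("audio_url", "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"),
                ],
            ),
        }
    ],
    "max_tokens": 100,
    "stream": False,
}
print([part["type"] for part in body["messages"][0]["content"]])
# ['text', 'image_url', 'audio_url']
```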

Embedding Cache

Dynamo supports an embedding cache in both aggregated and disaggregated settings:

| Setting               | Implementation                                                     | Launch Script |
|-----------------------|--------------------------------------------------------------------|---------------|
| Aggregated            | Supported via vLLM ECConnector in vLLM 0.17+                       | agg_multimodal.sh (or with vllm serve directly) |
| Disaggregated encoder | Dynamo-managed cache in the worker layer on top of the vLLM engine | disagg_multimodal_e_pd.sh |

Aggregated Worker

A single vLLM instance caches encoded embeddings on the CPU so that repeated images skip encoding entirely. This is supported natively in vLLM 0.17+.

Launch with Dynamo:

$ bash examples/backends/vllm/launch/agg_multimodal.sh \
>   --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
>   --multimodal-embedding-cache-capacity-gb 10

dynamo.vllm automatically configures ec_both mode with the DynamoMultimodalEmbeddingCacheConnector when the capacity is greater than 0.

Launch with vllm serve (standalone, no Dynamo):

$ vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
>   --ec-transfer-config "{
>     \"ec_role\": \"ec_both\",
>     \"ec_connector\": \"DynamoMultimodalEmbeddingCacheConnector\",
>     \"ec_connector_module_path\": \"dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector\",
>     \"ec_connector_extra_config\": {\"multimodal_embedding_cache_capacity_gb\": 10}
>   }"

The multimodal_embedding_cache_capacity_gb parameter controls the CPU-side LRU cache size in GB (0 = disabled). Requires vLLM 0.17+.
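
Hand-escaping the JSON passed to --ec-transfer-config is easy to get wrong. As one possible alternative, this Python sketch generates the same JSON and a shell-quoted command line; the keys and values are taken from the standalone vllm serve example in this section, while the helper name `ec_transfer_config` is an assumption of the sketch.

```python
import json
import shlex


def ec_transfer_config(capacity_gb: float) -> str:
    """Return the --ec-transfer-config JSON for the embedding cache connector."""
    config = {
        "ec_role": "ec_both",
        "ec_connector": "DynamoMultimodalEmbeddingCacheConnector",
        "ec_connector_module_path": (
            "dynamo.vllm.multimodal_utils.multimodal_embedding_cache_connector"
        ),
        "ec_connector_extra_config": {
            "multimodal_embedding_cache_capacity_gb": capacity_gb
        },
    }
    return json.dumps(config)


args = [
    "vllm",
    "serve",
    "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
    "--ec-transfer-config",
    ec_transfer_config(10),
]
# shlex.join quotes the JSON argument so it survives shell parsing intact.
print(shlex.join(args))
```

The printed line can be pasted into a shell, or `args` can be passed directly to `subprocess.run` without any quoting.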

Disaggregated Encoder (Embedding Cache in Prefill Worker)

In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (EmbeddingCacheManager). On each request, P checks the cache first. On a hit, the Encode Worker is skipped entirely. On a miss, P routes the request to the Encode Worker (E), receives the embeddings via NIXL, saves them to the cache, and then feeds them along with the request into the vLLM instance for prefill.
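
The hit/miss flow described above can be sketched with a small byte-budgeted LRU cache. This is a generic illustration, not Dynamo's EmbeddingCacheManager: the class `LruEmbeddingCache`, the function `get_embedding`, the use of plain bytes for embeddings, and the stand-in encoder are all assumptions, and the NIXL transfer is replaced by a direct function call.

```python
from collections import OrderedDict


class LruEmbeddingCache:
    """Byte-budgeted LRU cache (a stand-in, not Dynamo's implementation)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> embedding bytes, oldest first

    def get(self, key: str):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.capacity_bytes:
            return  # larger than the whole budget, do not cache
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = value
        self.used_bytes += len(value)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)  # drop the oldest
            self.used_bytes -= len(evicted)


def get_embedding(cache: LruEmbeddingCache, key: str, encode) -> bytes:
    """Return a cached embedding, calling the encoder only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached  # hit: the encode step is skipped
    embedding = encode(key)  # miss: ask the (stand-in) encode worker
    cache.put(key, embedding)
    return embedding


calls = []


def fake_encode(url: str) -> bytes:
    calls.append(url)
    return b"\x00" * 4  # pretend every embedding is 4 bytes


cache = LruEmbeddingCache(capacity_bytes=8)
get_embedding(cache, "a", fake_encode)  # miss
get_embedding(cache, "a", fake_encode)  # hit, no encode call
get_embedding(cache, "b", fake_encode)  # miss, cache is now full
get_embedding(cache, "c", fake_encode)  # miss, evicts "a"
print(calls)  # ['a', 'b', 'c']
```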

Launch:

$ cd $DYNAMO_HOME/examples/backends/vllm
$ bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10

Client: Use the same image_url request format shown in Aggregated Serving.

LoRA Adapters on Multimodal Workers

Multimodal workers support dynamic loading and unloading of LoRA adapters at runtime via the management API. This enables serving fine-tuned multimodal models alongside the base model.

Loading a LoRA Adapter

Load an adapter on a running multimodal worker via the load_lora endpoint:

# For components workers (URI-based, requires DYN_LORA_ENABLED=true)
$ curl -X POST http://<worker-host>:<port>/load_lora \
>   -H "Content-Type: application/json" \
>   -d '{
>     "lora_name": "my-vlm-adapter",
>     "source": {"uri": "s3://my-bucket/adapters/my-vlm-adapter"}
>   }'

# For example workers (path-based)
$ curl -X POST http://<worker-host>:<port>/load_lora \
>   -H "Content-Type: application/json" \
>   -d '{
>     "lora_name": "my-vlm-adapter",
>     "lora_path": "/path/to/adapter"
>   }'

Sending Requests with a LoRA

Set the model field in the request to the LoRA adapter name:

$ curl -X POST http://<frontend-host>:<port>/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "my-vlm-adapter",
>     "messages": [
>       {"role": "user", "content": [
>         {"type": "text", "text": "Describe this image"},
>         {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
>       ]}
>     ]
>   }'

Requests without a LoRA name (or with the base model name) will use the base model.

Unloading a LoRA Adapter

$ curl -X POST http://<worker-host>:<port>/unload_lora \
>   -H "Content-Type: application/json" \
>   -d '{"lora_name": "my-vlm-adapter"}'

Listing Loaded Adapters

$ curl -X POST http://<worker-host>:<port>/list_loras

Disaggregated Mode

In disaggregated (prefill/decode) deployments, the same LoRA adapter must be loaded on both the prefill and decode workers. The LoRA identity (model field) is automatically propagated from the prefill worker to the decode worker in the forwarded request.

# Load on prefill worker
$ curl -X POST http://<prefill-worker>/load_lora \
>   -H "Content-Type: application/json" \
>   -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'

# Load on decode worker (same adapter)
$ curl -X POST http://<decode-worker>/load_lora \
>   -H "Content-Type: application/json" \
>   -d '{"lora_name": "my-adapter", "source": {"uri": "s3://bucket/adapter"}}'

If a LoRA is loaded on the prefill worker but not on the decode worker, the decode worker will fall back to the base model for that request.
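
Since a missing adapter on the decode worker silently falls back to the base model, it can help to issue the load call to both workers from one place. The Python sketch below builds identical load_lora requests for each worker using only the standard library; the worker addresses and the helper name `load_lora_request` are hypothetical, and the sending step is left commented out so the snippet runs without a live deployment.

```python
import json
import urllib.request


def load_lora_request(worker_url: str, lora_name: str, uri: str) -> urllib.request.Request:
    """Build a POST request for a worker's load_lora endpoint (URI-based form)."""
    body = json.dumps({"lora_name": lora_name, "source": {"uri": uri}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{worker_url.rstrip('/')}/load_lora",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical addresses; substitute the prefill and decode worker endpoints.
workers = ["http://prefill-worker:9090", "http://decode-worker:9091"]
reqs = [load_lora_request(w, "my-adapter", "s3://bucket/adapter") for w in workers]

for req in reqs:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to send against a live deployment
```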

Supported Models

For a list of multimodal models supported by vLLM, see vLLM Supported Multimodal Models. Models listed there should generally work with aggregated serving, though they may not all be explicitly tested in this repo.