SGLang Multimodal | NVIDIA Dynamo Documentation

This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal supports EPD, E/PD, and E/P/D flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes.

Support Matrix

Modality	Input Format	Aggregated	Disaggregated	Notes
Image	HTTP/HTTPS URL	Yes	Yes	Vision encoder generates embeddings
Image	Data URL (Base64)	No	No
Video	HTTP/HTTPS URL	No	No
Audio	HTTP/HTTPS URL	No	No

Supported URL Formats

Format	Example	Description
HTTP/HTTPS	`http://example.com/image.jpg`	Remote media files

Deployment Patterns

SGLang supports EPD, E/PD, and E/P/D patterns. See Multimodal Architecture Patterns for detailed explanations.

Pattern	Supported	Launch Script	Notes
EPD (Simple Aggregated)	✅	`agg.sh`	Internal encoding
E/PD (Encode Separate)	✅	`multimodal_epd.sh`	Vision encoder separate
E/P/D (Full Disaggregation)	✅	`multimodal_disagg.sh`	KV cache via bootstrap
EP/D (Traditional Disaggregated)	❌	N/A	Not supported

Component Flags

Component	Flag	Purpose
Processor	`--multimodal-processor`	HTTP entry, OpenAI→SGLang conversion
Encode Worker	`--multimodal-encode-worker`	Vision encoder, embeddings generation
PD Worker	`--multimodal-worker`	Prefill + Decode with embeddings
Decode Worker	`--multimodal-worker --serving-mode=decode`	Entry point for disaggregation
Prefill Worker	`--multimodal-worker --serving-mode=prefill`	Called by Decode, bootstrap coordination

SGLang-Specific Characteristics

Vision Encoder in Python: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
Token Expansion: Single <|image_pad|> token replaced with N tokens based on embedding shape
NIXL Transfer: Embeddings transferred from Encoder → PD Worker using NIXL
No Rust Processing: All tokenization and image handling happens in Python

Use the Latest Release

We recommend using the latest stable release of dynamo to avoid breaking changes:

You can find the latest release and check out the corresponding branch with:

$ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

EPD Serving (Simple Aggregated)

Components

worker: DecodeWorkerHandler handles encoding, prefilling, and decoding in a single process.

Workflow

The DecodeWorkerHandler receives multimodal requests with image URLs and passes them directly to SGLang’s engine. SGLang’s internal mm_data_processor handles image fetching, loading, encoding, and token expansion.

Launch

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl

Client:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen2.5-VL-7B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "Describe the image."
>           },
>           {
>             "type": "image_url",
>             "image_url": {
>               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 50,
>     "stream": false
>   }' | jq

E/PD Serving (Encode Separate)

Components

workers:
- MultimodalEncodeWorkerHandler for encoding
- MultimodalWorkerHandler for prefilling and decoding.
processor: MultimodalProcessorHandler
- tokenizes the prompt using the chat template
- passes the text and image url to the MultimodalEncodeWorker.

Workflow

The MultimodalEncodeWorker downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The MultimodalWorker then prefills and decodes the prompt in the same engine, as in the LLM aggregated serving example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS.

Launch

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/multimodal_epd.sh

Client:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen2.5-VL-7B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "Describe the image."
>           },
>           {
>             "type": "image_url",
>             "image_url": {
>               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 50,
>     "stream": false
>   }' | jq

E/P/D Serving (Full Disaggregation)

Components

workers:
- MultimodalEncodeWorkerHandler for encoding
- MultimodalWorkerHandler for decoding
- MultimodalPrefillWorkerHandler for prefilling
processor: MultimodalProcessorHandler tokenizes the prompt and passes it to the MultimodalEncodeWorker.

Workflow

In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.

Launch

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/multimodal_disagg.sh

Client:

$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen2.5-VL-7B-Instruct",
>     "messages": [
>       {
>         "role": "user",
>         "content": [
>           {
>             "type": "text",
>             "text": "Describe the image."
>           },
>           {
>             "type": "image_url",
>             "image_url": {
>               "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
>             }
>           }
>         ]
>       }
>     ],
>     "max_tokens": 50,
>     "stream": false
>   }' | jq

Bootstrap Coordination

SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

Request Flow (Important)

Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                               ↑
                                    Entry point for disaggregation!

Bootstrap Process

Decode Worker receives request from Encode Worker
Decode Worker calls Prefill Worker via NATS to request bootstrap info
Prefill Worker generates {host, port, room} and returns immediately
Both workers connect to same “room” using bootstrap coordinates
SGLang internally transfers KV cache state via bootstrap connection (not NIXL)

Key Difference from vLLM

vLLM: Frontend → Prefill → Decode (Prefill is entry point)
SGLang: Frontend → Processor → Encode → Decode → Prefill (Decode is entry point)

Inter-Component Communication

Control Flow (NATS)

All component-to-component communication happens via NATS:

E/PD Mode (Encode Separate)

Processor → Encode Worker → PD Worker
  (NATS)        (NATS + NIXL embeddings)

E/P/D Mode (Full Disaggregation)

Processor → Encode Worker → DECODE Worker → Prefill Worker
  (NATS)        (NATS)            (NATS)
                             ↓
                    Decode requests bootstrap
                             ↓
                    Prefill returns {host, port, room}
                             ↓
                    Both connect via bootstrap
                             ↓
                    SGLang internal KV cache transfer

Detailed Message Flow

Processor → Encode Worker:
  - NATS round_robin with SglangMultimodalRequest
  - Contains: tokenized input_ids, image URL, sampling params
Encode Worker → Decode/PD Worker:
  - NATS round_robin to "backend" component
  - Contains: expanded token_ids, NIXL metadata, embeddings shape
  - NIXL transfer: embeddings tensor
Decode Worker → Prefill Worker (disagg only):
  - NATS call to "prefill" component
  - Decode requests bootstrap coordinates
  - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}
Prefill ↔ Decode (via bootstrap):
  - SGLang internal connection (not NATS)
  - KV cache state shared via bootstrap mechanism

Data Transfer (NIXL)

NIXL is used only for embedding transfer:

1 # Encode Worker
2 descriptor = connect.Descriptor(precomputed_embeddings)
3 with connector.create_readable(descriptor) as readable:
4     request.serialized_request = readable.metadata()
5     await pd_worker_client.round_robin(request)
6     await readable.wait_for_completion()
7 
8 # PD Worker
9 embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
10 descriptor = connect.Descriptor(embeddings)
11 read_op = await connector.begin_read(request.serialized_request, descriptor)
12 await read_op.wait_for_completion()

Vision Encoding Details

Encode Worker Components

The encode worker loads and runs the vision model in Python:

1 self.image_processor = AutoImageProcessor.from_pretrained(
2     model_path, trust_remote_code=True
3 )
4 self.vision_model = AutoModel.from_pretrained(
5     model_path,
6     device_map="auto",
7     torch_dtype=torch.float16,
8     trust_remote_code=True
9 )

Token Expansion Process

Processor inserts single image token (e.g., <|image_pad|>)
Encode worker generates embeddings: shape = (batch, num_patches, hidden_dim)
Encode worker replaces single token with num_patches tokens
Downstream worker receives expanded token sequence

Example:

1 # Before: ["Hello", "<|image_pad|>", "world"]
2 # After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]

Chat Template Processing

SGLang uses its own chat template system:

1 from sglang.srt.parser.conversation import chat_templates
2 
3 conv = chat_templates["qwen2-vl"].copy()
4 conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
5 processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")

Supported templates: qwen2-vl, llama-3, vicuna, etc.

NIXL Usage

Use Case	NIXL Used?	Data Transfer	Notes
EPD (Simple Aggregated)	No	N/A	All processing internal to SGLang
E/PD (Encode Separate)	Yes	Encoder → PD (embeddings)	Vision encoder separate
E/P/D (Full Disaggregation)	Yes	Encoder → Prefill (embeddings)	KV cache via SGLang bootstrap

Key Difference: SGLang P/D uses bootstrap mechanism, not NIXL for KV cache like vLLM.

Known Limitations

No Data URL support - Only HTTP/HTTPS URLs supported; data:image/... base64 URLs not supported
No pre-computed embeddings - Cannot use .pt, .pth, .bin embedding files; vision encoder runs for every request
No video support - No video encoder implementation
No audio support - No audio encoder implementation
Only Processor registers with Dynamo - Workers are internal components, frontend routes to Processor only
Disaggregated routing - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers
Limited model generalization - Token expansion logic is model-specific; adding new models may require implementation updates

Supported Models

SGLang multimodal only supports image-based vision-language models:

Qwen2-VL / Qwen2.5-VL (primary support)
Models with AutoImageProcessor and vision tower
Models compatible with SGLang’s image embedding format

Key Files

File	Description
`components/src/dynamo/sglang/main.py`	Component initialization, only Processor registers
`components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py`	Processor implementation, OpenAI→SGLang
`components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py`	Vision encoder, embeddings generation
`components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py`	PD/Prefill/Decode workers, NIXL read
`components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py`	Chat template processing
`components/src/dynamo/sglang/protocol.py`	Request/response data structures
`components/src/dynamo/sglang/register.py`	Registration logic (only called for Processor)