Encoder Disaggregation

Separate vision encoding into a dedicated worker for independent scaling
Overview

Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA).

This enables:

  • Independent scaling of encode workers based on vision workload
  • Reduced GPU memory pressure on prefill/decode workers
  • Better GPU utilization by matching worker counts to actual bottlenecks

When to Use

Use encoder disaggregation when:

  • Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers
  • You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference)
  • Your deployment handles high volumes of multimodal requests and encoding throughput is limiting

For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up.

Support Matrix

| Backend | E/PD | E/P/D | Notes |
|---------|------|-------|-------|
| vLLM | ✅ | ✅ | NIXL transfer for embeddings; NIXL KV cache transfer for P/D |
| TRT-LLM | ✅ | ✅ | Supports image URLs (via MultimodalEncoder) and pre-computed embeddings (via NIXL) |
| SGLang | ✅ | ✅ | NIXL for embeddings; bootstrap mechanism for P/D KV transfer |

Deployment Patterns

E/PD — Separate encoder, combined prefill+decode:

Frontend → Processor → Encode Worker —(NIXL)→ PD Worker → Response

The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker.

E/P/D — All stages separate:

Frontend → Processor → Encode Worker —(NIXL)→ Prefill Worker —(KV transfer)→ Decode Worker → Response

Full disaggregation with separate workers for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers KV cache to the decode worker.
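From the client's perspective, both patterns are exercised the same way: an OpenAI-compatible multimodal chat request sent to the frontend, which routes the image through the encode worker. A minimal sketch of such a request is below; the port (8000), endpoint path, and image URL are assumptions, not values confirmed by this page, so adjust them to your deployment.

```shell
# Hypothetical multimodal request payload for a disaggregated deployment.
# The model name matches the vLLM launch example; the image URL is a placeholder.
cat > /tmp/mm_request.json <<'EOF'
{
  "model": "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "http://example.com/cat.jpg"}}
      ]
    }
  ],
  "max_tokens": 64
}
EOF

# Validate the payload before sending (fails fast on malformed JSON).
python3 -m json.tool /tmp/mm_request.json > /dev/null && echo "payload ok"

# With the frontend running (port assumed), send it:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @/tmp/mm_request.json
```

The image portion of the request is what flows through the encode worker; the resulting embeddings travel over NIXL rather than being recomputed by the prefill worker.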

Launching

vLLM

cd $DYNAMO_HOME/examples/backends/vllm

# E/PD
bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"

# E/P/D
bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"

TRT-LLM

cd $DYNAMO_HOME/examples/backends/trtllm

# E/PD
bash launch/disagg_e_pd.sh

# E/P/D
./launch/epd_multimodal_image_and_embeddings.sh

SGLang

cd $DYNAMO_HOME/examples/backends/sglang

# E/PD
./launch/multimodal_epd.sh

# E/P/D
./launch/multimodal_disagg.sh
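After a launch script starts the stack, it is useful to confirm the frontend is reachable before sending multimodal traffic. The sketch below is a generic health probe, not part of the launch scripts; the base URL and the `/v1/models` endpoint are assumptions based on the OpenAI-compatible frontend, so substitute your deployment's actual address.

```shell
# Hypothetical health check for a deployed frontend (defaults to localhost:8000).
check_frontend() {
  local base_url="${1:-http://localhost:8000}"
  if curl -sf "${base_url}/v1/models" > /dev/null; then
    echo "frontend up at ${base_url}"
  else
    echo "frontend not reachable at ${base_url}"
  fi
}

check_frontend
```

If the probe fails, check that the encode, prefill/decode, and frontend processes from the launch script are all still running before retrying.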

See the backend-specific documentation (vLLM, TRT-LLM, SGLang) for full configuration details and component flags.