Multimodal KV Routing
Overview
Multimodal KV routing extends Dynamo’s KV-aware router to account for image content when computing cache overlap scores. A dedicated MM router worker sits between the frontend and backend workers. It downloads images, computes a hash of each image (mm_hash), and includes this hash in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.
Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.
Note: The KV cache is separate from the embedding cache (also called the encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse, see Embedding Cache.
When to Use
Use multimodal KV routing when:
- You have multiple backend workers serving multimodal requests
- Your workload includes repeated images across requests (e.g., the same product photo, shared reference images)
- You want to maximize KV cache hit rates for multimodal content
Without MM-aware routing, the standard router treats image token blocks as opaque and cannot tell which worker has cached a particular image’s KV blocks.
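To illustrate the problem, the sketch below hashes token blocks roughly the way a token-only router might. It is a toy under stated assumptions: `BLOCK_SIZE`, `IMAGE_PAD`, and `block_hashes` are hypothetical names, and the hashing scheme is not Dynamo's actual one. Two requests carrying different images expand to the same placeholder token IDs, so their block hashes are identical unless an image hash is mixed in.

```python
import hashlib

BLOCK_SIZE = 4       # hypothetical block size, in tokens
IMAGE_PAD = 151655   # hypothetical placeholder token ID for image positions


def block_hashes(tokens, mm_hash=None):
    """Hash each full block, chained to its prefix; optionally mix in an image hash."""
    hashes = []
    parent = b""
    for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        h = hashlib.sha256(parent)
        h.update(",".join(map(str, block)).encode())
        # Only blocks holding image placeholders depend directly on image content.
        if mm_hash is not None and IMAGE_PAD in block:
            h.update(mm_hash.encode())
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


# The same prompt layout is used for two different images.
prompt = [1, 2, 3, 4] + [IMAGE_PAD] * 4 + [5, 6, 7, 8]
cat = hashlib.sha256(b"cat image bytes").hexdigest()
dog = hashlib.sha256(b"dog image bytes").hexdigest()

# Token-only hashing sees one sequence, whichever image was attached.
token_only = block_hashes(prompt)

# Mixing in the image hash separates the two requests from the image block onward.
with_cat = block_hashes(prompt, cat)
with_dog = block_hashes(prompt, dog)
```

In this sketch the leading text block still matches across both requests, while the image block and every block after it differ, because each hash is chained to its prefix.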
Support Matrix
*Requires an upcoming version of vLLM that has not yet been released. Support will be available once the new vLLM release is published.
How It Works
- The frontend routes to the MM router worker via round-robin
- The MM router downloads each image and computes an mm_hash
- Per-block routing metadata (block_mm_infos) is built, tagging blocks that contain image tokens
- The KV router evaluates overlap across all backend workers, accounting for image-bearing blocks
- The request is forwarded to the worker with the highest overlap
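The steps above can be sketched as follows. This is a minimal illustration, not Dynamo's implementation: the `BlockMMInfo` dataclass and the functions `compute_mm_hash`, `build_block_mm_infos`, `block_keys`, and `pick_worker` are hypothetical names, SHA-256 stands in for whatever hash the MM router actually uses, and the image is passed in as bytes instead of being downloaded.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4  # hypothetical block size, in tokens


@dataclass(frozen=True)
class BlockMMInfo:
    """Hypothetical per-block metadata: which image (if any) a block touches."""
    block_index: int
    mm_hash: Optional[str]


def compute_mm_hash(image_bytes: bytes) -> str:
    # SHA-256 is an assumption; any stable content hash would serve here.
    return hashlib.sha256(image_bytes).hexdigest()


def build_block_mm_infos(num_tokens, image_start, image_len, mm_hash):
    """Tag every full block that overlaps the image token span."""
    image_end = image_start + image_len
    infos = []
    for idx in range(num_tokens // BLOCK_SIZE):
        lo, hi = idx * BLOCK_SIZE, (idx + 1) * BLOCK_SIZE
        overlaps = lo < image_end and image_start < hi
        infos.append(BlockMMInfo(idx, mm_hash if overlaps else None))
    return infos


def block_keys(tokens, infos):
    """Build one cache key per block from its tokens plus its image tag."""
    return [
        (tuple(tokens[i.block_index * BLOCK_SIZE:(i.block_index + 1) * BLOCK_SIZE]), i.mm_hash)
        for i in infos
    ]


def pick_worker(request_keys, worker_caches):
    """Return the worker whose cache shares the longest block prefix with the request."""
    def overlap(cached):
        count = 0
        for key in request_keys:
            if key not in cached:
                break
            count += 1
        return count
    return max(worker_caches, key=lambda w: overlap(worker_caches[w]))


# Token ID 9 stands in for an image placeholder token.
tokens = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8]
mm_hash = compute_mm_hash(b"product photo bytes")
infos = build_block_mm_infos(len(tokens), image_start=4, image_len=4, mm_hash=mm_hash)
keys = block_keys(tokens, infos)

worker_caches = {
    "worker-0": set(keys[:1]),  # holds only the text prefix
    "worker-1": set(keys),      # has served this image before
}
chosen = pick_worker(keys, worker_caches)
```

Here `worker-1` is chosen because it matches all three blocks, including the image-bearing one, while `worker-0` matches only the first. A real router would typically weigh overlap against load as well; the sketch considers overlap alone.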
On repeated requests with the same image, the selected worker shows higher cached block counts, reducing prefill latency.
Launching
See the vLLM MM Router README and TRT-LLM MM Router README for full setup instructions and configuration options.
Known Limitations
- Per-image visual token counting currently supports only Qwen-family multimodal processors (Qwen2-VL, Qwen2.5-VL, Qwen3-VL)
- Images are downloaded twice: once in the MM router (for hash computation) and once in the backend worker (for processing)