Feature Benchmarks

Start with the feature or performance claim you want to inspect.

Feature Benchmarks evaluate Dynamo features, topologies, and feature stacks under controlled traffic. Each page states the question, compares deployable configurations, shows how to reproduce the run, and links to the Recipe target when one deployment should be used directly.

Features Under Test

Serving techniques and topology changes benchmarked across the comparisons below.

KV routing

Routes traffic to workers that already hold reusable KV cache, so time to first token (TTFT), inter-token latency (ITL), and goodput can improve on prefix-heavy workloads.
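
The routing idea can be sketched in a few lines. This is an illustration only, not Dynamo's router: it assumes each worker advertises the prefix-chained block hashes it holds, and the router prefers the worker with the longest matching prefix, breaking ties by lower load. `Worker`, `block_hashes`, `pick_worker`, and `BLOCK_SIZE` are hypothetical names chosen for this sketch.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real block sizes are larger)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with everything before it (prefix-chained)."""
    hashes = []
    running = hashlib.sha256()
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for i in range(0, full_len, BLOCK_SIZE):
        running.update(repr(tokens[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


@dataclass
class Worker:
    name: str
    cached: set[str] = field(default_factory=set)  # block hashes held by this worker
    active_requests: int = 0


def prefix_overlap(worker: Worker, hashes: list[str]) -> int:
    """Count how many leading blocks of the request the worker already caches."""
    matched = 0
    for h in hashes:
        if h not in worker.cached:
            break
        matched += 1
    return matched


def pick_worker(workers: list[Worker], tokens: list[int]) -> Worker:
    """Prefer the longest cached prefix; break ties by lower load."""
    hashes = block_hashes(tokens)
    return max(workers, key=lambda w: (prefix_overlap(w, hashes), -w.active_requests))
```

A production router would also weigh queue depth, cache eviction, and stale cache state; this sketch only shows why prefix-heavy traffic benefits from cache-aware placement.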

Prefill/decode split

Separates prompt prefill and token decode into specialized worker pools for long-context latency and throughput tests.

WideEP

Spreads MoE experts across a wider GPU set so expert-heavy requests get more parallel capacity.

Embedding cache

Reuses multimodal embeddings, especially repeated images, instead of recomputing them for every request.
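
As a rough sketch of the mechanism (not the vLLM or Dynamo implementation), an embedding cache can key encoder output by a content hash of the image and evict the least recently used entry when full. `EmbeddingCache`, its `encode` callback, and the `capacity` default are hypothetical names for this example.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """LRU cache keyed by image content hash (illustrative sketch)."""

    def __init__(self, encode: Callable[[bytes], list[float]], capacity: int = 128):
        self.encode = encode          # stand-in for the expensive vision encoder
        self.capacity = capacity
        self.entries = OrderedDict()  # hash -> embedding, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> list[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        embedding = self.encode(image_bytes)
        self.entries[key] = embedding
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return embedding
```

With repeated images, only the first request for each distinct image pays the encoder cost; later requests are served from the cache as long as the entry has not been evicted.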

Speculative decoding

Drafts candidate tokens and verifies them with the target model; Eagle3 is the speculative path used here.
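
The draft-and-verify loop can be illustrated with a generic greedy variant. This is not Eagle3 and not a Dynamo API: `speculative_step`, `target`, and `draft` are hypothetical stand-ins, with each model reduced to a function that returns its greedy next token for a given context.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]  # greedy next-token function for a context


def speculative_step(target: NextToken, draft: NextToken,
                     context: list[int], k: int = 4) -> list[int]:
    """Return the tokens accepted in one draft-and-verify round (greedy variant)."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: keep drafted tokens while they match the target's own choice.
    #    A real engine scores all k positions in one batched forward pass;
    #    this loop calls the target once per position for clarity.
    accepted = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # the first mismatch is replaced by the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted
```

Each round emits between 1 and k + 1 tokens, and the output matches what greedy decoding with the target model alone would produce. The speedup depends on how often the draft agrees with the target.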

KV offload

Moves colder KV blocks to a host-memory tier so longer context can fit without keeping all KV on GPU.
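
A minimal two-tier sketch of the idea, assuming least-recently-used eviction from the GPU tier and an unbounded host tier. `TieredKVCache` and its methods are hypothetical, and the "GPU" and "host" tiers here are plain Python containers rather than real device memory.

```python
from collections import OrderedDict


class TieredKVCache:
    """GPU tier with LRU eviction that spills cold blocks to a host tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> KV data, least recently used first
        self.host = {}            # colder blocks offloaded from the GPU tier

    def put(self, block_id, kv):
        self.host.pop(block_id, None)  # a block lives in exactly one tier
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_kv = self.gpu.popitem(last=False)
            self.host[cold_id] = cold_kv  # offload instead of discarding

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)  # mark as most recently used
            return self.gpu[block_id]
        if block_id in self.host:
            kv = self.host[block_id]
            self.put(block_id, kv)  # promote back to the GPU tier
            return kv
        return None  # not cached: the block would need to be recomputed
```

The trade-off being benchmarked is that a host-tier hit costs a transfer back to the GPU, which is typically cheaper than recomputing the block but slower than a GPU-tier hit.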

Frontend decoding

Moves decode coordination into the Dynamo frontend so routing and cache policy can act before backend execution.

Multi-node topology

Runs serving workers across node boundaries to compare aggregate, single-node P/D, and multi-node P/D shapes.

Feature composition

Agentic coding throughput stack

How much is gained when KV routing, speculative decoding, P/D split, and KV offload are composed?

Features: KV routing, Spec decoding, P/D split, KV offload
Model: Kimi-K2.5 NVFP4
Traffic: Agentic coding trace
Hardware: 24x GB200

Feature composition

Frontend decoding plus embedding cache

How do Dynamo frontend decoding and embedding cache change a single-GPU multimodal benchmark versus vanilla vLLM serve?

Features: Frontend decoding, Embedding cache
Model: Qwen3.6-35B-A3B FP8
Traffic: Multimodal sliding window
Hardware: 1x H100 or GB200

A/B test

Multimodal embedding cache

How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker?

Features: Embedding cache
Model: Qwen3-VL-30B
Traffic: 80% image reuse
Hardware: 1x GB200

A/B test

KV-aware routing + WideEP + P/D split

Does disaggregated KV-aware routing with WideEP improve latency and goodput against a GB200 control?

Features: KV routing, WideEP, P/D split
Model: DeepSeek V3.2 NVFP4
Traffic: Long-context coding trace
Hardware: 32x GB200

A/B test

KV-aware routing + prefill/decode split

Does disaggregated KV-aware routing reduce TTFT and ITL compared with aggregated round-robin routing?

Features: KV routing, P/D split
Model: Qwen3-32B
Traffic: Mooncake prefix reuse
Hardware: 16x H200

Topology benchmark

Aggregate vs single-node P/D vs multi-node P/D

How do vLLM topologies compare when normalized by GPU?

Features: P/D split, Multi-node
Model: Llama-3.3-70B FP8
Traffic: 8K ISL / 1K OSL
Hardware: 4 to 16 GPUs
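
For the topology comparison, "normalized by GPU" can be read as dividing aggregate throughput by the number of GPUs in the deployment, so that a 4-GPU and a 16-GPU shape are compared on equal footing. The sketch below assumes output tokens per second as the metric; the function name and the numbers in the usage lines are placeholders, not benchmark results.

```python
def per_gpu_throughput(output_tokens_per_s: float, num_gpus: int) -> float:
    """Normalize aggregate output throughput by GPU count (tokens/s/GPU)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return output_tokens_per_s / num_gpus


# Placeholder inputs for illustration only; these are not measured values.
small = per_gpu_throughput(output_tokens_per_s=4000.0, num_gpus=4)
large = per_gpu_throughput(output_tokens_per_s=20000.0, num_gpus=16)
print(f"4 GPUs: {small:.0f} tok/s/GPU, 16 GPUs: {large:.0f} tok/s/GPU")
```

A larger deployment is only more efficient when its per-GPU figure is higher at comparable latency, which is why raw aggregate throughput alone can mislead.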