Feature Benchmarks | NVIDIA Dynamo Documentation

Feature Benchmarks evaluate Dynamo features, topologies, and feature stacks under controlled traffic. Each page states the question, compares deployable configurations, shows how to reproduce the run, and, where one of the compared deployments is meant to be used directly, links to its Recipe target.

Features Under Test

Serving techniques and topology changes benchmarked across the comparisons below.

KV routing

Routes traffic to workers with reusable KV cache so TTFT, ITL, and goodput can improve on prefix-heavy workloads.
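The routing idea can be sketched as a score that rewards cached-prefix overlap and penalizes load. This is a minimal illustration, not Dynamo's router: the block size, the `load_weight` parameter, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Hash each full block chained with the blocks before it, so equal hashes imply equal prefixes."""
    hashes, prev = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


def cached_prefix_blocks(request_hashes: List[int], cached: Set[int]) -> int:
    """Count the leading request blocks that a worker already holds."""
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


def pick_worker(tokens: Sequence[int],
                worker_caches: Dict[str, Set[int]],
                worker_load: Dict[str, int],
                load_weight: float = 0.5) -> str:
    """Pick the worker with the best (prefix overlap - weighted load) score."""
    req = block_hashes(tokens)

    def score(worker: str) -> float:
        return cached_prefix_blocks(req, worker_caches[worker]) - load_weight * worker_load[worker]

    # Sorting the names makes tie-breaking deterministic.
    return max(sorted(worker_caches), key=score)
```

A worker that already holds the first two blocks of a prompt wins the request unless its load is high enough to outweigh the reuse, which is the trade-off a prefix-heavy workload exercises.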

Prefill/decode split

Separates prompt prefill and token decode into specialized worker pools for long-context latency and throughput tests.

WideEP

Spreads MoE experts across a wider GPU set so expert-heavy requests get more parallel capacity.
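To make the effect of a wider expert-parallel group concrete, the sketch below shards a fixed expert count over different GPU counts. The contiguous placement and the function names are assumptions for illustration; real deployments may place or replicate experts differently.

```python
from typing import Dict, List


def place_experts(num_experts: int, num_gpus: int) -> Dict[int, List[int]]:
    """Assign experts to GPUs in contiguous, near-equal shards (illustrative placement only)."""
    base, extra = divmod(num_experts, num_gpus)
    placement: Dict[int, List[int]] = {}
    next_expert = 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)
        placement[gpu] = list(range(next_expert, next_expert + count))
        next_expert += count
    return placement


def max_experts_per_gpu(num_experts: int, num_gpus: int) -> int:
    """Largest shard size, a rough proxy for per-GPU expert weight footprint."""
    return max(len(shard) for shard in place_experts(num_experts, num_gpus).values())
```

With a hypothetical 256 experts, going from 8 to 32 GPUs cuts the largest shard from 32 experts to 8, which is the extra parallel capacity the description refers to.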

Embedding cache

Reuses multimodal embeddings, especially repeated images, instead of recomputing them for every request.
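A content-addressed LRU cache is one simple way to picture this reuse. The sketch below is not the vLLM or Dynamo implementation; `EmbeddingCache`, its capacity, and the stand-in `fake_encode` function are hypothetical.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


def fake_encode(image_bytes: bytes) -> List[int]:
    """Stand-in for an expensive vision encoder (hypothetical)."""
    return [len(image_bytes), sum(image_bytes) % 251]


class EmbeddingCache:
    """LRU cache keyed by a hash of the raw image bytes."""

    def __init__(self, capacity: int, encode: Callable[[bytes], List[int]] = fake_encode):
        self.capacity = capacity
        self.encode = encode
        self._store: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[int]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self.encode(image_bytes)
        self._store[key] = embedding
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding
```

Repeated images hit the cache and skip the encoder call; only the first sighting of each image, or a sighting after eviction, pays the encoding cost.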

Speculative decoding

Drafts candidate tokens and verifies them with the target model; Eagle3 is the speculative path used here.
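The draft-and-verify loop can be illustrated with a generic greedy variant. This sketch does not reproduce Eagle3; the toy models and the `speculative_step` function are hypothetical, and a real system would verify all drafted positions in one batched target pass rather than one call per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: always emits (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


def speculative_step(context: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(token)
    accepted.append(target(context + accepted))  # all drafts accepted: one bonus token
    return accepted
```

Under greedy verification the accepted tokens match what the target model would have produced on its own, so the gain comes from emitting several tokens per target step rather than from changing the output.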

KV offload

Moves colder KV blocks to a host-memory tier so longer context can fit without keeping all KV on GPU.
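A two-tier LRU structure gives a rough picture of the offload behavior. The sketch below assumes least-recently-used demotion and promotion on access; the class name, capacities, and policy are hypothetical rather than Dynamo's actual offload logic.

```python
from collections import OrderedDict
from typing import Any, Optional


class TieredKVCache:
    """GPU tier backed by a host-memory tier; cold GPU blocks are demoted, not dropped."""

    def __init__(self, gpu_blocks: int, host_blocks: int):
        self.gpu_capacity = gpu_blocks
        self.host_capacity = host_blocks
        self.gpu: "OrderedDict[str, Any]" = OrderedDict()
        self.host: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, block_id: str, data: Any) -> None:
        self.host.pop(block_id, None)  # a block lives in exactly one tier
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        self._demote_overflow()

    def get(self, block_id: str) -> Optional[Any]:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:
            data = self.host.pop(block_id)
            self.put(block_id, data)  # promote back to the GPU tier on access
            return data
        return None  # the block must be recomputed

    def _demote_overflow(self) -> None:
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold = self.gpu.popitem(last=False)
            self.host[cold_id] = cold
            if len(self.host) > self.host_capacity:
                self.host.popitem(last=False)  # host tier full: drop its coldest block
```

The host tier extends the amount of context that can be served without recomputation, at the cost of a transfer when a demoted block is needed again.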

Frontend decoding

Moves decode coordination into the Dynamo frontend so routing and cache policy can act before backend execution.

Multi-node topology

Runs serving workers across node boundaries to compare aggregate, single-node P/D, and multi-node P/D shapes.

Feature composition

Agentic coding throughput stack

How much do KV routing, speculative decoding, P/D split, and KV offload gain when composed?

KV routing, Spec decoding, P/D split, KV offload

Kimi-K2.5 NVFP4

Traffic: Agentic coding trace

Hardware: 24x GB200

Feature composition

Frontend decoding plus embedding cache

How do Dynamo frontend decoding and embedding cache change a single-GPU multimodal benchmark versus vanilla vLLM serve?

Frontend decoding, Embedding cache

Qwen3.6-35B-A3B FP8

Traffic: Multimodal sliding window

Hardware: 1x H100 or GB200

A/B test

Multimodal embedding cache

How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker?

Embedding cache

Qwen3-VL-30B

Traffic: 80% image reuse

Hardware: 1x GB200

A/B test

KV-aware routing + WideEP + P/D split

Does disaggregated KV-aware routing with WideEP improve latency and goodput against a GB200 control?

KV routing, WideEP, P/D split

DeepSeek V3.2 NVFP4

Traffic: Long-context coding trace

Hardware: 32x GB200

A/B test

KV-aware routing + prefill/decode split

Does disaggregated KV-aware routing reduce TTFT and ITL compared with aggregated round-robin routing?

KV routing, P/D split

Qwen3-32B

Traffic: Mooncake prefix reuse

Hardware: 16x H200
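For readers comparing these latency metrics, the sketch below computes TTFT, mean ITL, and goodput from per-request timestamps. It assumes one common definition of goodput (requests per second that meet both latency targets); the function names and SLO parameters are hypothetical, and the benchmark page itself may define the metrics differently.

```python
from typing import List, Sequence, Tuple

# A request is (send_time_s, [arrival time of each output token in seconds]).
Request = Tuple[float, List[float]]


def ttft(send_time: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay between sending the request and the first output token."""
    return token_times[0] - send_time


def mean_itl(token_times: Sequence[float]) -> float:
    """Mean inter-token latency: average gap between consecutive output tokens."""
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def goodput(requests: Sequence[Request], window_s: float, ttft_slo: float, itl_slo: float) -> float:
    """Requests per second that satisfy both the TTFT and the ITL target."""
    within_slo = sum(
        1
        for send_time, token_times in requests
        if ttft(send_time, token_times) <= ttft_slo and mean_itl(token_times) <= itl_slo
    )
    return within_slo / window_s
```

Unlike raw throughput, this goodput figure only credits requests that met both targets, so a configuration that is fast on average but misses the targets under load scores lower.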

Topology benchmark

Aggregate vs single-node P/D vs multi-node P/D

How do vLLM topologies compare when results are normalized per GPU?

P/D split, Multi-node

Llama-3.3-70B FP8

Traffic: 8K ISL / 1K OSL

Hardware: 4-16 GPUs
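Per-GPU normalization divides aggregate throughput by the number of GPUs in the deployment, so topologies of different sizes can be compared on the same scale. The sketch below assumes output tokens per second as the throughput metric; the `Run` type and any numbers passed to it are illustrative placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run (hypothetical record shape)."""
    name: str
    gpus: int
    output_tokens: int
    duration_s: float

    @property
    def tokens_per_s(self) -> float:
        """Aggregate output throughput across the whole deployment."""
        return self.output_tokens / self.duration_s

    @property
    def tokens_per_s_per_gpu(self) -> float:
        """Throughput normalized by GPU count."""
        return self.tokens_per_s / self.gpus


def best_per_gpu(runs: list) -> Run:
    """Return the run with the highest per-GPU throughput."""
    return max(runs, key=lambda run: run.tokens_per_s_per_gpu)
```

A 16-GPU deployment can post the highest aggregate throughput and still lose on the per-GPU figure, which is why the comparison is normalized.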
