Qwen3.6 Frontend and Cache Benchmark
How do Dynamo frontend-decoding and the embedding cache change a single-GPU multimodal sliding-window benchmark versus vanilla vLLM serve?
Three configurations run on the same single-GPU node (H100 or GB200, selected at deploy time), so the only thing that varies is the Dynamo feature set. Each turn shares 4-of-5 images with the previous turn of the same user, so repeated images dominate — exactly the shape the embedding cache is designed for. The full Dynamo stack delivers roughly +30% RPS and large TTFT reductions versus vanilla vllm serve (H100: +29.8% RPS / −66.4% TTFT avg; GB200: +31.5% / −43.3%).
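To make the "4-of-5 images shared" shape concrete, the sketch below counts how many image references would repeat within one user's session. It uses the 8 turns and window of 5 from the dataset description in the Reproduce section. The integer image ids, the function names, and the assumptions of a one-image slide per turn and a cold, unbounded cache are illustrative and not taken from the recipe. The result is an upper bound on reuse, not a measured cache hit rate.

```python
# Illustrative model of the sliding-window workload for a single user.
# TURNS and WINDOW come from the dataset description; everything else
# (integer image ids, cold unbounded cache) is an assumption for this sketch.
TURNS = 8    # turns per user
WINDOW = 5   # images per turn


def window_images(turn: int, window: int = WINDOW) -> list[int]:
    """Return hypothetical image ids for a turn, sliding by one image per turn."""
    return list(range(turn, turn + window))


def ideal_reuse_rate(turns: int = TURNS, window: int = WINDOW) -> float:
    """Fraction of image references already seen earlier in the same session."""
    seen: set[int] = set()
    hits = 0
    total = 0
    for turn in range(turns):
        for image in window_images(turn, window):
            total += 1
            if image in seen:
                hits += 1
            seen.add(image)
    return hits / total


if __name__ == "__main__":
    shared = set(window_images(0)) & set(window_images(1))
    print(f"images shared between consecutive turns: {len(shared)}")
    print(f"ideal per-user reuse rate: {ideal_reuse_rate():.0%}")
```

Under these assumptions, 28 of the 40 image references in a session are repeats, which is why skipping the vision encoder for cached images can matter so much for this workload.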
Benchmark setup
Results
Full result tables are reproduced below from the source study. The headline numbers are:
H100
Δ vs vllm-serve: dynamo-fd +12.8% RPS / −60.4% TTFT avg; dynamo-fd-ec +29.8% RPS / −66.4% TTFT avg (−87.4% TTFT p50).
GB200
Δ vs vllm-serve: dynamo-fd +18.8% RPS / −36.6% TTFT avg; dynamo-fd-ec +31.5% RPS / −43.3% TTFT avg (−45.9% TTFT p50).
Frontend-decoding alone captures most of the TTFT win; the embedding cache layers on additional throughput and tighter median TTFT. ITL stays roughly flat because the cache shortens the prefill path (skipping the vision encoder for repeated images), not decode.
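The split between the two features can be estimated from the headline deltas alone. The sketch below rebases the dynamo-fd-ec figures onto dynamo-fd instead of vllm-serve, which isolates what the embedding cache adds on top of frontend-decoding. The function name and dictionary layout are illustrative. Because the inputs are the rounded percentages quoted above, the outputs are approximate.

```python
def incremental_change(full_pct: float, step_pct: float) -> float:
    """Percent change of the full stack relative to the intermediate step.

    Both inputs are percent deltas measured against the same baseline.
    """
    return ((1 + full_pct / 100) / (1 + step_pct / 100) - 1) * 100


# (dynamo-fd, dynamo-fd-ec) deltas vs vllm-serve, copied from the headline numbers.
HEADLINE = {
    "H100": {"rps": (12.8, 29.8), "ttft_avg": (-60.4, -66.4)},
    "GB200": {"rps": (18.8, 31.5), "ttft_avg": (-36.6, -43.3)},
}

if __name__ == "__main__":
    for gpu, metrics in HEADLINE.items():
        for name, (fd, fd_ec) in metrics.items():
            delta = incremental_change(fd_ec, fd)
            print(f"{gpu} {name}: {delta:+.1f}% for dynamo-fd-ec vs dynamo-fd")
```

Rebased this way, the embedding cache contributes roughly +15% RPS on H100 and +11% on GB200 beyond frontend-decoding, while its additional average-TTFT reduction is much smaller than the frontend-decoding step's reduction versus the baseline.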
Compared Configurations
Reproduce
A dataset-generation job writes the sliding-window dataset (30u_8t_5w_8000word_base64.jsonl: 30 users x 8 turns, window 5, 2400x1080 base64 images, 8,000 text tokens, one session_id=user_<N> row per turn) to the shared PVC. The shared perf.yaml wraps this AIPerf command for every configuration:
The recipe is driven end-to-end by scripts that template the YAML via envsubst (the per-configuration pod name and frontend service are injected at apply time):
Each config’s profile_export_aiperf.json is retrieved locally and holds the headline metrics.
Notes
- AIPerf is installed from source pinned to a main SHA that includes PR 824, which makes single_turn mode honor session_id ordering. That causal ordering is what lets prefix-cache hits across a user’s turns actually land.
- The recipe expects a shared-model-cache RWX PVC in the namespace; Qwen/Qwen3.6-35B-A3B-FP8 is public, so no HuggingFace token is required.
- The vLLM command uses --mm-processor-cache-gb 30 and --max-model-len 32768 to handle the 5-image multimodal context.
- Source: recipes/qwen3.6-35b
Winning Configuration
The dynamo-fd-ec configuration wins this benchmark; its deploy assets are above, and a recommended Recipe may be promoted from it in a future release. The vllm-serve baseline and the dynamo-fd step exist as benchmark controls.