DeepSeek V3.2 WideEP Routing A/B | NVIDIA Dynamo Documentation

Both configurations run Dynamo + TensorRT-LLM on 32x GB200 GPUs across 8 nodes: the baseline uses 4x DEP8 aggregated workers with round-robin routing; the comparison splits into 2 prefill + 2 decode workers with WideEP (DEP8) and KV-aware routing. The trace is heavily reuse-biased — roughly 44% KV cache hit rate and 57% of input tokens from shared context prefixes — so KV-aware routing can avoid large amounts of redundant long-context prefill.

Benchmark setup

Model	nvidia/DeepSeek-V3.2-NVFP4
GPUs	32x GB200 (8 nodes)
Runtime	TensorRT-LLM
Workload	Mooncake-derived synthetic coding trace, fixed-schedule replay: 10,000 requests, 39,186 avg ISL (max 109,459), 344 avg OSL, 44.1% block-level KV hit rate
Metrics	TTFT, ITL, total request latency, and goodput at TTFT 20s / ITL 50ms
Held constant	Model, TensorRT-LLM runtime, 32x GB200 across 8 nodes, fixed-schedule trace replay, and TTFT 20s / ITL 50ms goodput thresholds
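To get a feel for how much prefill work is at stake, the workload statistics above can be turned into a rough token budget. This is a back-of-envelope sketch, not a measured result: it assumes the 44.1% block-level KV hit rate approximates the share of input tokens that an ideal cache could serve, which ignores partial blocks, eviction, and routing misses.

```shell
# Back-of-envelope prefill volume from the published trace statistics.
requests=10000
avg_isl=39186          # average input sequence length, in tokens
hit_rate_permille=441  # 44.1% block-level KV hit rate, in parts per thousand

total_input_tokens=$(( requests * avg_isl ))
# Assumption: the block-level hit rate approximates the reusable share of input tokens.
reusable_tokens=$(( total_input_tokens * hit_rate_permille / 1000 ))
remaining_prefill_tokens=$(( total_input_tokens - reusable_tokens ))

echo "total=${total_input_tokens} reusable=${reusable_tokens} remaining=${remaining_prefill_tokens}"
```

Under that assumption, roughly 173M of the 392M input tokens are candidates for reuse, which is the work a router has to land on the right worker to avoid recomputing.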

Compared Configurations

Role	Configuration	Deploy	Benchmark
Comparison	Disaggregated KV router + WideEP: 2x prefill + 2x decode with WideEP (DEP8), KV-aware routing	deploy.yaml	perf.yaml
Baseline	Aggregated round-robin: 4x DEP8 aggregated workers, round-robin routing	deploy.yaml	perf.yaml

Reproduce

The trace is synthesized from the Mooncake FAST25 conversation trace using Dynamo’s prefix data generator, scaling input lengths and prefix reuse up to a coding-workload shape:

$ datagen synthesize \
>     --input-file conversation_trace.jsonl \
>     --prefix-len-multiplier 16 \
>     --prompt-len-multiplier 10 \
>     --max-isl 110000 \
>     --num-requests 10000
$ # emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
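Before copying the synthesized trace to the cluster, it can be worth checking that its shape is close to the workload described in the benchmark setup. The helper below is a minimal sketch: `trace_stats` is a hypothetical name, and it assumes the Mooncake-style JSONL layout with one request per line and an integer `input_length` field.

```shell
# Summarize a Mooncake-style JSONL trace: request count, average and max input length.
# Assumption: one JSON object per line, each with an integer "input_length" field.
trace_stats() {
  grep -o '"input_length": *[0-9]*' "$1" |
    awk -F': *' '
      { n++; sum += $2; if ($2 > max) max = $2 }
      END { if (n) printf "requests=%d avg_isl=%d max_isl=%d\n", n, sum / n, max }'
}

# Usage (file name taken from the datagen step above):
# trace_stats conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
```

A request count, average ISL, or maximum ISL that is far from the published workload figures suggests the synthesis parameters differ from the ones shown here.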

The replay uses --fixed-schedule, so request arrivals are pinned to the trace — throughput is fixed and the comparison is on TTFT, ITL, total request latency, and goodput. Each configuration’s perf.yaml wraps this AIPerf command:

$ aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 \
>   --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
>   --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --url http://<frontend>:8000 \
>   --streaming \
>   --goodput "time_to_first_token:20000 inter_token_latency:50"
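The `--goodput` thresholds are in milliseconds: a request counts toward goodput only if it meets both the 20,000 ms TTFT limit and the 50 ms ITL limit. AIPerf computes this itself; the snippet below only illustrates the threshold semantics. The input format (one request per line as `<ttft_ms> <itl_ms>`), the `good_requests` name, and the inclusive comparison are assumptions made for the example.

```shell
# Count requests that satisfy both SLOs.
# Input on stdin: one request per line, "<ttft_ms> <itl_ms>" (hypothetical format).
good_requests() {
  awk -v ttft_max=20000 -v itl_max=50 '
    { total++; if ($1 <= ttft_max && $2 <= itl_max) good++ }
    END { printf "good=%d total=%d\n", good, total }'
}

# Example: only the first and last requests meet both thresholds.
printf '1500 30\n25000 30\n1500 80\n19999 50\n' | good_requests
```

Because the request schedule is identical for both configurations, differences in this count come from latency alone.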

The frontend services are agg-round-robin-dsv32-nvfp4-frontend and disagg-kv-dsv32-nvfp4-frontend. Deploy one configuration at a time:

$ export NAMESPACE=your-namespace
$ 
$ # One-time prep: storage, ComputeDomain (for MNNVL co-location), model download
$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
$ 
$ # Copy the synthesized trace onto the PVC
$ kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
$ 
$ # Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
$ kubectl apply -f recipes/deepseek-v32-fp4/trtllm/<configuration>/deploy.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/deepseek-v32-fp4/trtllm/<configuration>/perf.yaml -n ${NAMESPACE}

The benchmark runs as a Kubernetes Job; tail it with kubectl logs -f -l job-name=<bench-job-name> -n ${NAMESPACE} (each config’s perf.yaml defines its Job name). Results land under /model-cache/perf/<epoch>_<job-name>/ on the model-cache PVC; copy them out with kubectl cp.
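After copying results for both configurations to a local directory, the epoch prefix in each folder name makes it easy to pick the most recent run. A minimal sketch, assuming the folders keep the `<epoch>_<job-name>` naming; `latest_run` and the local directory layout are hypothetical.

```shell
# Print the name of the newest <epoch>_<job-name> folder inside a local results directory.
# $1: local directory that holds the copied result folders (hypothetical layout).
latest_run() {
  for d in "$1"/*_*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    # Emit "<epoch> <folder-name>" so the list can be sorted numerically by epoch.
    printf '%s %s\n' "${name%%_*}" "$name"
  done | sort -n | tail -n 1 | cut -d' ' -f2-
}

# Usage:
# latest_run ./perf-results
```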

Notes

The source publishes the comparison as a results video plus the dataset statistics above, not a numeric results table; run both configurations to produce TTFT/ITL/goodput deltas for your cluster.
perf.yaml pins transformers==4.57.6 alongside aiperf==0.6.0 — older transformers cannot load the deepseek_v32 tokenizer and AIPerf surfaces it as “Failed to load tokenizer”.
Multi-node GB200 deployments need the ComputeDomain CR so the DRA scheduler co-locates worker pods on MNNVL-connected nodes; if you rename it, mirror the change in each deploy.yaml under extraPodSpec.resourceClaims and resources.claims.
Background on the underlying optimizations: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs.
Source: recipes/deepseek-v32-fp4

Related Recipe

The disaggregated KV router + WideEP configuration is the promoted deployment target: DeepSeek V3.2 NVFP4. The aggregated round-robin configuration exists as a benchmark control only.
