DeepSeek V3.2 WideEP Routing A/B

Does disaggregated KV-aware routing with WideEP improve latency and goodput against an aggregated round-robin GB200 control?

Both configurations run Dynamo + TensorRT-LLM on 32x GB200 GPUs across 8 nodes: the baseline uses 4x DEP8 aggregated workers with round-robin routing; the comparison splits into 2 prefill + 2 decode workers with WideEP (DEP8) and KV-aware routing. The trace is heavily reuse-biased — roughly 44% KV cache hit rate and 57% of input tokens from shared context prefixes — so KV-aware routing can avoid large amounts of redundant long-context prefill.

Benchmark setup

  • Model: nvidia/DeepSeek-V3.2-NVFP4
  • GPUs: 32x GB200 (8 nodes)
  • Runtime: TensorRT-LLM
  • Workload: Mooncake-derived synthetic coding trace, fixed-schedule replay: 10,000 requests, 39,186 avg ISL (max 109,459), 344 avg OSL, 44.1% block-level KV hit rate
  • Metrics: TTFT, ITL, total request latency, and goodput at TTFT 20s / ITL 50ms
  • Held constant: Model, TensorRT-LLM runtime, 32x GB200 across 8 nodes, fixed-schedule trace replay, and TTFT 20s / ITL 50ms goodput thresholds
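
To give a sense of scale, the workload statistics above can be turned into rough token counts. This is a back-of-the-envelope sketch, not a measured result: it assumes the 44.1% block-level hit rate can be read as a token-level fraction, and it uses the rounded average ISL, so the totals are approximate. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Rough input-token arithmetic from the published workload statistics.
requests=10000
avg_isl=39186          # average input sequence length, in tokens

# Total input tokens submitted over the whole replay.
total_input_tokens=$(( requests * avg_isl ))

# Assumption: treat the 44.1% block-level KV hit rate as a token fraction.
reusable_tokens=$(( total_input_tokens * 441 / 1000 ))

# Tokens that would still need prefill if every reusable block were hit.
prefill_tokens=$(( total_input_tokens - reusable_tokens ))

echo "total input tokens:        ${total_input_tokens}"
echo "potentially reusable:      ${reusable_tokens}"
echo "remaining prefill tokens:  ${prefill_tokens}"
```

Under these assumptions, roughly 173M of about 392M input tokens are candidates for cache reuse. How much of that a deployment actually captures depends on whether requests reach a worker that already holds the matching blocks.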

Compared configurations

| Role | Configuration | Deploy | Benchmark |
| --- | --- | --- | --- |
| Comparison | Disaggregated KV router + WideEP: 2x prefill + 2x decode with WideEP (DEP8), KV-aware routing | deploy.yaml | perf.yaml |
| Baseline | Aggregated round-robin: 4x DEP8 aggregated workers, round-robin routing | deploy.yaml | perf.yaml |

Reproduce

The trace is synthesized from the Mooncake FAST25 conversation trace using Dynamo’s prefix data generator, scaling input lengths and prefix reuse up to a coding-workload shape:

$datagen synthesize \
> --input-file conversation_trace.jsonl \
> --prefix-len-multiplier 16 \
> --prompt-len-multiplier 10 \
> --max-isl 110000 \
> --num-requests 10000
$# emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
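
For intuition about what a block-level hit rate means on a trace like this, the sketch below replays a tiny hand-written sample and counts how many block hashes were already seen. It assumes the record shape of the public Mooncake traces (`timestamp`, `input_length`, `output_length`, `hash_ids`) and a single unbounded cache; the three sample records are made up, and the source does not state that its 44.1% figure was computed this way.

```shell
#!/usr/bin/env bash
# Hypothetical three-request sample in Mooncake-style JSONL.
trace='{"timestamp": 0, "input_length": 1024, "output_length": 8, "hash_ids": [1, 2]}
{"timestamp": 100, "input_length": 1536, "output_length": 8, "hash_ids": [1, 2, 3]}
{"timestamp": 200, "input_length": 1024, "output_length": 8, "hash_ids": [1, 4]}'

# 1. grep keeps only the hash_ids array of each record.
# 2. tr strips everything except digits, commas, and newlines.
# 3. awk walks the ids in arrival order; an id seen earlier counts as a hit.
hit_rate=$(printf '%s\n' "$trace" \
  | grep -o '"hash_ids": *\[[^]]*\]' \
  | tr -cd '0-9,\n' \
  | awk -F, '{
      for (k = 1; k <= NF; k++) {
        total++
        if (($k) in seen) hits++
        seen[$k] = 1
      }
    }
    END { printf "%.1f", 100 * hits / total }')

echo "block-level hit rate: ${hit_rate}%"
```

A real deployment has bounded cache capacity and several workers, so the achieved hit rate also depends on eviction and on which worker each request is routed to.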

The replay uses --fixed-schedule, so request arrivals are pinned to the trace — throughput is fixed and the comparison is on TTFT, ITL, total request latency, and goodput. Each configuration’s perf.yaml wraps this AIPerf command:

$aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 \
> --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
> --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule \
> --url http://<frontend>:8000 \
> --streaming \
> --goodput "time_to_first_token:20000 inter_token_latency:50"
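
The `--goodput` thresholds above are in milliseconds and correspond to the TTFT 20s / ITL 50ms targets. The sketch below shows the idea of the check on made-up per-request samples: a request counts as good only if it meets both limits. The sample data, the variable names, and the inclusive comparison are assumptions for illustration; this is not AIPerf's implementation or output format.

```shell
#!/usr/bin/env bash
# Hypothetical per-request samples: "request_id ttft_ms itl_ms".
samples='r1 8500 32
r2 21000 28
r3 12000 55
r4 19999 50'

ttft_limit_ms=20000   # time_to_first_token:20000
itl_limit_ms=50       # inter_token_latency:50

# Count requests that satisfy both thresholds (inclusive, by assumption).
good=$(printf '%s\n' "$samples" \
  | awk -v t="$ttft_limit_ms" -v i="$itl_limit_ms" \
      '$2 <= t && $3 <= i { n++ } END { print n + 0 }')

# Total number of requests in the sample.
total=$(printf '%s\n' "$samples" | wc -l | tr -d ' ')

echo "requests meeting both thresholds: ${good} of ${total}"
```

Because the replay schedule is fixed, both configurations see the same requests at the same times, so differences in the share of good requests reflect latency behavior rather than load.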

The frontend services are agg-round-robin-dsv32-nvfp4-frontend and disagg-kv-dsv32-nvfp4-frontend. Deploy one configuration at a time:

$export NAMESPACE=your-namespace
$
$# One-time prep: storage, ComputeDomain (for MNNVL co-location), model download
$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
$
$# Copy the synthesized trace onto the PVC
$kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
$
$# Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
$kubectl apply -f recipes/deepseek-v32-fp4/trtllm/<configuration>/deploy.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/deepseek-v32-fp4/trtllm/<configuration>/perf.yaml -n ${NAMESPACE}

The benchmark runs as a Kubernetes Job; tail it with kubectl logs -f -l job-name=<bench-job-name> -n ${NAMESPACE} (each config’s perf.yaml defines its Job name). Results land under /model-cache/perf/<epoch>_<job-name>/ on the model-cache PVC; copy them out with kubectl cp.

Notes

  • The source publishes the comparison as a results video plus the dataset statistics above, not a numeric results table; run both configurations to produce TTFT/ITL/goodput deltas for your cluster.
  • perf.yaml pins transformers==4.57.6 alongside aiperf==0.6.0: older transformers releases cannot load the deepseek_v32 tokenizer, and AIPerf surfaces the failure as “Failed to load tokenizer”.
  • Multi-node GB200 deployments need the ComputeDomain CR so the DRA scheduler co-locates worker pods on MNNVL-connected nodes; if you rename it, mirror the change in each deploy.yaml under extraPodSpec.resourceClaims and resources.claims.
  • Background on the underlying optimizations: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs.
  • Source: recipes/deepseek-v32-fp4

The disaggregated KV router + WideEP configuration is the promoted deployment target: DeepSeek V3.2 NVFP4. The aggregated round-robin configuration exists as a benchmark control only.