DeepSeek V3.2 NVFP4 | NVIDIA Dynamo Documentation

This recipe deploys DeepSeek V3.2 NVFP4 in disaggregated mode across 32 GB200 GPUs: 2x prefill + 2x decode workers (DEP8) with KV-aware routing and WideEP. It targets long-context coding traces with heavy prefix reuse (~39.2K avg ISL, 44% KV block hit rate). The aggregated round-robin configuration is the benchmark control and lives on the related Feature Benchmark page.

Deployment target

Checkpoint: nvidia/DeepSeek-V3.2-NVFP4
GPUs: 32x GB200 across 8 nodes (NVL72, MNNVL)
Techniques: Disagg, KV-aware routing, WideEP
Workload: 10K-request Mooncake-based coding trace, fixed schedule
SLA: TTFT 20s / ITL 50ms
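
The GPU budget follows from the worker layout. As a quick sanity check, the arithmetic can be sketched as below. It assumes each DEP8 worker occupies 8 GPUs and that GPUs are spread evenly across the nodes; neither is stated explicitly in this recipe, so treat the variable values as illustrative.

```shell
# Sanity-check the GPU budget for this recipe.
# Assumption: each DEP8 worker occupies 8 GPUs.
prefill_workers=2
decode_workers=2
gpus_per_worker=8
nodes=8

total_gpus=$(( (prefill_workers + decode_workers) * gpus_per_worker ))
# Assumption: GPUs are spread evenly across nodes.
gpus_per_node=$(( total_gpus / nodes ))

echo "total GPUs: ${total_gpus}"
echo "GPUs per node: ${gpus_per_node}"
```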

Prerequisites

A Kubernetes cluster with the Dynamo platform (operator) installed and 32x GB200 across 8 nodes (NVL72, MNNVL-connected) available.
A Hugging Face token with access to nvidia/DeepSeek-V3.2-NVFP4.
The nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.1 image (already pinned in deploy.yaml).

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}

Multinode deployments on MNNVL-connected GB200 clusters typically require a ComputeDomain in your namespace so the DRA scheduler co-locates worker pods on MNNVL-connected nodes — otherwise internode GPU peer memory access fails:

$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}

If you rename the ComputeDomain or its resourceClaimTemplate (defaults: your-compute-domain / your-compute-domain-channel), apply the same names in the deploy YAMLs under extraPodSpec.resourceClaims and mainContainer.resources.claims.

Edit cluster-specific values before applying: storageClassName in model-cache/model-cache.yaml (run kubectl get storageclass), your namespace (perf.yaml hardcodes namespace: your-namespace), and the ComputeDomain claim names.
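
One way to script the namespace edit is a small helper like the following. This is a minimal sketch: set_namespace is a hypothetical helper, not part of the recipe, and it assumes the placeholder appears literally as "namespace: your-namespace" in the file.

```shell
# Replace the placeholder namespace in a YAML file, in place.
# Usage: set_namespace <file> <namespace>
# Hypothetical helper; assumes the literal text "namespace: your-namespace".
set_namespace() {
  file="$1"
  ns="$2"
  tmp="$(mktemp)"
  # Write to a temp file first, then copy back so file permissions are kept.
  if sed "s/namespace: your-namespace/namespace: ${ns}/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Example (run from the repository root):
# set_namespace recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml "${NAMESPACE}"
# kubectl get storageclass   # pick a value for storageClassName
```

Review the result with git diff before applying, since a plain text substitution cannot tell which occurrences are intentional.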

Deploy

Create storage, download NVIDIA’s official NVFP4 checkpoint, copy the trace into the PVC, then deploy:

$ # 1. Create the model-cache PVC (RWX, 400Gi).
$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}
$ # 2. Download nvidia/DeepSeek-V3.2-NVFP4 into the PVC.
$ kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
$ # 3. Copy the synthesized trace file into the PVC.
$ kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
$ # 4. Deploy: KV-routed frontend + 2x prefill + 2x decode (WideEP) workers.
$ kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
>   -n ${NAMESPACE} --timeout=1200s

The trace copied in step 3 is synthesized by running Dynamo’s prefix data generator on the Mooncake conversation_trace.jsonl:

$ datagen synthesize \
>     --input-file conversation_trace.jsonl \
>     --prefix-len-multiplier 16 \
>     --prompt-len-multiplier 10 \
>     --max-isl 110000 \
>     --num-requests 10000
$ # emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
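
Before copying the synthesized file into the PVC, it may be worth confirming that it holds the expected number of requests. The sketch below assumes the trace is JSONL with one request per non-blank line; count_requests is a hypothetical helper.

```shell
# Count requests in a JSONL trace.
# Assumption: one JSON object per non-blank line.
count_requests() {
  # grep -c counts matching lines; the pattern skips blank lines and
  # still counts a final line that lacks a trailing newline.
  grep -c '[^[:space:]]' "$1"
}

# Example:
# [ "$(count_requests conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl)" -eq 10000 ] \
#   || echo "unexpected request count" >&2
```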

Smoke Test

The Dynamo operator creates the disagg-kv-dsv32-nvfp4-frontend service for this deployment.

$ kubectl port-forward svc/disagg-kv-dsv32-nvfp4-frontend 8000:8000 -n ${NAMESPACE}
$ # kubectl port-forward stays in the foreground; run the request from a second terminal:
$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"nvidia/DeepSeek-V3.2-NVFP4","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
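
Large workers can take a while to start serving after the pods report ready, so a simple retry loop around the request may help. The following is a sketch under assumptions: wait_for is a hypothetical helper, and the /v1/models path in the usage comment is assumed from the OpenAI-compatible API rather than confirmed by this recipe.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command...>
# Hypothetical helper, not part of the recipe.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll the port-forwarded frontend for up to about 5 minutes.
# The /v1/models endpoint is an assumption.
# wait_for 30 10 curl -sf http://localhost:8000/v1/models
```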

Benchmark

The checked-in perf.yaml runs a short synthetic warmup, then replays the trace at its original timestamps. It wraps this AIPerf run:

$ aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
>   --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
>   --custom-dataset-type mooncake_trace --fixed-schedule \
>   --url http://disagg-kv-dsv32-nvfp4-frontend:8000 --streaming \
>   --goodput "time_to_first_token:20000 inter_token_latency:50"

$ kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
$ # Follow the benchmark Job's logs:
$ kubectl logs -f -l job-name=disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

Results land under /model-cache/perf/<epoch>_<job-name>/ on the model-cache PVC; copy them out with kubectl cp.
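
After several runs, the results directory holds one subdirectory per run. Assuming the <epoch>_<job-name> naming above and a local copy of the perf/ directory, the newest run could be selected as sketched here; latest_run is a hypothetical helper.

```shell
# Print the newest run directory name under a local copy of perf/.
# Assumes names of the form <epoch>_<job-name>, so a numeric sort on the
# leading epoch orders runs chronologically.
latest_run() {
  ls -1 "$1" | sort -n | tail -n 1
}

# Example:
# latest_run ./perf
```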

Because the replay is --fixed-schedule, throughput is fixed by the trace — the metrics to compare are TTFT (KV-aware routing cuts prefill compute via prefix hits), ITL (disaggregation isolates decode from prefill interference), total request latency, and goodput against the TTFT 20s / ITL 50ms SLA.
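
To make the goodput criterion concrete: a request counts as good only when it meets both thresholds. The sketch below counts such requests from a hypothetical two-column text file of per-request TTFT and ITL in milliseconds. The input format, the count_good helper, and the inclusive comparison are all assumptions for illustration; AIPerf computes goodput itself from the --goodput flag.

```shell
# Count requests that meet both SLOs (TTFT 20s, ITL 50ms).
# Input (hypothetical format): one request per line, "ttft_ms itl_ms".
# Assumption: thresholds are treated as inclusive.
count_good() {
  awk -v ttft_max=20000 -v itl_max=50 \
    '$1 <= ttft_max && $2 <= itl_max { good++ } END { print good + 0 }' "$1"
}

# Example:
# count_good per_request_metrics.txt
```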

[Figure: DeepSeek V3.2 WideEP routing A/B]

Notes

The aggregated round-robin configuration (trtllm/agg-round-robin/) is the benchmark control arm; it appears on the Feature Benchmark page, not as a recipe target here.
The benchmark job pins transformers==4.57.6: the trtllm-runtime base ships 4.55.0, and aiperf’s transformers>=4.56.0 floor would otherwise pull 5.x, which cannot load the deepseek_v32 tokenizer (surfaces as “Failed to load tokenizer”).
Decode workers run the WIDEEP MoE backend with moe_expert_parallel_size: 8 and FP8 KV cache; prefill enables chunked prefill and KV block reuse (tokens_per_block: 64, matching --kv-block-size 64).
The trace tops out at 109,459 input tokens; both engine configs set max_seq_len: 121000 to accommodate it.
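
With the numbers above, max_seq_len leaves 121000 - 109459 = 11541 tokens beyond the longest prompt. If you synthesize a different trace, a check like the following can confirm that it still fits. This is a sketch that assumes each trace line carries an integer "input_length" field, as in the Mooncake trace format; max_input_length is a hypothetical helper.

```shell
# Print the largest input_length in a Mooncake-style JSONL trace.
# Assumption: each line carries an integer "input_length" field.
max_input_length() {
  grep -o '"input_length": *[0-9][0-9]*' "$1" \
    | grep -o '[0-9][0-9]*$' \
    | sort -n \
    | tail -n 1
}

# Example: confirm the longest prompt fits under max_seq_len.
# [ "$(max_input_length trace.jsonl)" -lt 121000 ] || echo "trace exceeds max_seq_len" >&2
```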

Source

Source README: recipes/deepseek-v32-fp4/README.md
Disagg + KV router + WideEP: deploy.yaml and perf.yaml
Setup assets: model-cache/ including compute-domain.yaml