DeepSeek V3.2 NVFP4

Serve nvidia/DeepSeek-V3.2-NVFP4 with Dynamo and TensorRT-LLM, disaggregated with KV routing and WideEP on GB200 NVL72.


This recipe deploys DeepSeek V3.2 NVFP4 in disaggregated mode across 32 GB200 GPUs: 2x prefill + 2x decode workers (DEP8) with KV-aware routing and WideEP. It targets long-context coding traces with heavy prefix reuse (~39.2K avg ISL, 44% KV block hit rate). The aggregated round-robin configuration is the benchmark control and lives on the related Feature Benchmark page.

Deployment target

  • Checkpoint: nvidia/DeepSeek-V3.2-NVFP4
  • GPUs: 32x GB200 across 8 nodes (NVL72, MNNVL)
  • Techniques: Disagg, KV-aware routing, WideEP
  • Workload: 10K-request Mooncake-based coding trace, fixed schedule
  • SLA: TTFT 20s / ITL 50ms
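The GPU count follows directly from the worker layout. The arithmetic below is a quick sanity check, assuming DEP8 means each worker spans 8 GPUs; the variable names are illustrative and do not come from the recipe files.

```shell
# Worker layout stated in this recipe: 2 prefill + 2 decode workers.
PREFILL_WORKERS=2
DECODE_WORKERS=2
GPUS_PER_WORKER=8   # assumption: DEP8 = 8 GPUs per worker
NODES=8

# Total GPUs requested by the deployment, and how many land on each node.
TOTAL_GPUS=$(( (PREFILL_WORKERS + DECODE_WORKERS) * GPUS_PER_WORKER ))
GPUS_PER_NODE=$(( TOTAL_GPUS / NODES ))

echo "total GPUs: ${TOTAL_GPUS}, GPUs per node: ${GPUS_PER_NODE}"
```

Under that assumption the result matches the 32 GPUs across 8 nodes listed in the deployment target.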

Prerequisites

  • A Kubernetes cluster with the Dynamo platform (operator) installed and 32x GB200 across 8 nodes (NVL72, MNNVL-connected) available.
  • A Hugging Face token with access to nvidia/DeepSeek-V3.2-NVFP4.
  • The nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.1 image (already pinned in deploy.yaml).

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token" \
> -n ${NAMESPACE}

Multinode deployments on MNNVL-connected GB200 clusters typically require a ComputeDomain in your namespace so the DRA scheduler co-locates worker pods on MNNVL-connected nodes. Without it, internode GPU peer memory access fails:

$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}

If you rename the ComputeDomain or its resourceClaimTemplate (defaults: your-compute-domain / your-compute-domain-channel), apply the same names in the deploy YAMLs under extraPodSpec.resourceClaims and mainContainer.resources.claims.

Edit cluster-specific values before applying: storageClassName in model-cache/model-cache.yaml (run kubectl get storageclass), your namespace (perf.yaml hardcodes namespace: your-namespace), and the ComputeDomain claim names.

Deploy

Create storage, download NVIDIA’s official NVFP4 checkpoint, copy the trace into the PVC, then deploy:

$# 1. Create the model-cache PVC (RWX, 400Gi).
$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$# 2. Download nvidia/DeepSeek-V3.2-NVFP4 into the PVC.
$kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
$
$# 3. Copy the synthesized trace file into the PVC.
$kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
$
$# 4. Deploy: KV-routed frontend + 2x prefill + 2x decode (WideEP) workers.
$kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
> -n ${NAMESPACE} --timeout=1200s

Step 3 assumes the synthesized trace already exists on your local machine. To produce it, run Dynamo’s prefix data generator on the Mooncake conversation_trace.jsonl:

$datagen synthesize \
> --input-file conversation_trace.jsonl \
> --prefix-len-multiplier 16 \
> --prompt-len-multiplier 10 \
> --max-isl 110000 \
> --num-requests 10000
$# emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
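Before copying the synthesized file into the PVC, it can help to confirm that it holds the expected number of requests and respects the input-length cap. The helper below is a minimal sketch: `trace_stats` is a hypothetical name, and it assumes each line of the trace is a JSON object with an integer `input_length` field, as in the Mooncake trace format.

```shell
# Summarize a Mooncake-style JSONL trace: request count and largest input_length.
# Assumption: every request line carries an integer "input_length" field.
trace_stats() {
  awk '
    match($0, /"input_length"[ \t]*:[ \t]*[0-9]+/) {
      field = substr($0, RSTART, RLENGTH)
      sub(/^[^0-9]*/, "", field)          # strip the key, keep only the digits
      n++
      if (field + 0 > max) max = field + 0
    }
    END { printf "requests=%d max_isl=%d\n", n, max }
  ' "$1"
}
```

Running `trace_stats` on the emitted file should report a request count matching --num-requests and a maximum input length at or below the --max-isl value.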

Smoke Test

The Dynamo operator creates the disagg-kv-dsv32-nvfp4-frontend service for this deployment.

$kubectl port-forward svc/disagg-kv-dsv32-nvfp4-frontend 8000:8000 -n ${NAMESPACE}
$
$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"nvidia/DeepSeek-V3.2-NVFP4","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'

Benchmark

The checked-in perf.yaml runs a short synthetic warmup, then replays the trace at its original timestamps. It wraps this AIPerf run:

$aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
> --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
> --custom-dataset-type mooncake_trace --fixed-schedule \
> --url http://disagg-kv-dsv32-nvfp4-frontend:8000 --streaming \
> --goodput "time_to_first_token:20000 inter_token_latency:50"
$
$# Launch the benchmark as a Kubernetes Job (perf.yaml wraps the aiperf command above):
$kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
$
$# Follow the benchmark Job's logs:
$kubectl logs -f -l job-name=disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

Results land under /model-cache/perf/<epoch>_<job-name>/ on the model-cache PVC; copy them out with kubectl cp.

Because the replay is --fixed-schedule, throughput is fixed by the trace. The metrics to compare are TTFT (KV-aware routing cuts prefill compute via prefix hits), ITL (disaggregation isolates decode from prefill interference), total request latency, and goodput against the TTFT 20s / ITL 50ms SLA.
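To make the goodput criterion concrete, the sketch below classifies a single request against the two thresholds passed to --goodput. `meets_sla` is a hypothetical helper, not part of AIPerf, and it assumes both thresholds are inclusive upper bounds in milliseconds.

```shell
# Succeed (exit 0) when a request meets both SLA thresholds, given in milliseconds.
# Thresholds mirror the --goodput flag: TTFT <= 20000 ms and ITL <= 50 ms.
# Assumption: the bounds are inclusive.
meets_sla() {
  awk -v ttft="$1" -v itl="$2" 'BEGIN { exit !(ttft <= 20000 && itl <= 50) }'
}

# Example: a request with 1850.4 ms TTFT and 32.1 ms ITL counts toward goodput.
if meets_sla 1850.4 32.1; then
  echo "within SLA"
else
  echo "SLA miss"
fi
```

A request that misses either threshold does not count, so a configuration with low TTFT but ITL above 50 ms can still show poor goodput.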

Notes

  • The aggregated round-robin configuration (trtllm/agg-round-robin/) is the benchmark control arm; it appears on the Feature Benchmark page, not as a recipe target here.
  • The benchmark job pins transformers==4.57.6: the trtllm-runtime base ships 4.55.0, and aiperf 0.6.0’s transformers>=4.56.0 floor would otherwise pull 5.x, which cannot load the deepseek_v32 tokenizer (surfaces as “Failed to load tokenizer”).
  • Decode workers run the WIDEEP MoE backend with moe_expert_parallel_size: 8 and FP8 KV cache; prefill enables chunked prefill and KV block reuse (tokens_per_block: 64, matching --kv-block-size 64).
  • The trace tops out at 109,459 input tokens; both engine configs set max_seq_len: 121000 to accommodate it.
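The sequence-length and block-size figures in the notes above can be checked with shell arithmetic. This is a rough sketch that assumes max_seq_len bounds input plus generated tokens, and that a partially filled KV block still occupies a whole block; the variable names are illustrative.

```shell
MAX_SEQ_LEN=121000      # max_seq_len from both engine configs
MAX_TRACE_ISL=109459    # longest input in the trace
TOKENS_PER_BLOCK=64     # tokens_per_block / --kv-block-size

# Tokens left for generation on the longest request
# (assumption: max_seq_len covers input plus output tokens).
HEADROOM=$(( MAX_SEQ_LEN - MAX_TRACE_ISL ))

# KV blocks needed to hold the longest input, rounding up the partial block.
MAX_INPUT_BLOCKS=$(( (MAX_TRACE_ISL + TOKENS_PER_BLOCK - 1) / TOKENS_PER_BLOCK ))

echo "headroom: ${HEADROOM} tokens, longest input: ${MAX_INPUT_BLOCKS} KV blocks"
```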

Source