
# DeepSeek V3.2 NVFP4

This recipe deploys DeepSeek V3.2 NVFP4 disaggregated across 32 GB200 GPUs, using 2x prefill and 2x decode workers (DEP8) with KV-aware routing and WideEP. It targets long-context coding traces with heavy prefix reuse (~39.2K average ISL, 44% KV block hit rate). The aggregated round-robin configuration is the benchmark control and lives on the related Feature Benchmark page.

**Deployment target**

* **Checkpoint:** `nvidia/DeepSeek-V3.2-NVFP4`
* **GPUs:** 32x GB200 across 8 nodes (NVL72, MNNVL)
* **Techniques:** Disagg, KV-aware routing, WideEP
* **Workload:** 10K-request Mooncake-based coding trace, fixed schedule
* **SLA:** TTFT 20s / ITL 50ms
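
The GPU count follows from the worker layout. The arithmetic below is a quick sanity check, assuming DEP8 means each worker spans 8 GPUs; the variable names are illustrative and do not come from the recipe files.

```bash
# Sanity-check the GPU budget (variable names are illustrative).
prefill_workers=2
decode_workers=2
gpus_per_worker=8   # assumption: DEP8 = 8 GPUs per worker
nodes=8

total_gpus=$(( (prefill_workers + decode_workers) * gpus_per_worker ))
gpus_per_node=$(( total_gpus / nodes ))

echo "total GPUs:    ${total_gpus}"
echo "GPUs per node: ${gpus_per_node}"
```

Under these assumptions each 8-GPU worker spans two 4-GPU nodes, which is why the MNNVL fabric matters for this layout.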

## Prerequisites

* A Kubernetes cluster with the Dynamo platform (operator) installed and **32x GB200 across 8 nodes (NVL72, MNNVL-connected)** available.
* A Hugging Face token with access to `nvidia/DeepSeek-V3.2-NVFP4`.
* The `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.1` image (already pinned in `deploy.yaml`).

Create the namespace and token secret:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}
```

Multinode deployments on MNNVL-connected GB200 clusters typically require a ComputeDomain in your namespace so the DRA scheduler co-locates worker pods on MNNVL-connected nodes — otherwise internode GPU peer memory access fails:

```bash
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}
```

If you rename the ComputeDomain or its `resourceClaimTemplate` (defaults: `your-compute-domain` / `your-compute-domain-channel`), apply the same names in the deploy YAMLs under `extraPodSpec.resourceClaims` and `mainContainer.resources.claims`.
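
After a rename, it can help to list every place the old name still appears. The helper below is a minimal sketch built on `grep`; `find_claim_refs` is a hypothetical name, not part of the recipe.

```bash
# List every line in the given files that mentions a ComputeDomain claim name.
# Usage: find_claim_refs <name> <file>...
find_claim_refs() {
  name="$1"
  shift
  # -n prints line numbers, -H prints the file name, -F matches the name literally.
  grep -nHF -- "$name" "$@"
}

# Example (paths as used elsewhere in this recipe):
#   find_claim_refs your-compute-domain \
#     recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml
```

Because the match is a literal substring, searching for `your-compute-domain` also reports lines containing `your-compute-domain-channel`.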

Edit cluster-specific values before applying: `storageClassName` in `model-cache/model-cache.yaml` (run `kubectl get storageclass` to list the available classes), the namespace (`perf.yaml` hardcodes `namespace: your-namespace`), and the ComputeDomain claim names.
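
The namespace edit can be scripted. The function below is a minimal sketch that rewrites the literal `namespace: your-namespace` string in a manifest; `set_namespace` is a hypothetical helper name, and it assumes the placeholder appears exactly in that form.

```bash
# Replace the hardcoded "namespace: your-namespace" value in a manifest.
# Usage: set_namespace <file> <namespace>
set_namespace() {
  file="$1"
  ns="$2"
  scratch="$(mktemp)"
  # Write to a scratch file first; this avoids the non-portable `sed -i`
  # and keeps the original file's permissions.
  sed "s/namespace: your-namespace/namespace: ${ns}/" "$file" > "$scratch" \
    && cat "$scratch" > "$file"
  status=$?
  rm -f "$scratch"
  return "$status"
}

# Example:
#   set_namespace recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml "${NAMESPACE}"
```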

## Deploy

Create storage, download NVIDIA's official NVFP4 checkpoint, copy the trace into the PVC, then deploy:

```bash
# 1. Create the model-cache PVC (RWX, 400Gi).
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}

# 2. Download nvidia/DeepSeek-V3.2-NVFP4 into the PVC.
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

# 3. Copy the synthesized trace file into the PVC.
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/

# 4. Deploy: KV-routed frontend + 2x prefill + 2x decode (WideEP) workers.
kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s
```

To synthesize the trace, run Dynamo's [prefix data generator](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator) on the Mooncake `conversation_trace.jsonl`:

```bash
datagen synthesize \
    --input-file conversation_trace.jsonl \
    --prefix-len-multiplier 16 \
    --prompt-len-multiplier 10 \
    --max-isl 110000 \
    --num-requests 10000
# emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
```
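
Before copying the trace into the PVC, a quick check of its size and longest prompt can catch a bad synthesis run. The sketch below assumes the Mooncake-style layout of one JSON object per line with an integer `input_length` field; `trace_stats` is a hypothetical helper, not part of the data generator.

```bash
# Print the request count and the largest input_length in a Mooncake-style trace.
# Assumes one JSON object per line with an integer "input_length" field.
# Usage: trace_stats <trace.jsonl>
trace_stats() {
  awk '
    match($0, /"input_length"[ ]*:[ ]*[0-9]+/) {
      field = substr($0, RSTART, RLENGTH)
      sub(/^[^0-9]*/, "", field)   # strip the key, keep the digits
      if (field + 0 > max) max = field + 0
      n++
    }
    END { printf "requests=%d max_isl=%d\n", n, max }
  ' "$1"
}

# Example:
#   trace_stats conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
```

For this recipe the reported values should be consistent with `--num-requests 10000` and `--max-isl 110000`.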

## Smoke Test

The Dynamo operator creates the `disagg-kv-dsv32-nvfp4-frontend` service for this deployment.

```bash
# port-forward blocks the terminal; run it in a separate shell before the curl below.
kubectl port-forward svc/disagg-kv-dsv32-nvfp4-frontend 8000:8000 -n ${NAMESPACE}

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"nvidia/DeepSeek-V3.2-NVFP4","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
```
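
Large engines can take a while to come up after the pods report ready, so a retry loop in front of the smoke test may save some manual polling. This is a minimal sketch that assumes the frontend answers `GET /v1/models` once it is serving; `wait_for_frontend` is a hypothetical helper.

```bash
# Poll the frontend until it answers, or give up after N attempts.
# Assumes the frontend serves GET /v1/models once it is up.
# Usage: wait_for_frontend <base-url> [attempts] [delay-seconds]
wait_for_frontend() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s keeps it quiet.
    if curl -fs -o /dev/null "${url}/v1/models" 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example:
#   wait_for_frontend http://localhost:8000 60 5 && echo "frontend is up"
```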

## Benchmark

The checked-in [`perf.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml) runs a short synthetic warmup, then replays the trace at its original timestamps. It wraps this AIPerf run:

```bash
aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
  --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
  --custom-dataset-type mooncake_trace --fixed-schedule \
  --url http://disagg-kv-dsv32-nvfp4-frontend:8000 --streaming \
  --goodput "time_to_first_token:20000 inter_token_latency:50"
```

To run it in-cluster, apply the Job and follow its logs:

```bash
kubectl apply -f recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}

# Follow the benchmark Job's logs:
kubectl logs -f -l job-name=disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}
```

Results land under `/model-cache/perf/<epoch>_<job-name>/` on the `model-cache` PVC; copy them out with `kubectl cp`.

Because the replay is `--fixed-schedule`, throughput is fixed by the trace — the metrics to compare are TTFT (KV-aware routing cuts prefill compute via prefix hits), ITL (disaggregation isolates decode from prefill interference), total request latency, and goodput against the TTFT 20s / ITL 50ms SLA.
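
The per-request criterion behind the SLA can be illustrated on its own. The sketch below counts requests that meet both thresholds from a hypothetical two-column file (`ttft_ms,itl_ms` per line, no header); this is not AIPerf's export format, and treating the thresholds as inclusive is an assumption. AIPerf computes its own goodput figure, so this only shows the pass/fail logic.

```bash
# Count requests that meet both SLA thresholds.
# Input format is hypothetical: one "ttft_ms,itl_ms" pair per line, no header.
# Usage: goodput_ratio <file> [ttft_limit_ms] [itl_limit_ms]
goodput_ratio() {
  awk -F, -v ttft_max="${2:-20000}" -v itl_max="${3:-50}" '
    NF == 2 {
      total++
      # A request is "good" only if it meets both limits.
      if ($1 + 0 <= ttft_max + 0 && $2 + 0 <= itl_max + 0) good++
    }
    END {
      if (total == 0) { print "good=0 total=0"; exit }
      printf "good=%d total=%d ratio=%.2f\n", good, total, good / total
    }
  ' "$1"
}
```

The defaults mirror the `--goodput "time_to_first_token:20000 inter_token_latency:50"` setting used above.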

## Related Feature Benchmarks

* [DeepSeek V3.2 WideEP routing A/B](/dynamo/dev/benchmarks/deepseek-v3-2-wideep-routing)

## Notes

* The aggregated round-robin configuration (`trtllm/agg-round-robin/`) is the benchmark control arm; it appears on the Feature Benchmark page, not as a recipe target here.
* The benchmark job pins `transformers==4.57.6`: the trtllm-runtime base ships 4.55.0, and aiperf 0.6.0's `transformers>=4.56.0` floor would otherwise pull 5.x, which cannot load the `deepseek_v32` tokenizer (surfaces as "Failed to load tokenizer").
* Decode workers run the WIDEEP MoE backend with `moe_expert_parallel_size: 8` and FP8 KV cache; prefill enables chunked prefill and KV block reuse (`tokens_per_block: 64`, matching `--kv-block-size 64`).
* The trace tops out at 109,459 input tokens; both engine configs set `max_seq_len: 121000` to accommodate it.
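
The sequence-length and block-size figures in these notes can be cross-checked with a little arithmetic. The sketch below assumes `max_seq_len` bounds prompt plus generated tokens; the variable names are illustrative.

```bash
# Values from the notes above; variable names are illustrative.
max_seq_len=121000
longest_input=109459
tokens_per_block=64

# Tokens left for generation on the longest request, assuming max_seq_len
# bounds prompt plus generated tokens.
output_headroom=$(( max_seq_len - longest_input ))

# KV blocks needed to hold the longest prompt (ceiling division).
prompt_blocks=$(( (longest_input + tokens_per_block - 1) / tokens_per_block ))

echo "output headroom: ${output_headroom} tokens"
echo "prompt blocks:   ${prompt_blocks}"
```

With these values the longest request leaves 11,541 tokens for output and occupies 1,711 KV blocks of 64 tokens.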

## Source

* Source README: [recipes/deepseek-v32-fp4/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/deepseek-v32-fp4/README.md)
* Disagg + KV router + WideEP: [deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/deploy.yaml) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/deepseek-v32-fp4/trtllm/disagg-kv-router/perf.yaml)
* Setup assets: [model-cache/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/deepseek-v32-fp4/model-cache) including [compute-domain.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml)