Qwen3-32B | NVIDIA Dynamo Documentation

This recipe deploys Qwen/Qwen3-32B on 16x H200 with prefill/decode disaggregation and KV-aware routing. It is the winning configuration from the KV routing A/B benchmark, validated against a real Mooncake conversation trace (12,031 requests, 12K-token average input length, 36.64% cache efficiency). It fits multi-turn conversational workloads where requests share long prefixes that KV-aware routing can exploit.

Deployment target

Checkpoint: Qwen/Qwen3-32B
Precision: BF16
GPUs: 16x H200, 2 nodes: 6x TP2 prefill + 2x TP2 decode
Techniques: Disaggregated P/D + KV-aware routing
Workload: Multi-turn conversation with prefix reuse (Mooncake trace, 12K avg ISL / 343 avg OSL)
SLA: Goodput at TTFT 2s / ITL 25ms
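
The GPU count follows from the worker layout: each worker spans its tensor-parallel degree in GPUs. A minimal sketch of that arithmetic, useful when adapting the layout to a different cluster size; `gpu_total` is a hypothetical helper, not part of Dynamo:

```shell
# Hypothetical helper: total GPUs used by a disaggregated layout.
# Arguments: prefill_workers prefill_tp decode_workers decode_tp
gpu_total() {
  echo $(( $1 * $2 + $3 * $4 ))
}

# 6x TP2 prefill + 2x TP2 decode, as in this recipe.
gpu_total 6 2 2 2   # prints 16
```

Any change to the prefill/decode ratio or TP degree has to keep this total within the GPUs the cluster can actually schedule.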

Prerequisites

A Kubernetes cluster with the Dynamo platform installed and 16x H200 across 2 nodes available.
A Hugging Face token with access to Qwen/Qwen3-32B.

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, Hugging Face secrets, and cluster-specific placement before applying these manifests.
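
Because the commands above use literal placeholders (`your-namespace`, `your-token`), it is easy to create resources with them by accident. One possible guard, sketched under the assumption that you drive the setup from a shell script; `is_placeholder` and `require_real_value` are hypothetical helpers:

```shell
# Hypothetical pre-flight guard: treat an empty value or one of the literal
# placeholders used on this page as "not configured yet".
is_placeholder() {
  case "$1" in
    ""|your-namespace|your-token) return 0 ;;
    *) return 1 ;;
  esac
}

# $1 = variable name (used only in the message), $2 = its current value.
require_real_value() {
  if is_placeholder "$2"; then
    echo "error: $1 is unset or still a placeholder" >&2
    return 1
  fi
}
```

A script could then run `require_real_value NAMESPACE "${NAMESPACE:-}" || exit 1` before the first `kubectl` call.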

Deploy

Prepare the model cache and download the checkpoint (cache.yaml creates the model-cache, compilation-cache, and perf-cache PVCs):

$ # Edit storageClassName in cache.yaml to match your cluster first.
$ kubectl apply -f recipes/qwen3-32b/model-cache/cache.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/qwen3-32b/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
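
The 600s timeout implies a minimum download rate. A rough sketch of that estimate, assuming the checkpoint is about the size of the ~64 GB of BF16 weights mentioned in the Notes and that nothing is already cached; `min_mb_per_s` is a hypothetical helper:

```shell
# Minimum sustained throughput in MB/s (integer floor) needed to move
# SIZE_GB gigabytes within TIMEOUT_S seconds, using 1 GB = 1000 MB.
min_mb_per_s() {
  echo $(( $1 * 1000 / $2 ))
}

min_mb_per_s 64 600   # prints 106
```

If the cluster's link to the model hub is slower than that, raising `--timeout` on the `kubectl wait` command is likely needed. The wait only reports a timeout on the client side; the download job itself keeps running.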

Then deploy:

$ kubectl apply -f recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=ready pod \
>   -l nvidia.com/dynamo-graph-deployment-name=disagg-router-6p-2d \
>   -n ${NAMESPACE} --timeout=1200s

Smoke Test

Send a test request to verify the deployment serves traffic:

$ kubectl port-forward svc/disagg-router-6p-2d-frontend 8000:8000 -n ${NAMESPACE} &
$ sleep 3
$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"Qwen/Qwen3-32B","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
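
The fixed `sleep 3` assumes the port-forward is ready within three seconds. A polling loop is one possible alternative; `retry` below is a hypothetical helper, and the `/v1/models` path in the usage note is an assumption based on the frontend exposing an OpenAI-compatible API:

```shell
# retry ATTEMPTS DELAY_S CMD...
# Run CMD until it succeeds or ATTEMPTS tries have been made.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=0
  while [ "$retry_i" -lt "$retry_attempts" ]; do
    "$@" && return 0
    retry_i=$((retry_i + 1))
    sleep "$retry_delay"
  done
  return 1
}
```

For example, `retry 10 1 curl -sf -o /dev/null http://localhost:8000/v1/models` would poll once per second for up to ten tries before the chat completion request is sent.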

Benchmark

The checked-in perf.yaml downloads the Mooncake conversation trace (FAST25; 12,031 requests over ~59 minutes) and replays it on a fixed schedule, so throughput is set by the trace and latency is what you measure. It wraps this AIPerf run:

$ aiperf profile -m Qwen/Qwen3-32B \
>   --input-file conversation_trace.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --url http://disagg-router-6p-2d-frontend:8000 \
>   --streaming \
>   --goodput "time_to_first_token:2000 inter_token_latency:25"
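
Since the replay follows a fixed schedule, the average offered load can be estimated from the trace figures quoted above. A small sketch, with the caveats that the ~59 minute duration is approximate and that a real trace is bursty rather than uniform; `avg_req_per_s` is a hypothetical helper:

```shell
# Average request rate of a replay: REQUESTS / (MINUTES * 60).
avg_req_per_s() {
  awk -v n="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", n / (minutes * 60) }'
}

avg_req_per_s 12031 59   # prints 3.40
```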

To run the benchmark, apply the manifest:

$ kubectl apply -f recipes/qwen3-32b/vllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
$ # The benchmark runs in a tmux session; attach to watch intermediate results.
$ kubectl get pods -n ${NAMESPACE} | grep benchmark
$ kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

Results land on the perf-cache PVC under /perf-cache/artifacts/.
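
When reading the results, it helps to keep in mind what the goodput thresholds mean: a request counts only if it meets both the TTFT and the ITL limit. The sketch below illustrates that rule on made-up numbers. The two-column input (TTFT in ms, ITL in ms) is a hypothetical format chosen for the example, not the layout of the AIPerf artifacts, and AIPerf's own accounting may differ in detail:

```shell
# count_good TTFT_MAX_MS ITL_MAX_MS
# Reads "ttft_ms itl_ms" pairs on stdin and prints "good/total", where a
# request is good only if both values are within their limits.
count_good() {
  awk -v ttft_max="$1" -v itl_max="$2" '
    $1 <= ttft_max && $2 <= itl_max { good++ }
    END { printf "%d/%d\n", good, NR }
  '
}

# Four illustrative requests: one misses TTFT, one misses ITL.
printf '%s\n' '850 18.2' '2400 19.0' '1200 31.5' '1999 25' | count_good 2000 25   # prints 2/4
```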

See the Qwen3-32B KV routing A/B benchmark for the evidence behind this recipe: it compares this configuration against aggregated round-robin on the same trace.

Notes

The aggregated round-robin manifest is a benchmark control and lives on the related Feature Benchmarks page, not in this recipe.
For 80 GB GPUs (H100), use the Qwen3-32B FP8 recipe — BF16 weights are ~64 GB and leave little KV headroom.

Source

Source README: recipes/qwen3-32b/README.md
Deploy and benchmark: deploy.yaml and perf.yaml
Setup assets: model-cache/cache.yaml and model-cache/model-download.yaml