Qwen3-32B

Serve Qwen3-32B with Dynamo and vLLM — disaggregated serving with KV-aware routing for multi-turn conversational traffic.

This recipe deploys Qwen/Qwen3-32B on 16x H200 with prefill/decode disaggregation and KV-aware routing — the winning configuration from the KV routing A/B benchmark, validated against a real Mooncake conversation trace (12,031 requests, 12K average input length, 36.64% cache efficiency). It fits multi-turn conversational workloads where requests share long prefixes that KV-aware routing can exploit.

Deployment target

  • Checkpoint: Qwen/Qwen3-32B
  • Precision: BF16
  • GPUs: 16x H200, 2 nodes: 6x TP2 prefill + 2x TP2 decode
  • Techniques: Disaggregated P/D + KV-aware routing
  • Workload: Multi-turn conversation with prefix reuse (Mooncake trace, 12K avg ISL / 343 avg OSL)
  • SLA: Goodput at TTFT 2s / ITL 25ms
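
The GPU count in the deployment target follows from the worker layout: six prefill workers and two decode workers, each at tensor-parallel degree 2. The arithmetic can be sketched as below. The variable names are illustrative, and the even 8-GPUs-per-node split is an assumption, not something the recipe states.

```shell
# Worker layout taken from the deployment target; variable names are illustrative.
PREFILL_WORKERS=6
DECODE_WORKERS=2
TP_SIZE=2          # tensor-parallel degree per worker
GPUS_PER_NODE=8    # assumption: the 16 GPUs are split evenly across the 2 nodes

PREFILL_GPUS=$((PREFILL_WORKERS * TP_SIZE))
DECODE_GPUS=$((DECODE_WORKERS * TP_SIZE))
TOTAL_GPUS=$((PREFILL_GPUS + DECODE_GPUS))
NODES=$((TOTAL_GPUS / GPUS_PER_NODE))

echo "prefill=${PREFILL_GPUS} decode=${DECODE_GPUS} total=${TOTAL_GPUS} nodes=${NODES}"
```

Three quarters of the GPUs go to prefill, which fits a workload with long inputs (12K average) and short outputs (343 average).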

Prerequisites

  • A Kubernetes cluster with the Dynamo platform installed and 16x H200 across 2 nodes available.
  • A Hugging Face token with access to Qwen/Qwen3-32B.

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token" \
> -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, Hugging Face secrets, and cluster-specific placement before applying these manifests.

Deploy

Prepare the model cache and download the checkpoint (cache.yaml creates the model-cache, compilation-cache, and perf-cache PVCs):

$# Edit storageClassName in cache.yaml to match your cluster first.
$kubectl apply -f recipes/qwen3-32b/model-cache/cache.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/qwen3-32b/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

Then deploy:

$kubectl apply -f recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
$
$kubectl wait --for=condition=ready pod \
> -l nvidia.com/dynamo-graph-deployment-name=disagg-router-6p-2d \
> -n ${NAMESPACE} --timeout=1200s

Smoke test

Send a test request to verify the deployment serves traffic:

$kubectl port-forward svc/disagg-router-6p-2d-frontend 8000:8000 -n ${NAMESPACE} &
$sleep 3
$
$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"Qwen/Qwen3-32B","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
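
The fixed `sleep 3` may be too short if the port-forward is slow to come up. One alternative is to poll until the frontend answers. The following is a minimal sketch of a generic retry helper; `retry` is a hypothetical name, and the `/v1/models` path in the usage comment assumes the frontend exposes the usual OpenAI-compatible model listing.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 if every attempt failed.
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  until "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
  return 0
}

# Possible usage in place of `sleep 3` (assumed endpoint, not confirmed by this recipe):
#   retry 10 1 curl -sf -o /dev/null http://localhost:8000/v1/models
```

The helper uses only POSIX shell features, so it should behave the same under `sh` and `bash`.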

Benchmark

The checked-in perf.yaml downloads the Mooncake conversation trace (FAST25; 12,031 requests over ~59 minutes) and replays it on a fixed schedule, so throughput is set by the trace and latency is what you measure. It wraps this AIPerf run:

$aiperf profile -m Qwen/Qwen3-32B \
> --input-file conversation_trace.jsonl \
> --custom-dataset-type mooncake_trace \
> --fixed-schedule \
> --url http://disagg-router-6p-2d-frontend:8000 \
> --streaming \
> --goodput "time_to_first_token:2000 inter_token_latency:25"

Start the benchmark by applying the manifest:

$kubectl apply -f recipes/qwen3-32b/vllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
$
$# The benchmark runs in a tmux session; attach to watch intermediate results.
$kubectl get pods -n ${NAMESPACE} | grep benchmark
$kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

Results land on the perf-cache PVC under /perf-cache/artifacts/.
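
To make the goodput criterion concrete: a request counts toward goodput only if it meets both limits, TTFT at or under 2000 ms and ITL at or under 25 ms. The sketch below applies that rule to per-request latencies. The two-column input format and the `count_good` helper are hypothetical and are not the AIPerf artifact format; the three sample rows are made up for illustration.

```shell
# count_good TTFT_LIMIT_MS ITL_LIMIT_MS
# Reads "ttft_ms itl_ms" pairs from stdin (hypothetical format, one request per line)
# and prints "<requests meeting both limits>/<total requests>".
count_good() {
  awk -v ttft_limit="$1" -v itl_limit="$2" '
    $1 <= ttft_limit && $2 <= itl_limit { good++ }
    END { printf "%d/%d\n", good, NR }
  '
}

# Three made-up requests checked against the recipe's limits (2000 ms TTFT, 25 ms ITL).
SAMPLE_RESULT=$(printf '%s\n' "1500 20" "2500 20" "1800 30" | count_good 2000 25)
echo "$SAMPLE_RESULT"
```

Only the first sample row passes: the second misses the TTFT limit and the third misses the ITL limit. AIPerf itself reports goodput from the `--goodput` thresholds, so this helper is only a way to reason about the rule.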

  • Qwen3-32B KV routing A/B — the evidence behind this recipe: this configuration vs. aggregated round-robin on the same trace.

Notes

  • The aggregated round-robin manifest is a benchmark control and lives on the related Feature Benchmarks page, not in this recipe.
  • For 80 GB GPUs (H100), use the Qwen3-32B FP8 recipe — BF16 weights are ~64 GB and leave little KV headroom.
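
The ~64 GB figure in the last note is parameter count times bytes per parameter. A rough sketch of that arithmetic follows, assuming the nominal 32 billion parameters from the model name and treating 1 billion one-byte parameters as about 1 GB. It compares the full checkpoint against a single 80 GB budget, as the note does, and ignores tensor-parallel sharding, activations, and runtime overhead.

```shell
# Nominal parameter count in billions, taken from the model name (rounded assumption).
PARAMS_BILLION=32
BF16_BYTES_PER_PARAM=2
FP8_BYTES_PER_PARAM=1
GPU_MEMORY_GB=80   # H100 80 GB

# 1 billion parameters at 1 byte each is about 1 GB.
BF16_WEIGHTS_GB=$((PARAMS_BILLION * BF16_BYTES_PER_PARAM))
FP8_WEIGHTS_GB=$((PARAMS_BILLION * FP8_BYTES_PER_PARAM))

echo "BF16: ${BF16_WEIGHTS_GB} GB weights, $((GPU_MEMORY_GB - BF16_WEIGHTS_GB)) GB left of ${GPU_MEMORY_GB}"
echo "FP8:  ${FP8_WEIGHTS_GB} GB weights, $((GPU_MEMORY_GB - FP8_WEIGHTS_GB)) GB left of ${GPU_MEMORY_GB}"
```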

Source