Qwen3-32B FP8 | NVIDIA Dynamo Documentation

Each target below is a validated FP8 deployment of Qwen3-32B, ranging from a 2-GPU TensorRT-LLM aggregated worker to 8-GPU disaggregated prefill/decode setups on TensorRT-LLM or vLLM, and each ships with a checked-in AIPerf benchmark Job. The targets use different traffic shapes and GPU counts, so this page is not a backend benchmark. Pick one target and use the commands for that target in each section; where a step lists three variants, they appear in the order TRT-LLM aggregated, TRT-LLM disaggregated, vLLM disaggregated.

Choose your deployment target

TRT-LLM aggregated (2 GPU), recommended
Checkpoint: Qwen/Qwen3-32B-FP8
GPUs: 2x H100/H200/A100, single TP2 worker
Routing: Round-robin
Techniques: CUDA graphs, FP8 KV cache
Workload: 4K ISL / 500 OSL, concurrency 4

TRT-LLM disaggregated (8 GPU)
Checkpoint: Qwen/Qwen3-32B-FP8
GPUs: 8x H100/H200/A100
Topology: 4x prefill (TP1) + 2x decode (TP2)
Routing: Round-robin
Workload: 4K ISL / 500 OSL, concurrency 48

vLLM disaggregated (8 GPU)
Checkpoint: Qwen/Qwen3-32B-FP8
GPUs: 8x H100/H200/A100, single node
Topology: 2x prefill (TP2) + 1x decode (TP4)
KV transfer: NIXL (NixlConnector)
Workload: 2K ISL / 500 OSL, concurrency 8

Prerequisites

TRT-LLM aggregated: a Kubernetes cluster with the Dynamo platform installed and 2x H100/H200/A100-class GPUs available.
TRT-LLM disaggregated: a Kubernetes cluster with the Dynamo platform installed and 8x H100/H200/A100-class GPUs available.
vLLM disaggregated: a Kubernetes cluster with the Dynamo platform installed and 8x H100/H200/A100-class GPUs on a single node. All prefill and decode workers must be co-located for NIXL KV transfer.
All targets: a Hugging Face token with access to Qwen/Qwen3-32B-FP8.

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.
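Before editing node selectors, it can help to confirm which nodes advertise enough GPUs for the chosen target. The following is a minimal sketch and not part of the recipe: `nodes_with_gpus` is a hypothetical helper, and it assumes GPUs are exposed as the `nvidia.com/gpu` extended resource. It reports allocatable capacity, not GPUs that are currently free.

```shell
# Sketch: list nodes whose allocatable GPU count is at least the given minimum.
# Assumption: GPUs are advertised under the resource name nvidia.com/gpu.
nodes_with_gpus() {
  local min_gpus="${1:-1}"
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' |
    # Nodes without the resource print "<none>", which awk treats as 0.
    awk -v min="${min_gpus}" '$2 + 0 >= min { print $1, $2 }'
}
```

For the vLLM disaggregated target, `nodes_with_gpus 8` would list the nodes that could host all workers on one node; for the TRT-LLM aggregated target, `nodes_with_gpus 2` is enough.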

Deploy

Prepare the model cache and download the checkpoint:

$ # Edit storageClassName in model-cache/model-cache.yaml to match your cluster first
$ # (kubectl get storageclass).
$ kubectl apply -f recipes/qwen3-32b-fp8/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s
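`kubectl wait --for=condition=Complete` keeps waiting until the timeout even when the Job has already failed. As a hedged alternative, the sketch below polls the Job's conditions and returns early on either outcome. `wait_for_job` is a hypothetical helper, not something the recipe ships, and it assumes `NAMESPACE` is exported as in the prerequisites.

```shell
# Sketch: poll a Job until it reports Complete or Failed, or until a timeout.
# Returns 0 on Complete, 1 on Failed, 2 on timeout.
wait_for_job() {
  local job="$1"
  local timeout="${2:-1800}"   # seconds
  local interval="${3:-10}"    # seconds between polls
  local waited=0 state
  while [ "${waited}" -le "${timeout}" ]; do
    # Collect the types of all conditions whose status is "True".
    state=$(kubectl get "job/${job}" -n "${NAMESPACE}" \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}')
    case "${state}" in
      *Failed*)   echo "job ${job} failed" >&2; return 1 ;;
      *Complete*) echo "job ${job} complete"; return 0 ;;
    esac
    sleep "${interval}"
    waited=$((waited + interval))
  done
  echo "timed out waiting for job ${job}" >&2
  return 2
}
```

Usage would look like `wait_for_job model-download 1800`; on failure, `kubectl logs job/model-download -n ${NAMESPACE}` shows the download output.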

Then apply the deploy manifest for your target only:

TRT-LLM aggregated: a single TP2 TensorRT-LLM worker with round-robin routing and CUDA graphs enabled:

$ kubectl apply -f recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml -n ${NAMESPACE}

TRT-LLM disaggregated: 4x prefill workers (TP1) and 2x decode workers (TP2):

$ kubectl apply -f recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml -n ${NAMESPACE}

vLLM disaggregated: 2x prefill workers (TP2) and 1x decode worker (TP4) using NixlConnector KV transfer; all workers must land on one node:

$ kubectl apply -f recipes/qwen3-32b-fp8/vllm/disagg/deploy.yaml -n ${NAMESPACE}
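The per-target commands on this page differ only in the recipe directory, the frontend service name, and the benchmark Job name. One way to keep them straight is a small selector, sketched below. The function name, the target keys (`trtllm-agg`, `trtllm-disagg`, `vllm-disagg`), and the variable names are illustrative; the values are the ones that appear in this page's commands.

```shell
# Sketch: map a target key to the paths and names used on this page.
select_target() {
  case "$1" in
    trtllm-agg)
      RECIPE_DIR=recipes/qwen3-32b-fp8/trtllm/agg
      FRONTEND_SVC=qwen3-32b-fp8-agg-frontend
      BENCH_JOB=qwen3-32b-fp8-bench
      ;;
    trtllm-disagg)
      RECIPE_DIR=recipes/qwen3-32b-fp8/trtllm/disagg
      FRONTEND_SVC=qwen3-32b-fp8-disagg-frontend
      BENCH_JOB=qwen3-32b-fp8-bench
      ;;
    vllm-disagg)
      RECIPE_DIR=recipes/qwen3-32b-fp8/vllm/disagg
      FRONTEND_SVC=qwen3-32b-fp8-vllm-disagg-frontend
      BENCH_JOB=qwen3-32b-fp8-vllm-disagg-perf
      ;;
    *)
      echo "unknown target: $1" >&2
      return 1
      ;;
  esac
  export RECIPE_DIR FRONTEND_SVC BENCH_JOB
}
```

After `select_target vllm-disagg`, the deploy step becomes `kubectl apply -f "${RECIPE_DIR}/deploy.yaml" -n "${NAMESPACE}"`, and the smoke test and benchmark steps can refer to `svc/${FRONTEND_SVC}` and `job/${BENCH_JOB}` in the same way.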

Smoke Test

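Forward the frontend service for your target to local port 8000. The port-forward runs in the foreground, so leave it running and send the request from a second terminal.

TRT-LLM aggregated:

$ kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}

TRT-LLM disaggregated:

$ kubectl port-forward svc/qwen3-32b-fp8-disagg-frontend 8000:8000 -n ${NAMESPACE}

vLLM disaggregated:

$ kubectl port-forward svc/qwen3-32b-fp8-vllm-disagg-frontend 8000:8000 -n ${NAMESPACE}

Then send a test request to verify the deployment serves traffic: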

$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"Qwen/Qwen3-32B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'
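If the first request is sent while the deployment is still starting, it may fail or return a non-200 status. The sketch below retries the same endpoint until it answers with HTTP 200. `wait_for_model` is a hypothetical helper; the default URL assumes the port-forward above is active, and the retry count and delay are arbitrary choices.

```shell
# Sketch: poll the chat completions endpoint until it returns HTTP 200.
wait_for_model() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-30}"
  local delay="${3:-10}"   # seconds between attempts
  local body='{"model":"Qwen/Qwen3-32B-FP8","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'
  local i=1 code=000
  while [ "${i}" -le "${attempts}" ]; do
    # -w prints only the status code; a connection failure is recorded as 000.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: application/json' -d "${body}" \
      "${base_url}/v1/chat/completions") || code=000
    if [ "${code}" = "200" ]; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempts (last HTTP code: ${code})" >&2
  return 1
}
```

Calling `wait_for_model` with no arguments polls `http://localhost:8000` up to 30 times, 10 seconds apart.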

Benchmark

Each target ships its own perf.yaml AIPerf Job. Concurrency is set per GPU (2, 6, and 1 concurrent requests per GPU for the three targets, in page order), and request-count is 10x the concurrency. Artifacts land on the model-cache PVC under /model-cache/perf.
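The sizing arithmetic can be checked with a few lines of shell. This is a sketch only: `bench_size` is a hypothetical helper, request-count = 10x concurrency is the rule stated above, and a warmup count equal to the concurrency is just the pattern seen in the three AIPerf commands on this page.

```shell
# Sketch: derive AIPerf sizing flags from per-GPU concurrency and GPU count.
bench_size() {
  local per_gpu="$1" gpus="$2"
  local concurrency=$((per_gpu * gpus))
  local request_count=$((concurrency * 10))   # rule stated on this page
  local warmup="${concurrency}"               # observed pattern, not a stated rule
  echo "--concurrency ${concurrency} --request-count ${request_count} --warmup-request-count ${warmup}"
}

bench_size 2 2   # TRT-LLM aggregated:    --concurrency 4 --request-count 40 --warmup-request-count 4
bench_size 6 8   # TRT-LLM disaggregated: --concurrency 48 --request-count 480 --warmup-request-count 48
bench_size 1 8   # vLLM disaggregated:    --concurrency 8 --request-count 80 --warmup-request-count 8
```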

TRT-LLM aggregated: the Job wraps this AIPerf run, 4K ISL / 500 OSL at concurrency 4 (2 per GPU x 2 GPUs):

$ aiperf profile \
>   --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://qwen3-32b-fp8-agg-frontend:8000 --streaming \
>   --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 500 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
>   --concurrency 4 --request-count 40 --warmup-request-count 4 \
>   --random-seed 100

$ kubectl apply -f recipes/qwen3-32b-fp8/trtllm/agg/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-32b-fp8-bench -n ${NAMESPACE} --timeout=7200s

TRT-LLM disaggregated: the Job wraps this AIPerf run, 4K ISL / 500 OSL at concurrency 48 (6 per GPU x 8 GPUs):

$ aiperf profile \
>   --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://qwen3-32b-fp8-disagg-frontend:8000 --streaming \
>   --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 500 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
>   --concurrency 48 --request-count 480 --warmup-request-count 48 \
>   --random-seed 100

$ kubectl apply -f recipes/qwen3-32b-fp8/trtllm/disagg/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-32b-fp8-bench -n ${NAMESPACE} --timeout=7200s

vLLM disaggregated: the Job wraps this AIPerf run, 2K ISL / 500 OSL at concurrency 8 (1 per GPU x 8 GPUs). Note the shorter input length than the TRT-LLM targets:

$ aiperf profile \
>   --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://qwen3-32b-fp8-vllm-disagg-frontend:8000 --streaming \
>   --synthetic-input-tokens-mean 2000 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 500 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
>   --concurrency 8 --request-count 80 --warmup-request-count 8 \
>   --random-seed 100

$ kubectl apply -f recipes/qwen3-32b-fp8/vllm/disagg/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-32b-fp8-vllm-disagg-perf -n ${NAMESPACE} --timeout=7200s

Compare All Targets

All three targets serve Qwen/Qwen3-32B-FP8; they differ in runtime, GPU count, topology, and benchmark traffic:

Setting	TRT-LLM aggregated	TRT-LLM disaggregated	vLLM disaggregated
GPUs	2x H100/H200/A100	8x H100/H200/A100	8x H100/H200/A100, single node
Topology	1x worker, TP2	4x prefill (TP1) + 2x decode (TP2)	2x prefill (TP2) + 1x decode (TP4)
Routing / KV transfer	Round-robin	Round-robin	NIXL (NixlConnector)
Workload (ISL / OSL)	4K / 500, concurrency 4	4K / 500, concurrency 48	2K / 500, concurrency 8

Notes

This page is the FP8 alternative to the BF16 Qwen3-32B recipe.
The TRT-LLM and vLLM targets use different traffic shapes (4K vs 2K ISL) and different GPU counts; normalize traffic before making backend performance claims.
The TRT-LLM aggregated config enables CUDA graphs to cut kernel-launch overhead and stores the KV cache in FP8 to reduce memory use.
--max-model-len 8192 is set in vllm/disagg/deploy.yaml for A100 40 GB compatibility; remove or increase it on H100/H200.
Update storageClassName in model-cache/model-cache.yaml before deploying.
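To illustrate the note on normalizing traffic, the sketch below parameterizes the AIPerf invocation so the same input length and concurrency can be sent to any frontend. All flags are copied from the commands on this page; the function name and argument order are illustrative, and the request and warmup counts follow the 10x and 1x patterns used by the checked-in Jobs.

```shell
# Sketch: one AIPerf run with the traffic shape passed as arguments.
# Usage: run_aiperf <frontend-url> <input-tokens> <concurrency>
run_aiperf() {
  local url="$1" isl="$2" concurrency="$3"
  aiperf profile \
    --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
    --endpoint-type chat --endpoint /v1/chat/completions \
    --url "${url}" --streaming \
    --synthetic-input-tokens-mean "${isl}" --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 500 --output-tokens-stddev 0 \
    --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
    --concurrency "${concurrency}" --request-count "$((concurrency * 10))" \
    --warmup-request-count "${concurrency}" \
    --random-seed 100
}
```

For example, `run_aiperf http://qwen3-32b-fp8-disagg-frontend:8000 2000 8` would send the vLLM target's 2K ISL, concurrency 8 traffic to the TRT-LLM disaggregated frontend. The service hostnames resolve only inside the cluster, so this assumes the command runs from a pod in the same namespace. Even with matched traffic, the two 8-GPU targets still differ in topology.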

Source

Source README: recipes/qwen3-32b-fp8/README.md
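TRT-LLM aggregated: recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml and recipes/qwen3-32b-fp8/trtllm/agg/perf.yaml
TRT-LLM disaggregated: recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml and recipes/qwen3-32b-fp8/trtllm/disagg/perf.yaml
vLLM disaggregated: recipes/qwen3-32b-fp8/vllm/disagg/deploy.yaml and recipes/qwen3-32b-fp8/vllm/disagg/perf.yaml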
Setup assets: model-cache/model-cache.yaml and model-cache/model-download.yaml