Qwen3-32B FP8

Serve Qwen/Qwen3-32B-FP8 with Dynamo on TensorRT-LLM or vLLM, aggregated or disaggregated.


Each target below is a validated FP8 deployment of Qwen3-32B, ranging from a 2-GPU TensorRT-LLM aggregate to 8-GPU disaggregated prefill/decode setups on TensorRT-LLM or vLLM. Each ships with a checked-in AIPerf benchmark Job. The targets use different traffic shapes and GPU counts, so this page is not a backend-versus-backend benchmark. Pick one target and run only the commands that match it.

Choose your deployment target

TensorRT-LLM aggregated
  • Checkpoint: Qwen/Qwen3-32B-FP8
  • GPUs: 2x H100/H200/A100, single TP2 worker
  • Routing: Round-robin
  • Techniques: CUDA graphs, FP8 KV cache
  • Workload: 4K ISL / 500 OSL, concurrency 4

TensorRT-LLM disaggregated
  • Checkpoint: Qwen/Qwen3-32B-FP8
  • GPUs: 8x H100/H200/A100
  • Topology: 4x prefill (TP1) + 2x decode (TP2)
  • Routing: Round-robin
  • Workload: 4K ISL / 500 OSL, concurrency 48

vLLM disaggregated
  • Checkpoint: Qwen/Qwen3-32B-FP8
  • GPUs: 8x H100/H200/A100, single node
  • Topology: 2x prefill (TP2) + 1x decode (TP4)
  • KV transfer: NIXL (NixlConnector)
  • Workload: 2K ISL / 500 OSL, concurrency 8

Prerequisites

  • A Kubernetes cluster with the Dynamo platform installed and H100/H200/A100-class GPUs available:
      • TensorRT-LLM aggregated: 2x GPUs.
      • TensorRT-LLM disaggregated: 8x GPUs.
      • vLLM disaggregated: 8x GPUs on a single node. All prefill and decode workers must be co-located for NIXL KV transfer.
  • A Hugging Face token with access to Qwen/Qwen3-32B-FP8.

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token" \
> -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint:

$# Edit storageClassName in model-cache/model-cache.yaml to match your cluster first
$# (kubectl get storageclass).
$kubectl apply -f recipes/qwen3-32b-fp8/model-cache/ -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s

Then apply the deploy.yaml for your target (one of the following):

TensorRT-LLM aggregated, with a single TP2 worker, round-robin routing, and CUDA graphs enabled:

$kubectl apply -f recipes/qwen3-32b-fp8/trtllm/agg/deploy.yaml -n ${NAMESPACE}

TensorRT-LLM disaggregated, with 4x prefill workers (TP1) and 2x decode workers (TP2):

$kubectl apply -f recipes/qwen3-32b-fp8/trtllm/disagg/deploy.yaml -n ${NAMESPACE}

vLLM disaggregated, with 2x prefill workers (TP2) and 1x decode worker (TP4) using NixlConnector KV transfer; all workers must land on one node:

$kubectl apply -f recipes/qwen3-32b-fp8/vllm/disagg/deploy.yaml -n ${NAMESPACE}

Smoke Test

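Send a test request to verify that the deployment serves traffic. Port-forward the frontend Service for your target (run only one of the three port-forward commands; it blocks, so send the request from a second terminal):

$# TensorRT-LLM aggregated
$kubectl port-forward svc/qwen3-32b-fp8-agg-frontend 8000:8000 -n ${NAMESPACE}
$# TensorRT-LLM disaggregated
$kubectl port-forward svc/qwen3-32b-fp8-disagg-frontend 8000:8000 -n ${NAMESPACE}
$# vLLM disaggregated
$kubectl port-forward svc/qwen3-32b-fp8-vllm-disagg-frontend 8000:8000 -n ${NAMESPACE}
$# In a second terminal, send the request: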
$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"Qwen/Qwen3-32B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'

Benchmark

Each target ships its own perf.yaml AIPerf Job. Concurrency is set per GPU for each target (2, 6, or 1 concurrent requests per GPU), with request-count = 10x concurrency and warmup-request-count equal to concurrency. Artifacts land on the model-cache PVC under /model-cache/perf.
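The sizing rule can be checked with a few lines of shell arithmetic. This is a minimal sketch: the function name `bench_sizing` is hypothetical and is not part of AIPerf or the recipes. It only reproduces the load flags that appear in the three runs on this page.

```shell
# Hypothetical helper (not part of the recipes): derive AIPerf load flags
# from a per-GPU concurrency and a GPU count, following the rule on this page.
bench_sizing() {
  per_gpu=$1
  gpus=$2
  concurrency=$(( per_gpu * gpus ))
  request_count=$(( concurrency * 10 ))   # request-count = 10x concurrency
  warmup=$concurrency                     # warmup equals concurrency in the listed runs
  echo "--concurrency ${concurrency} --request-count ${request_count} --warmup-request-count ${warmup}"
}

bench_sizing 2 2   # TensorRT-LLM aggregated:    --concurrency 4 --request-count 40 --warmup-request-count 4
bench_sizing 6 8   # TensorRT-LLM disaggregated: --concurrency 48 --request-count 480 --warmup-request-count 48
bench_sizing 1 8   # vLLM disaggregated:         --concurrency 8 --request-count 80 --warmup-request-count 8
```

If you change the GPU count of a target, recomputing the flags this way keeps the run consistent with the checked-in Jobs.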

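TensorRT-LLM aggregated: the Job wraps the aiperf run shown first, 4K ISL / 500 OSL at concurrency 4 (2 per GPU x 2 GPUs). The kubectl commands after it launch the Job and wait for it to complete: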

$aiperf profile \
> --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
> --endpoint-type chat --endpoint /v1/chat/completions \
> --url http://qwen3-32b-fp8-agg-frontend:8000 --streaming \
> --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 500 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
> --concurrency 4 --request-count 40 --warmup-request-count 4 \
> --random-seed 100
$kubectl apply -f recipes/qwen3-32b-fp8/trtllm/agg/perf.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/qwen3-32b-fp8-bench -n ${NAMESPACE} --timeout=7200s

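TensorRT-LLM disaggregated: the Job wraps the aiperf run shown first, 4K ISL / 500 OSL at concurrency 48 (6 per GPU x 8 GPUs). The kubectl commands after it launch the Job and wait for it to complete: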

$aiperf profile \
> --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
> --endpoint-type chat --endpoint /v1/chat/completions \
> --url http://qwen3-32b-fp8-disagg-frontend:8000 --streaming \
> --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 500 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
> --concurrency 48 --request-count 480 --warmup-request-count 48 \
> --random-seed 100
$kubectl apply -f recipes/qwen3-32b-fp8/trtllm/disagg/perf.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/qwen3-32b-fp8-bench -n ${NAMESPACE} --timeout=7200s

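vLLM disaggregated: the Job wraps the aiperf run shown first, 2K ISL / 500 OSL at concurrency 8 (1 per GPU x 8 GPUs). Note the shorter input length than the TRT-LLM targets. The kubectl commands after it launch the Job and wait for it to complete: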

$aiperf profile \
> --model Qwen/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B-FP8 \
> --endpoint-type chat --endpoint /v1/chat/completions \
> --url http://qwen3-32b-fp8-vllm-disagg-frontend:8000 --streaming \
> --synthetic-input-tokens-mean 2000 --synthetic-input-tokens-stddev 0 \
> --output-tokens-mean 500 --output-tokens-stddev 0 \
> --extra-inputs max_tokens:500 --extra-inputs min_tokens:500 --extra-inputs ignore_eos:true \
> --concurrency 8 --request-count 80 --warmup-request-count 8 \
> --random-seed 100
$kubectl apply -f recipes/qwen3-32b-fp8/vllm/disagg/perf.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/qwen3-32b-fp8-vllm-disagg-perf -n ${NAMESPACE} --timeout=7200s

Compare All Targets

All three targets serve Qwen/Qwen3-32B-FP8; they differ in runtime, GPU count, topology, and benchmark traffic:

| | TRT-LLM aggregated | TRT-LLM disaggregated | vLLM disaggregated |
|---|---|---|---|
| GPUs | 2x H100/H200/A100 | 8x H100/H200/A100 | 8x H100/H200/A100, single node |
| Topology | 1x worker, TP2 | 4x prefill (TP1) + 2x decode (TP2) | 2x prefill (TP2) + 1x decode (TP4) |
| Routing / KV transfer | Round-robin | Round-robin | NIXL (NixlConnector) |
| Workload | 4K / 500, conc. 4 | 4K / 500, conc. 48 | 2K / 500, conc. 8 |
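The GPU counts follow from the topology: each worker occupies as many GPUs as its tensor-parallel (TP) degree, so the total is the sum of replicas x TP over all worker groups. A minimal sketch of that arithmetic, using a hypothetical `total_gpus` helper and a made-up `REPLICASxTP` argument format:

```shell
# Hypothetical helper: sum replicas x TP over worker groups given as "REPLICASxTP".
total_gpus() {
  total=0
  for spec in "$@"; do
    replicas=${spec%%x*}   # text before the "x"
    tp=${spec##*x}         # text after the "x"
    total=$(( total + replicas * tp ))
  done
  echo "$total"
}

total_gpus 1x2       # TRT-LLM aggregated: 2
total_gpus 4x1 2x2   # TRT-LLM disaggregated: 4 prefill GPUs + 4 decode GPUs = 8
total_gpus 2x2 1x4   # vLLM disaggregated: 4 prefill GPUs + 4 decode GPUs = 8
```

This is useful when you edit replica counts or TP degrees in a manifest and need to confirm the result still fits the GPUs you have.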

Notes

  • This page is the FP8 alternative to the BF16 Qwen3-32B recipe.
  • The TRT-LLM and vLLM targets use different traffic shapes (4K vs 2K ISL) and different GPU counts; normalize traffic before making backend performance claims.
  • The TensorRT-LLM aggregated config enables CUDA graphs to reduce per-step kernel-launch overhead and stores the KV cache in FP8 to reduce its memory footprint.
  • --max-model-len 8192 is set in vllm/disagg/deploy.yaml for A100 40 GB compatibility; remove or increase it on H100/H200.
  • Update storageClassName in model-cache/model-cache.yaml before deploying.
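If you normalize traffic by raising the vLLM input length, the prompt plus the generated tokens must still fit within --max-model-len. The check below is a rough sketch with a hypothetical `fits_context` helper. It counts only the synthetic input and output tokens, so it ignores the few extra tokens a chat template adds; leave some headroom.

```shell
# Hypothetical helper: succeed when input + output tokens fit in the context window.
fits_context() {
  isl=$1
  osl=$2
  max_len=$3
  [ $(( isl + osl )) -le "$max_len" ]
}

# Checked-in vLLM workload: 2000 + 500 = 2500 tokens against a limit of 8192.
if fits_context 2000 500 8192; then echo "fits"; else echo "exceeds max-model-len"; fi

# TRT-LLM-style 4K input on the same limit: 4000 + 500 = 4500 tokens.
if fits_context 4000 500 8192; then echo "fits"; else echo "exceeds max-model-len"; fi
```

Both cases print "fits", so the 8192 limit only becomes a constraint for inputs well beyond the workloads on this page.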
