Qwen3-235B-A22B FP8

Serve Qwen/Qwen3-235B-A22B-FP8 with Dynamo and TensorRT-LLM on 16 Hopper or Blackwell GPUs.

Each target below is a validated TensorRT-LLM deployment of Qwen3-235B-A22B-FP8, a 235B-parameter Mixture-of-Experts model with ~22B active parameters per token, on 16 GPUs with KV-aware routing, benchmarked at 4K ISL / 200 OSL (input / output sequence length, in tokens). Hopper and Blackwell need different MoE backends, and you can serve either aggregated or with prefill/decode disaggregation. Pick your GPU architecture and serving topology, then use the commands whose manifest path matches that target (for example, trtllm/agg/blackwell).

Choose your deployment target

GPU: Blackwell (B100/B200) or Hopper (H100/H200)
Topology: aggregated or disaggregated

Blackwell, aggregated

  • Checkpoint: Qwen/Qwen3-235B-A22B-FP8
  • GPUs: 16x B100/B200, 4 workers
  • Parallelism: TP4 x EP4 per worker
  • MoE backend: DEEPGEMM (required on SM100+)
  • Routing: KV-aware
  • Workload: 4K ISL / 200 OSL, concurrency 32

Blackwell, disaggregated

  • Checkpoint: Qwen/Qwen3-235B-A22B-FP8
  • GPUs: 16x B100/B200
  • Parallelism: 6x prefill (TP2) + 1x decode (TP4 x EP4)
  • MoE backend: DEEPGEMM (required on SM100+)
  • Routing: KV-aware
  • Workload: 4K ISL / 200 OSL, concurrency 32

Hopper, aggregated

  • Checkpoint: Qwen/Qwen3-235B-A22B-FP8
  • GPUs: 16x H100/H200, 4 workers
  • Parallelism: TP4 x EP4 per worker
  • MoE backend: CUTLASS (TRT-LLM default)
  • Routing: KV-aware
  • Workload: 4K ISL / 200 OSL, concurrency 32

Hopper, disaggregated

  • Checkpoint: Qwen/Qwen3-235B-A22B-FP8
  • GPUs: 16x H100/H200
  • Parallelism: 6x prefill (TP2) + 1x decode (TP4 x EP4)
  • MoE backend: CUTLASS (TRT-LLM default)
  • Routing: KV-aware
  • Workload: 4K ISL / 200 OSL, concurrency 32
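
Both topologies add up to the same 16 GPUs. The sketch below spells out that arithmetic, assuming that expert parallelism (EP4) shards experts across the same GPUs a worker already uses for tensor parallelism (TP4) and adds none of its own, which is what the 16-GPU totals imply. The variable names are illustrative only.

```shell
# GPU accounting for the two topologies.
# TP size = number of GPUs one worker spans; EP is assumed to reuse those GPUs.
agg_gpus=$(( 4 * 4 ))              # 4 workers x TP4
disagg_gpus=$(( 6 * 2 + 1 * 4 ))   # 6 prefill workers x TP2 + 1 decode worker x TP4
echo "aggregated: ${agg_gpus} GPUs, disaggregated: ${disagg_gpus} GPUs"
```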

Prerequisites

  • A Kubernetes cluster with the Dynamo platform installed and the 16 GPUs for your target available: 16x B100/B200 for Blackwell, or 16x H100/H200 for Hopper (~1.3 TB total GPU VRAM).
  • A Hugging Face token with access to Qwen/Qwen3-235B-A22B-FP8.
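
One way to check the GPU prerequisite before deploying is to sum the GPUs that the nodes advertise. This is a rough sketch, not part of the recipe: it assumes GPUs are exposed as the nvidia.com/gpu extended resource, and the helper names sum_gpus and cluster_gpu_count are hypothetical. Allocatable GPUs are not necessarily free, so treat the result as an upper bound.

```shell
# Sum one integer per line from stdin; blank lines (nodes without GPUs) count as 0.
sum_gpus() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical helper: total allocatable GPUs across all nodes.
# Assumes GPUs are advertised as the nvidia.com/gpu extended resource.
cluster_gpu_count() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    | sum_gpus
}

# Usage (needs cluster access):
#   [ "$(cluster_gpu_count)" -ge 16 ] && echo "enough GPUs" || echo "fewer than 16 GPUs"
```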

Create the namespace and token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="your-token" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, node selectors, resource claims, and cluster-specific placement in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint (30-60 minutes):

$ # Edit storageClassName in model-cache/model-cache.yaml to match your cluster first
$ # (kubectl get storageclass).
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
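
kubectl wait --for=condition=Complete only watches for success, so if the download Job fails it keeps waiting until the timeout expires. A possible alternative, sketched below, polls for both the Complete and Failed conditions. The function name wait_for_job and its return codes are illustrative choices, not part of the recipe.

```shell
# Poll a Job until it reports Complete or Failed, or until the timeout expires.
# Returns 0 on Complete, 1 on Failed, 2 on timeout. Expects NAMESPACE to be set.
wait_for_job() {
  job="$1"; timeout_s="${2:-3600}"; interval_s="${3:-15}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    for cond in Complete Failed; do
      # Prints "True" when the Job carries this condition, otherwise nothing.
      state=$(kubectl get "job/${job}" -n "${NAMESPACE}" \
        -o jsonpath="{.status.conditions[?(@.type==\"${cond}\")].status}")
      if [ "$state" = "True" ]; then
        echo "job/${job}: ${cond}"
        if [ "$cond" = "Complete" ]; then return 0; else return 1; fi
      fi
    done
    sleep "$interval_s"
    elapsed=$(( elapsed + interval_s ))
  done
  echo "job/${job}: timed out after ${timeout_s}s"
  return 2
}

# Usage (needs cluster access):
#   wait_for_job model-download 3600
```

If the function returns 1, kubectl logs job/model-download -n ${NAMESPACE} is a reasonable first place to look.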

Then deploy by applying the one manifest that matches your target:

Blackwell, aggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/deploy.yaml -n ${NAMESPACE}

Blackwell, disaggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/deploy.yaml -n ${NAMESPACE}

Hopper, aggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/deploy.yaml -n ${NAMESPACE}

Hopper, disaggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/deploy.yaml -n ${NAMESPACE}
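
The four manifest paths differ only in the topology and GPU segments, so a small helper can build the path and reject typos before anything is applied. This is a convenience sketch; recipe_dir is a hypothetical name, and the path layout is taken from the commands on this page.

```shell
# Map a GPU architecture and topology to the recipe directory used on this page.
# GPU: blackwell | hopper; topology: agg | disagg.
recipe_dir() {
  gpu="$1"; topology="$2"
  case "$gpu" in
    blackwell|hopper) ;;
    *) echo "unknown GPU: $gpu" >&2; return 1 ;;
  esac
  case "$topology" in
    agg|disagg) ;;
    *) echo "unknown topology: $topology" >&2; return 1 ;;
  esac
  echo "recipes/qwen3-235b-a22b-fp8/trtllm/${topology}/${gpu}"
}

# Usage (needs cluster access):
#   kubectl apply -f "$(recipe_dir hopper disagg)/deploy.yaml" -n "${NAMESPACE}"
```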

Smoke Test

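Send a test request to verify the deployment serves traffic. First forward the frontend service for your topology to local port 8000. kubectl port-forward runs in the foreground, so leave it running and use a second terminal for the request.

Aggregated:
$ kubectl port-forward svc/qwen3-235b-a22b-agg-frontend 8000:8000 -n ${NAMESPACE}

Disaggregated:
$ kubectl port-forward svc/qwen3-235b-a22b-disagg-frontend 8000:8000 -n ${NAMESPACE}

Then send the request:

$ curl http://localhost:8000/v1/chat/completions \
>   -H 'Content-Type: application/json' \
>   -d '{"model":"Qwen/Qwen3-235B-A22B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}'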
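
For scripted checks it can be easier to test the HTTP status than to read the response body. The sketch below wraps the same request and succeeds only on HTTP 200. The function name smoke_test is hypothetical, and it assumes the port-forward to localhost:8000 is already running.

```shell
# Return 0 only if the chat completions endpoint answers with HTTP 200.
smoke_test() {
  base_url="${1:-http://localhost:8000}"
  # -o /dev/null discards the body; -w prints just the status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "${base_url}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model":"Qwen/Qwen3-235B-A22B-FP8","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":64}')
  echo "HTTP ${code}"
  [ "$code" = "200" ]
}

# Usage:
#   smoke_test && echo "deployment is serving" || echo "not ready yet"
```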

Benchmark

Every target ships a perf.yaml AIPerf Job sized at 4K ISL / 200 OSL and concurrency 32 (2 per GPU x 16 GPUs) with --request-count 320, so aggregated vs disaggregated results are comparable within the same hardware architecture. Artifacts land on the model-cache PVC under /model-cache/perf.
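
The sizing above is simple arithmetic, worked through here as a quick sanity check. The token totals assume every request uses exactly 4000 input and 200 output tokens, which is what the AIPerf flags on this page request (zero standard deviation, fixed max_tokens and min_tokens). The variable names are illustrative.

```shell
gpus=16
per_gpu=2
concurrency=$(( per_gpu * gpus ))          # 2 in-flight requests per GPU
request_count=$(( concurrency * 10 ))      # 10 requests per concurrency slot
isl=4000; osl=200                          # token counts used by the AIPerf flags
input_tokens=$(( request_count * isl ))    # total prompt tokens in the measured run
output_tokens=$(( request_count * osl ))   # total generated tokens in the measured run
echo "concurrency=${concurrency} requests=${request_count}"
echo "input_tokens=${input_tokens} output_tokens=${output_tokens}"
```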

The aggregated Job wraps this AIPerf run:

$ aiperf profile \
>   --model Qwen/Qwen3-235B-A22B-FP8 --tokenizer Qwen/Qwen3-235B-A22B-FP8 \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://qwen3-235b-a22b-agg-frontend:8000 --streaming \
>   --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 200 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:200 --extra-inputs min_tokens:200 --extra-inputs ignore_eos:true \
>   --concurrency 32 --request-count 320 --warmup-request-count 32 \
>   --random-seed 100

The disaggregated Job wraps this AIPerf run:

$ aiperf profile \
>   --model Qwen/Qwen3-235B-A22B-FP8 --tokenizer Qwen/Qwen3-235B-A22B-FP8 \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://qwen3-235b-a22b-disagg-frontend:8000 --streaming \
>   --synthetic-input-tokens-mean 4000 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 200 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:200 --extra-inputs min_tokens:200 --extra-inputs ignore_eos:true \
>   --concurrency 32 --request-count 320 --warmup-request-count 32 \
>   --random-seed 100

Apply the perf.yaml that matches your deployed target, then wait for its Job to complete:

Blackwell, aggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/blackwell/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-235b-a22b-bench -n ${NAMESPACE} --timeout=7200s

Blackwell, disaggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/blackwell/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-235b-a22b-disagg-bench -n ${NAMESPACE} --timeout=7200s

Hopper, aggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/agg/hopper/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-235b-a22b-bench -n ${NAMESPACE} --timeout=7200s

Hopper, disaggregated:
$ kubectl apply -f recipes/qwen3-235b-a22b-fp8/trtllm/disagg/hopper/perf.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/qwen3-235b-a22b-disagg-bench -n ${NAMESPACE} --timeout=7200s
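
The benchmark Job name depends only on the topology, not on the GPU architecture. A small lookup like the one below (bench_job is a hypothetical helper) keeps follow-up commands such as log streaming pointed at the right Job.

```shell
# Benchmark Job name for a topology, matching the kubectl wait commands on this page.
bench_job() {
  case "$1" in
    agg)    echo "qwen3-235b-a22b-bench" ;;
    disagg) echo "qwen3-235b-a22b-disagg-bench" ;;
    *)      echo "unknown topology: $1" >&2; return 1 ;;
  esac
}

# Usage (needs cluster access): follow the benchmark output while it runs.
#   kubectl logs -f "job/$(bench_job disagg)" -n "${NAMESPACE}"
```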

Compare All Targets

All four targets serve Qwen/Qwen3-235B-A22B-FP8 on TensorRT-LLM (PyTorch backend) over 16 GPUs with KV-aware routing and the same 4K ISL / 200 OSL benchmark shape:

| Setting | Blackwell aggregated | Blackwell disaggregated | Hopper aggregated | Hopper disaggregated |
| --- | --- | --- | --- | --- |
| GPUs | 16x B100/B200 | 16x B100/B200 | 16x H100/H200 | 16x H100/H200 |
| Topology | 4x workers, TP4 x EP4 | 6x prefill (TP2) + 1x decode (TP4 x EP4) | 4x workers, TP4 x EP4 | 6x prefill (TP2) + 1x decode (TP4 x EP4) |
| MoE backend | DEEPGEMM | DEEPGEMM | CUTLASS (default) | CUTLASS (default) |
| Chunked prefill | Enabled | Disabled | Enabled | Disabled |
| Workload | 4K / 200, conc. 32 | 4K / 200, conc. 32 | 4K / 200, conc. 32 | 4K / 200, conc. 32 |

Notes

  • The Hopper/Blackwell split is required with TRT-LLM 1.3.x: the default CUTLASS MoE backend falls through to a Hopper-specific JIT path on SM100 and crashes, so Blackwell needs moe_config.backend: DEEPGEMM. DEEPGEMM in turn crashes on Hopper due to a scale-factor dtype mismatch — hence two separate variants per topology.
  • Chunked prefill is enabled for the aggregated targets and disabled for the disaggregated targets.
  • All targets use KV-aware routing (--router-mode kv) at the frontend.
  • Model download may take 30-60 minutes; update storageClassName in model-cache/model-cache.yaml before deploying.
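
The backend rule in the first note can be expressed as a lookup on CUDA compute capability: 9.x (Hopper) maps to CUTLASS and 10.0 or higher (SM100+) maps to DEEPGEMM. The sketch below is illustrative only; moe_backend is a hypothetical helper, and the usage line assumes a driver recent enough for nvidia-smi to support the compute_cap query field.

```shell
# Pick the MoE backend this page uses for a given CUDA compute capability.
moe_backend() {
  major="${1%%.*}"    # "9.0" -> "9", "10.0" -> "10"
  if [ "$major" -ge 10 ]; then
    echo "DEEPGEMM"   # required on SM100+ (Blackwell)
  elif [ "$major" -eq 9 ]; then
    echo "CUTLASS"    # TRT-LLM default, used on Hopper
  else
    echo "unsupported compute capability: $1" >&2
    return 1
  fi
}

# Usage on a GPU node:
#   moe_backend "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
```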
