GLM-5 NVFP4

Serve GLM-5 NVFP4 with Dynamo and SGLang, disaggregated across five GB200 nodes.


This recipe serves nvidia/GLM-5-NVFP4 with disaggregated prefill/decode and EAGLE MTP speculative decoding across 5 nodes of 4x GB200 (TP4 prefill, TP16/DP16/EP16 decode). At 512-way concurrency on a 1K ISL / 8K OSL long-output workload it sustains roughly 16.8K output tokens/sec on the standard UCX path and about 19.1K with the AWS EFA variant. Two KV-transfer paths are covered: the standard target (NIXL over UCX) and the AWS EFA target (NIXL LIBFABRIC over EFA RDMA). Where the steps differ, the commands for both targets are given.

Choose your deployment target

Standard (UCX)

  • Checkpoint: nvidia/GLM-5-NVFP4
  • Precision: NVFP4 + FP8 KV cache
  • GPUs: 20x GB200, 5 nodes of 4
  • Parallelism: TP4 prefill, TP16/DP16/EP16 decode
  • KV transfer: NIXL over UCX
  • Workload: 1K ISL / 8K OSL, 512 concurrency

AWS EFA (custom build)

  • Checkpoint: nvidia/GLM-5-NVFP4
  • Precision: NVFP4 + FP8 KV cache
  • GPUs: 20x GB200, 5x p6e-gb200.36xlarge
  • Parallelism: TP4 prefill, TP16/DP16/EP16 decode
  • KV transfer: NIXL LIBFABRIC over EFA RDMA
  • Workload: 1K ISL / 8K OSL, 512 concurrency

Prerequisites

  • A Kubernetes cluster with the Dynamo Operator plus DRA / ComputeDomain support for MNNVL placement.
  • A shared RWX PVC for model weights and FlashInfer JIT artifacts.
  • A Hugging Face token with access to nvidia/GLM-5-NVFP4.
  • Standard (UCX) target: 5x 4xGB200 nodes in an NVL36 or NVL72 domain. The published runtime image (nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.1.1-cuda13) is used as-is.
  • AWS EFA target: 5x p6e-gb200.36xlarge nodes (4x GB200 + 4x EFA NICs each) or equivalent GB200 in an MNNVL domain.
  • AWS EFA target: AWS EFA driver 3.0.0g or newer on the nodes (default on modern AWS EKS AMIs).
  • AWS EFA target: a container registry you can push to. This variant requires building a custom image from Dockerfile.efa before deploying; there is no prebuilt EFA image.

Create the namespace and token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="your-token" \
> -n ${NAMESPACE}

The manifests use standard NVIDIA GPU Feature Discovery labels to select GB200 nodes and include common GPU/ARM tolerations. If your cluster uses different labels, taints, or storage classes, update nodeSelector, tolerations, and storageClassName before deploying.
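Before editing nodeSelector, tolerations, or storageClassName, it can help to see what the cluster actually exposes. The following is a minimal sketch; show_scheduling_inputs is a hypothetical helper name, and nvidia.com/gpu.product is assumed to be the GPU Feature Discovery label the manifests select on, so check the nodeSelector in your deploy.yaml for the real key.

```shell
# Sketch: print the cluster facts the manifests depend on.
# Assumption: nvidia.com/gpu.product is the GFD label used in nodeSelector;
# confirm against the nodeSelector in your deploy.yaml.
show_scheduling_inputs() {
  # GPU product and CPU architecture as extra columns per node.
  kubectl get nodes -L nvidia.com/gpu.product -L kubernetes.io/arch
  # Taint keys per node, to compare with the manifests' tolerations.
  kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
  # Available storage classes; pick one that supports ReadWriteMany.
  kubectl get storageclass
}
```

Run show_scheduling_inputs once and compare its output with the values in the manifests before applying anything.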

Deploy

Create the model-cache PVC and download the weights (shared by both targets):

$# Edit model-cache.yaml first and set storageClassName to a RWX storage class.
$kubectl apply -f recipes/glm-5-nvfp4/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$kubectl apply -f recipes/glm-5-nvfp4/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=complete job/model-download -n ${NAMESPACE} --timeout=3600s

If your cluster already provides a shared RWX cache PVC, skip model-cache.yaml and update claimName: model-cache in the download, deploy, and perf manifests, keeping the mount path as /model-store.

For the standard (UCX) target, edit sglang/disagg/deploy.yaml and replace the <your-namespace> placeholder, then:

$kubectl apply -f recipes/glm-5-nvfp4/sglang/disagg/deploy.yaml
$kubectl wait --for=condition=Ready pod \
> -l nvidia.com/dynamo-graph-deployment-name=glm5-sglang \
> -n ${NAMESPACE} --timeout=7200s

First cold starts can take up to about an hour while the runtime loads weights and JIT-compiles FlashInfer/DeepGEMM kernels:

$kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/part-of=glm5-sglang -w

For the AWS EFA target, build the custom container image first. It bakes a patched libfabric (ofiwg/libfabric#12019) into the runtime so that fi_mr_reg on CUDA VRAM succeeds on GB200’s 64K-page arm64 kernel:

$docker buildx build \
> --platform linux/arm64 \
> --build-arg ARCH=arm64 \
> -t <your-registry>/sglang-dynamo-glm5-efa:latest \
> -f recipes/glm-5-nvfp4/sglang/disagg/efa/Dockerfile.efa \
> --push .

Then edit sglang/disagg/efa/deploy.yaml to replace <your-namespace> and the image placeholder, apply it, and watch the pods start:

$kubectl apply -f recipes/glm-5-nvfp4/sglang/disagg/efa/deploy.yaml
$kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/part-of=glm5-sglang-efa -w

After startup, verify that the LIBFABRIC backend is carrying KV traffic rather than silently falling back to TCP. The EFA README includes three checks: the NIXL backend log line, an executable libplugin_LIBFABRIC.so mapping, and nixl_num_failed_transfers_total staying at 0.
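Two of those checks reduce to filtering text, which the sketch below illustrates. It is not the README's procedure: both function names are hypothetical, the metrics are assumed to be in Prometheus text format without trailing timestamps, and where the worker exposes them is left as a placeholder.

```shell
# Mapping check (sketch): succeed if the process memory map read from stdin
# contains an executable mapping of the LIBFABRIC plugin. Lines in
# /proc/<pid>/maps have the form: address perms offset dev inode pathname
has_exec_libfabric_mapping() {
  awk '$NF ~ /libplugin_LIBFABRIC\.so/ && $2 ~ /x/ { found = 1 } END { exit !found }'
}

# Counter check (sketch): sum every nixl_num_failed_transfers_total sample in
# the metrics text read from stdin; prints 0 when no sample is present.
# Assumes the sample value is the last field (no trailing timestamp).
failed_transfers_total() {
  awk '/^nixl_num_failed_transfers_total/ { sum += $NF } END { printf "%d\n", sum }'
}

# Usage inside a worker container (PID and metrics URL are placeholders):
#   has_exec_libfabric_mapping < /proc/<worker-pid>/maps && echo "plugin mapped"
#   curl -s <metrics-url> | failed_transfers_total
```

A non-zero sum from the second filter, or a failing first filter, is a signal to work through the full checks in the EFA README.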

Smoke Test

Send a test request to verify the deployment serves traffic:

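$# Forward only the frontend for the target you deployed; both commands bind local port 8000.
$# Standard (UCX) target:
$kubectl port-forward svc/glm5-sglang-frontend 8000:8000 -n ${NAMESPACE} &
$# AWS EFA target:
$kubectl port-forward svc/glm5-sglang-efa-frontend 8000:8000 -n ${NAMESPACE} &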
$curl http://localhost:8000/v1/chat/completions \
> -H 'Content-Type: application/json' \
> -d '{"model":"nvidia/GLM-5-NVFP4","messages":[{"role":"user","content":"Write a one-sentence readiness check."}],"max_tokens":128}'
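Because the port-forward runs in the background, the first request can race it and fail with a connection error. One way to avoid that is a small retry wrapper, sketched here; retry is a hypothetical helper, not part of the recipe, and the /v1/models path in the usage comment is an assumption about the OpenAI-compatible frontend.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...   (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed, sleeping
# DELAY_SECONDS between runs; returns the status of the last run.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while :; do
    "$@" && return 0
    retry_status=$?
    [ "$retry_i" -ge "$retry_attempts" ] && return "$retry_status"
    retry_i=$((retry_i + 1))
    sleep "$retry_delay"
  done
}

# Usage: wait up to about a minute for the forwarded port to answer.
#   retry 30 2 curl -sf -o /dev/null http://localhost:8000/v1/models
```

Once the wrapper returns 0, send the chat completion request shown above.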

Benchmark

Both targets ship a perf.yaml Kubernetes Job with the same workload shape: ISL=1000, OSL=8192, concurrency=512 (32 per decode GPU), 1,536 requests.
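The figures in that workload shape follow from integer arithmetic, which this short sketch reproduces. The variable and function names are illustrative only; the 16 decode GPUs come from the 4 decode nodes of 4 GB200 each.

```shell
# Workload shape from the perf.yaml description; names are illustrative.
CONCURRENCY=512
DECODE_GPUS=16   # 4 decode nodes x 4 GB200 each (TP16/DP16/EP16)
REQUESTS=1536
OSL=8192

per_gpu_concurrency() { echo $(( CONCURRENCY / DECODE_GPUS )); }
requests_per_slot()   { echo $(( REQUESTS / CONCURRENCY )); }
total_output_tokens() { echo $(( REQUESTS * OSL )); }

echo "in-flight requests per decode GPU: $(per_gpu_concurrency)"
echo "requests per concurrency slot:     $(requests_per_slot)"
echo "output tokens requested in total:  $(total_output_tokens)"
```

Each of the 512 concurrency slots therefore serves three requests back to back over the run.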

For the standard (UCX) target, the Job wraps this AIPerf run:

$aiperf profile \
> --model nvidia/GLM-5-NVFP4 --tokenizer nvidia/GLM-5-NVFP4 \
> --endpoint-type chat --endpoint /v1/chat/completions --streaming \
> --url http://glm5-sglang-frontend:8000 \
> --synthetic-input-tokens-mean 1000 --output-tokens-mean 8192 \
> --extra-inputs max_tokens:8192 --extra-inputs min_tokens:8192 --extra-inputs ignore_eos:true \
> --concurrency 512 --request-count 1536

Edit sglang/disagg/perf.yaml to replace the namespace placeholder, then:

$kubectl apply -f recipes/glm-5-nvfp4/sglang/disagg/perf.yaml
$kubectl logs -f -l job-name=glm5-disagg-bench -n ${NAMESPACE}

For the AWS EFA target, the Job wraps this AIPerf run:

$aiperf profile \
> --model nvidia/GLM-5-NVFP4 --tokenizer nvidia/GLM-5-NVFP4 \
> --endpoint-type chat --endpoint /v1/chat/completions --streaming \
> --url http://glm5-sglang-efa-frontend:8000 \
> --synthetic-input-tokens-mean 1000 --output-tokens-mean 8192 \
> --extra-inputs max_tokens:8192 --extra-inputs min_tokens:8192 --extra-inputs ignore_eos:true \
> --concurrency 512 --request-count 1536

Edit sglang/disagg/efa/perf.yaml to replace the namespace placeholder, then:

$kubectl apply -f recipes/glm-5-nvfp4/sglang/disagg/efa/perf.yaml
$kubectl logs -f -l job-name=glm5-disagg-efa-bench -n ${NAMESPACE}

Expected Performance

Reference AIPerf run for the standard target (ISL=1k, OSL=8k, concurrency=512): 1,536 requests, 0 errors, 747.87s benchmark duration. This is a concurrency-burst benchmark, so TTFT includes queueing under 512 concurrent users.

Metric                   Value
Output throughput        16,824 tokens/sec
Request throughput       2.05 requests/sec
TTFT p50                 15,423 ms
ITL avg                  23.31 ms/token
Tokens/user/sec avg      43.39
Request errors           0
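The standard-target throughput figures can be cross-checked from the request count and benchmark duration. This sketch assumes each of the 1,536 requests produced exactly 8,192 output tokens, which the min_tokens, max_tokens, and ignore_eos settings suggest but the source does not state outright. The function names are illustrative.

```shell
# Cross-check of the reference run, assuming every request emits exactly
# 8192 output tokens (min_tokens = max_tokens = 8192, ignore_eos = true).
REQUESTS=1536
OSL=8192
DURATION_S=747.87

# awk performs the floating-point division that shell arithmetic cannot.
implied_output_tps() {
  awk -v r="$REQUESTS" -v o="$OSL" -v d="$DURATION_S" \
    'BEGIN { printf "%.0f\n", r * o / d }'
}
implied_request_rps() {
  awk -v r="$REQUESTS" -v d="$DURATION_S" 'BEGIN { printf "%.2f\n", r / d }'
}

echo "implied output throughput:  $(implied_output_tps) tokens/sec"
echo "implied request throughput: $(implied_request_rps) requests/sec"
```

The first figure should land within about one token/sec of the reported 16,824, and the second should round to the reported 2.05.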

Reference AIPerf run for the EFA variant at the same workload shape (ISL=1k, OSL=8k, concurrency=512; measured on 5x p6e-gb200.36xlarge, EFA driver 3.0.0g):

Metric                    Value
Output throughput         19,131 tokens/sec
TTFT p50                  621 ms
ITL avg                   24.5 ms/token
Output tokens/user/sec    41.0

At long context (ISL=20k, OSL=2k, concurrency=64), the LIBFABRIC backend delivers 39% higher throughput and 56% lower TTFT p50 than the UCX default. Full tables are in the EFA README.

Compare All Targets

Both targets run the same disaggregated topology: 1 prefill node (TP4) plus 4 decode nodes (TP16/DP16/EP16) with EAGLE MTP speculative decoding and FP8 KV cache. They differ only in infrastructure and the KV-transfer backend:

Target               Standard (UCX)                             AWS EFA (custom build)
Hardware             20x GB200 (5x 4-GPU nodes, NVL36/NVL72)    20x GB200 (5x p6e-gb200.36xlarge)
KV transfer          NIXL over UCX                              NIXL LIBFABRIC over EFA RDMA
Container image      Published runtime image                    Custom build from Dockerfile.efa
Extra requirements   -                                          EFA driver 3.0.0g+, privileged containers
Output throughput    16,824 tok/s                               19,131 tok/s
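The relative gap between the two output-throughput figures is a one-line calculation, sketched here with an illustrative function name. Since the two runs differ in infrastructure as well as in KV-transfer backend, the result is a comparison of the two reference runs rather than a measure of the backend alone.

```shell
# Output throughput from the comparison above (tokens/sec).
UCX_TPS=16824
EFA_TPS=19131

# Percentage by which the EFA figure exceeds the UCX figure, one decimal place.
efa_uplift_pct() {
  awk -v a="$UCX_TPS" -v b="$EFA_TPS" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

echo "EFA vs UCX output throughput: +$(efa_uplift_pct)%"
```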

Notes

  • EAGLE MTP speculative decoding (~85-95% accept rate) is enabled by two env vars: SGLANG_ENABLE_SPEC_V2=1 (EAGLEWorkerV2 with overlap scheduler) and SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE=1 (quantizes the BF16 MTP layer to FP8 at load time, matching the base model’s compute path).
  • FP8 KV cache: uses --kv-cache-dtype fp8_e4m3 (the NSA backend auto-selects this on SM100/GB200), saving roughly 50% KV memory vs BF16.
  • FlashInfer JIT cache: the runtime image has no prebuilt flashinfer-jit-cache wheel, so the recipe sets FLASHINFER_WORKSPACE_BASE=/model-store to persist first-run JIT artifacts on the shared PVC for later pod starts.
  • Worker containers run as root because FlashInfer’s bundled cubin package creates TRTLLM MoE symlinks inside its installed package directory during startup. The benchmark pod runs as a non-root user and pins Transformers v5 because nvidia/GLM-5-NVFP4 declares tokenizer_class=TokenizersBackend.
  • The standard target sets UCX_TLS=cuda_copy,cuda_ipc,tcp for NIXL/UCX KV transfer; the EFA variant instead sets SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC and runs containers privileged so fi_mr_reg can pin VRAM for RDMA. Without the env var, SGLang silently falls back to TCP on kernel 6.8+.
  • Recovery caveat: the decode side is one TP16 rank group spread across four nodes. Treat single decode-pod replacement as disruptive and validate full-group recovery before relying on individual decode pod restarts; in validation, deleting one decode worker left the graph NotReady through repeated rank-group reinitialization attempts.
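Several of these notes depend on specific environment variables reaching the worker containers. The sketch below checks `env` output for the standard-target values named above. missing_env is a hypothetical helper, and which containers carry which variable is defined by deploy.yaml, so treat the list as a starting point and trim it per container.

```shell
# Settings named in the notes for the standard (UCX) target.
EXPECTED_ENV='SGLANG_ENABLE_SPEC_V2=1
SGLANG_NVFP4_CKPT_FP8_NEXTN_MOE=1
FLASHINFER_WORKSPACE_BASE=/model-store
UCX_TLS=cuda_copy,cuda_ipc,tcp'

# Read `env` output on stdin and print each expected NAME=VALUE line that is
# absent; returns non-zero if anything is missing.
missing_env() {
  actual=$(cat)
  missing=0
  for kv in $EXPECTED_ENV; do
    if ! printf '%s\n' "$actual" | grep -qxF -- "$kv"; then
      echo "missing: $kv"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the pod name is a placeholder):
#   kubectl exec <worker-pod> -n ${NAMESPACE} -- env | missing_env
```

For the EFA variant, the equivalent check would look for SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC in place of the UCX_TLS entry.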
