Nemotron-3-Super

Serve NVIDIA-Nemotron-3-Super-120B-A12B with Dynamo and vLLM, tuned per GPU and workload.

View as Markdown

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Super — NVIDIA’s ~120B hybrid Mamba/Attention/MoE model (~12B active) — with MTP speculative decoding (DL=3) and KV-aware routing; the B200 agentic target measured 1388.4 system output tok/s per GPU on its trace. B200 serves the NVFP4 checkpoint, H200 the FP8 checkpoint. Pick your GPU and workload; every command on this page updates to match.

Choose your deployment target

GPU
Workload
Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4Precision NVFP4 + FP8 KV cacheGPUs 4x B200 per worker, TP4 + EP (2 replicas)All2All DeepEP high-throughputSpec decode MTP, DL=3Workload Chat 8K/1K, 70% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4Precision NVFP4 + FP8 KV cacheGPUs 4x B200 per worker, TP4 + EP (2 replicas)All2All DeepEP low-latencySpec decode MTP, DL=3Workload Agentic 64K/400, 90% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8Precision FP8 + FP8 KV cacheGPUs 4x H200 per worker, TP4 + EP (2 replicas)All2All FlashInfer NVLink one-sidedSpec decode MTP, DL=3Workload Chat 8K/1K, 70% KV reuse
Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8Precision FP8 + FP8 KV cacheGPUs 4x H200 per worker, TP4 + EP (2 replicas)All2All DeepEP high-throughputSpec decode MTP, DL=3Workload Agentic 64K/400, 90% KV reuse

Prerequisites

  • A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with nvidia.com/v1beta1 served) and 8x B200 available (two 4-GPU worker replicas).
  • A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.
  • Namespace labeled for KAI — without kai.scheduler/enabled=true, pods sit SchedulingGated indefinitely because KAI’s pod-grouper filters by namespace label.
  • A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with nvidia.com/v1beta1 served) and 8x H200 available (two 4-GPU worker replicas).
  • A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8.
  • Namespace labeled for KAI — without kai.scheduler/enabled=true, pods sit SchedulingGated indefinitely because KAI’s pod-grouper filters by namespace label.

Create the namespace with the KAI label and the token secret:

$export NAMESPACE=your-namespace
$kubectl create namespace ${NAMESPACE}
$kubectl label namespace ${NAMESPACE} kai.scheduler/enabled=true
$
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="$HF_TOKEN" \
> -n ${NAMESPACE}

Edit namespace, storage class, image tags, and other cluster-specific settings in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint for your SKU:

$# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$# 2. Download. B200 uses the NVFP4 checkpoint (~80 GB).
$kubectl apply -f recipes/nemotron-3-super/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s
$# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}
$
$# 2. Download. H200 uses the FP8 checkpoint (~120 GB).
$kubectl apply -f recipes/nemotron-3-super/model-cache/model-download-fp8.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download-fp8 -n ${NAMESPACE} --timeout=3600s

Downloading both checkpoints lands ~200 GB on the PVC — the default size in model-cache.yaml — so bump storage: first if you want both.

Then deploy. First-time boot per worker takes about 6–9 minutes (image pull + vLLM engine init + Inductor + CUDA graph capture up to size 512):

$kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-chat/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd nemotron-3-super-b200-chat -n ${NAMESPACE} -w
$kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd nemotron-3-super-b200-agentic -n ${NAMESPACE} -w
$kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-chat/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd nemotron-3-super-h200-chat -n ${NAMESPACE} -w
$kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}
$kubectl get dgd nemotron-3-super-h200-agentic -n ${NAMESPACE} -w

Smoke Test

Send a test request to verify the deployment serves traffic:

$kubectl port-forward svc/nemotron-3-super-b200-chat-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/nemotron-3-super-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/nemotron-3-super-h200-chat-frontend 8000:8000 -n ${NAMESPACE}
$kubectl port-forward svc/nemotron-3-super-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}
$MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
$
$curl http://localhost:8000/v1/models
$curl http://localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d "{\"model\":\"${MODEL_ID}\",
> \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
> \"max_tokens\":64,
> \"chat_template_kwargs\":{\"enable_thinking\":false}}"
$MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
$
$curl http://localhost:8000/v1/models
$curl http://localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d "{\"model\":\"${MODEL_ID}\",
> \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
> \"max_tokens\":64,
> \"chat_template_kwargs\":{\"enable_thinking\":false}}"

Benchmark

A single AIPerf trace-replay Job (perf/perf.yaml) covers every target — only ENDPOINT, TRACE_FILE, TARGET_MODEL, and CONCURRENCY change in its env block (TARGET_MODEL is the NVFP4 id for B200, FP8 for H200). First stage the bundled traces from recipes/nemotron-3-super/perf/traces/ onto the model-cache PVC:

$kubectl run pvc-helper -n ${NAMESPACE} \
> --image=busybox:1.36 --restart=Never \
> --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/model-cache"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"model-cache"}}]}}' \
> --command -- sleep 3600
$
$kubectl cp recipes/nemotron-3-super/perf/traces ${NAMESPACE}/pvc-helper:/model-cache/

Set ENDPOINT to nemotron-3-super-b200-chat-frontend:8000 (the Job default) with the NVFP4 TARGET_MODEL and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
> --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://nemotron-3-super-b200-chat-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 128 --random-seed 42 \
> --export-http-trace

Set ENDPOINT to nemotron-3-super-b200-agentic-frontend:8000 with the NVFP4 TARGET_MODEL and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
> --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
> --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://nemotron-3-super-b200-agentic-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 192 --random-seed 42 \
> --export-http-trace

Set ENDPOINT to nemotron-3-super-h200-chat-frontend:8000 with the FP8 TARGET_MODEL and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
> --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
> --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://nemotron-3-super-h200-chat-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 64 --random-seed 42 \
> --export-http-trace

Set ENDPOINT to nemotron-3-super-h200-agentic-frontend:8000 with the FP8 TARGET_MODEL and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$aiperf profile \
> -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
> --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
> --tokenizer-trust-remote-code \
> --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
> --custom-dataset-type mooncake_trace \
> --prompt-input-tokens-block-size 512 \
> --url http://nemotron-3-super-h200-agentic-frontend:8000 \
> --streaming --use-server-token-count \
> --extra-inputs ignore_eos:true \
> --concurrency 128 --random-seed 42 \
> --export-http-trace
$kubectl apply -f recipes/nemotron-3-super/perf/perf.yaml -n ${NAMESPACE}
$kubectl logs -n ${NAMESPACE} -l job-name=nemotron-3-super-bench -f
$kubectl wait --for=condition=Complete job/nemotron-3-super-bench -n ${NAMESPACE} --timeout=7200s

Artifacts land on the PVC under /model-cache/perf/<epoch>_nemotron-3-super-bench/. 15% and 30% trace subsets are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results (Dynamo workers are PodClique pods — kubectl rollout restart deployment is a silent no-op) — see the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape:

WorkloadMedian ISLMedian OSLKV cache hit rate
Chat8K1K70%
Agentic64K40090%

Measured results below replay the 15% trace subsets (*_short_15perc.jsonl) with two worker replicas per deployment; your selected target’s row is highlighted:

RecipeSKUWorker replicasConcurrencyUser output tok/sSystem output tok/s/GPU
Chat (15% subset)B200212861.25844.5
Agentic (15% subset)B200219263.161388.4
Chat (15% subset)H20026456.07404.6
Agentic (15% subset)H200212862.94851.0

Compare All Targets

All four targets run aggregated vLLM 0.21.0 (runtime image vllm-runtime:1.3.0-nemotron-super-dev.1) with two TP4 + EP worker replicas, MTP speculative decoding (DL=3), and KV-aware routing. They differ in checkpoint, kernel backends, and the trace they are benchmarked against:

B200 chatH200 chatB200 agenticH200 agentic
GPUs8x B200 (2x TP4)8x H200 (2x TP4)8x B200 (2x TP4)8x H200 (2x TP4)
PrecisionNVFP4 + FP8 KVFP8 + FP8 KVNVFP4 + FP8 KVFP8 + FP8 KV
MoE backendFLASHINFER_TRTLLMFLASHINFER_CUTLASSFLASHINFER_TRTLLMFLASHINFER_CUTLASS
Attention backendFLASH_ATTNFLASH_ATTNFLASH_ATTNFLASH_ATTN
All2All backendDeepEP high-throughputFlashInfer NVLink one-sidedDeepEP low-latencyDeepEP high-throughput
WorkloadChat traceChat traceAgentic traceAgentic trace

Notes

  • This is a Day-0 recipe on a dedicated dev runtime image (vllm-runtime:1.3.0-nemotron-super-dev.1); it is functional and benchmarked but not yet promoted to a release runtime image.
  • The namespace must carry the kai.scheduler/enabled=true label before deploying; without it, pods stay SchedulingGated indefinitely.
  • B200 chat and agentic ship with MTP spec dec ON by default (DL=3, moe_backend=triton, stripped compilation-config, MAX_NUM_BATCHED_TOKENS=65536). To turn MTP off, remove the - --speculative-config=$(SPECULATIVE_CONFIG) line from worker args; with the freed memory headroom you can bump MAX_NUM_BATCHED_TOKENS to "131072" and switch COMPILATION_CONFIG to the compilation-config-fused ConfigMap key for better throughput.
  • For a fixed-AL synthetic MTP run on H200, point the SPECULATIVE_CONFIG env at the speculative-config-synthetic ConfigMap key before deploying.
  • Known issue: some 400 HTTP errors raised by the workers on invalid inputs surface as 500 through the Dynamo frontend (the proxy does not always preserve the worker’s original status code).
  • Reasoning is controlled per request via chat_template_kwargs (enable_thinking: true|false); tool calling and function calling with JSON arguments are supported.
  • Both DGDs of a given SKU serve the same --served-model-name, so either trace can be replayed against either DGD by swapping TRACE_FILE.

Source