Nemotron-3-Super | NVIDIA Dynamo Documentation

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Super — NVIDIA’s ~120B hybrid Mamba/Attention/MoE model (~12B active) — with MTP speculative decoding (DL=3) and KV-aware routing; the B200 agentic target measured 1388.4 system output tok/s per GPU on its trace. B200 serves the NVFP4 checkpoint, H200 the FP8 checkpoint. Pick your GPU and workload; every command on this page updates to match.

Choose your deployment target

GPUB200 RecommendedH200

WorkloadChatAgentic

Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4Precision NVFP4 + FP8 KV cacheGPUs 4x B200 per worker, TP4 + EP (2 replicas)All2All DeepEP high-throughputSpec decode MTP, DL=3Workload Chat 8K/1K, 70% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4Precision NVFP4 + FP8 KV cacheGPUs 4x B200 per worker, TP4 + EP (2 replicas)All2All DeepEP low-latencySpec decode MTP, DL=3Workload Agentic 64K/400, 90% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8Precision FP8 + FP8 KV cacheGPUs 4x H200 per worker, TP4 + EP (2 replicas)All2All FlashInfer NVLink one-sidedSpec decode MTP, DL=3Workload Chat 8K/1K, 70% KV reuse

Checkpoint nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8Precision FP8 + FP8 KV cacheGPUs 4x H200 per worker, TP4 + EP (2 replicas)All2All DeepEP high-throughputSpec decode MTP, DL=3Workload Agentic 64K/400, 90% KV reuse

Prerequisites

A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with nvidia.com/v1beta1 served) and 8x B200 available (two 4-GPU worker replicas).
A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.
Namespace labeled for KAI — without kai.scheduler/enabled=true, pods sit SchedulingGated indefinitely because KAI’s pod-grouper filters by namespace label.

A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with nvidia.com/v1beta1 served) and 8x H200 available (two 4-GPU worker replicas).
A Hugging Face token with access to nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8.
Namespace labeled for KAI — without kai.scheduler/enabled=true, pods sit SchedulingGated indefinitely because KAI’s pod-grouper filters by namespace label.

Create the namespace with the KAI label and the token secret:

$ export NAMESPACE=your-namespace
$ kubectl create namespace ${NAMESPACE}
$ kubectl label namespace ${NAMESPACE} kai.scheduler/enabled=true
$ 
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="$HF_TOKEN" \
>   -n ${NAMESPACE}

Edit namespace, storage class, image tags, and other cluster-specific settings in the manifests before applying them.

Deploy

Prepare the model cache and download the checkpoint for your SKU:

$ # 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$ kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}
$ 
$ # 2. Download. B200 uses the NVFP4 checkpoint (~80 GB).
$ kubectl apply -f recipes/nemotron-3-super/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s

$ # 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
$ kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}
$ 
$ # 2. Download. H200 uses the FP8 checkpoint (~120 GB).
$ kubectl apply -f recipes/nemotron-3-super/model-cache/model-download-fp8.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download-fp8 -n ${NAMESPACE} --timeout=3600s

Downloading both checkpoints lands ~200 GB on the PVC — the default size in model-cache.yaml — so bump storage: first if you want both.

Then deploy. First-time boot per worker takes about 6–9 minutes (image pull + vLLM engine init + Inductor + CUDA graph capture up to size 512):

$ kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-chat/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd nemotron-3-super-b200-chat -n ${NAMESPACE} -w

$ kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd nemotron-3-super-b200-agentic -n ${NAMESPACE} -w

$ kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-chat/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd nemotron-3-super-h200-chat -n ${NAMESPACE} -w

$ kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}
$ kubectl get dgd nemotron-3-super-h200-agentic -n ${NAMESPACE} -w

Smoke Test

Send a test request to verify the deployment serves traffic:

$ kubectl port-forward svc/nemotron-3-super-b200-chat-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/nemotron-3-super-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/nemotron-3-super-h200-chat-frontend 8000:8000 -n ${NAMESPACE}

$ kubectl port-forward svc/nemotron-3-super-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}

$ MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
$ 
$ curl http://localhost:8000/v1/models
$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d "{\"model\":\"${MODEL_ID}\",
>        \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
>        \"max_tokens\":64,
>        \"chat_template_kwargs\":{\"enable_thinking\":false}}"

$ MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
$ 
$ curl http://localhost:8000/v1/models
$ curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d "{\"model\":\"${MODEL_ID}\",
>        \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
>        \"max_tokens\":64,
>        \"chat_template_kwargs\":{\"enable_thinking\":false}}"

Benchmark

A single AIPerf trace-replay Job (perf/perf.yaml) covers every target — only ENDPOINT, TRACE_FILE, TARGET_MODEL, and CONCURRENCY change in its env block (TARGET_MODEL is the NVFP4 id for B200, FP8 for H200). First stage the bundled traces from recipes/nemotron-3-super/perf/traces/ onto the model-cache PVC:

$ kubectl run pvc-helper -n ${NAMESPACE} \
>   --image=busybox:1.36 --restart=Never \
>   --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/model-cache"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"model-cache"}}]}}' \
>   --command -- sleep 3600
$ 
$ kubectl cp recipes/nemotron-3-super/perf/traces ${NAMESPACE}/pvc-helper:/model-cache/

Set ENDPOINT to nemotron-3-super-b200-chat-frontend:8000 (the Job default) with the NVFP4 TARGET_MODEL and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
>   --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://nemotron-3-super-b200-chat-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 128 --random-seed 42 \
>   --export-http-trace

Set ENDPOINT to nemotron-3-super-b200-agentic-frontend:8000 with the NVFP4 TARGET_MODEL and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
>   --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
>   --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://nemotron-3-super-b200-agentic-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 192 --random-seed 42 \
>   --export-http-trace

Set ENDPOINT to nemotron-3-super-h200-chat-frontend:8000 with the FP8 TARGET_MODEL and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
>   --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
>   --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://nemotron-3-super-h200-chat-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 64 --random-seed 42 \
>   --export-http-trace

Set ENDPOINT to nemotron-3-super-h200-agentic-frontend:8000 with the FP8 TARGET_MODEL and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

$ aiperf profile \
>   -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
>   --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
>   --tokenizer-trust-remote-code \
>   --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --prompt-input-tokens-block-size 512 \
>   --url http://nemotron-3-super-h200-agentic-frontend:8000 \
>   --streaming --use-server-token-count \
>   --extra-inputs ignore_eos:true \
>   --concurrency 128 --random-seed 42 \
>   --export-http-trace

$ kubectl apply -f recipes/nemotron-3-super/perf/perf.yaml -n ${NAMESPACE}
$ kubectl logs -n ${NAMESPACE} -l job-name=nemotron-3-super-bench -f
$ kubectl wait --for=condition=Complete job/nemotron-3-super-bench -n ${NAMESPACE} --timeout=7200s

Artifacts land on the PVC under /model-cache/perf/<epoch>_nemotron-3-super-bench/. 15% and 30% trace subsets are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results (Dynamo workers are PodClique pods — kubectl rollout restart deployment is a silent no-op) — see the benchmark README for the full workflow, artifact layout, and tunable environment variables.

Expected Performance

Each target is tuned for its workload shape:

Workload	Median ISL	Median OSL	KV cache hit rate
Chat	8K	1K	70%
Agentic	64K	400	90%

Measured results below replay the 15% trace subsets (*_short_15perc.jsonl) with two worker replicas per deployment; your selected target’s row is highlighted:

Recipe	SKU	Worker replicas	Concurrency	User output tok/s	System output tok/s/GPU
Chat (15% subset)	B200	2	128	61.25	844.5
Agentic (15% subset)	B200	2	192	63.16	1388.4
Chat (15% subset)	H200	2	64	56.07	404.6
Agentic (15% subset)	H200	2	128	62.94	851.0

Compare All Targets

All four targets run aggregated vLLM 0.23.0 (runtime image vllm-runtime:1.3.0) with two TP4 + EP worker replicas, MTP speculative decoding (DL=3), and KV-aware routing. They differ in checkpoint, kernel backends, and the trace they are benchmarked against:

	B200 chat	H200 chat	B200 agentic	H200 agentic
GPUs	8x B200 (2x TP4)	8x H200 (2x TP4)	8x B200 (2x TP4)	8x H200 (2x TP4)
Precision	NVFP4 + FP8 KV	FP8 + FP8 KV	NVFP4 + FP8 KV	FP8 + FP8 KV
MoE backend	FLASHINFER_TRTLLM	FLASHINFER_CUTLASS	FLASHINFER_TRTLLM	FLASHINFER_CUTLASS
Attention backend	FLASH_ATTN	FLASH_ATTN	FLASH_ATTN	FLASH_ATTN
All2All backend	DeepEP high-throughput	FlashInfer NVLink one-sided	DeepEP low-latency	DeepEP high-throughput
Workload	Chat trace	Chat trace	Agentic trace	Agentic trace

Notes

This recipe runs on the 1.3.0 release runtime image (vllm-runtime:1.3.0); the model-specific vLLM patches it was developed against are now upstream in the shipped vLLM (0.23.0), so no dedicated dev image is required.
The namespace must carry the kai.scheduler/enabled=true label before deploying; without it, pods stay SchedulingGated indefinitely.
B200 chat and agentic ship with MTP spec dec ON by default (DL=3, moe_backend=triton, stripped compilation-config, MAX_NUM_BATCHED_TOKENS=65536). To turn MTP off, remove the - --speculative-config=$(SPECULATIVE_CONFIG) line from worker args; with the freed memory headroom you can bump MAX_NUM_BATCHED_TOKENS to "131072" and switch COMPILATION_CONFIG to the compilation-config-fused ConfigMap key for better throughput.
For a fixed-AL synthetic MTP run on H200, point the SPECULATIVE_CONFIG env at the speculative-config-synthetic ConfigMap key before deploying.
Known issue: some 400 HTTP errors raised by the workers on invalid inputs surface as 500 through the Dynamo frontend (the proxy does not always preserve the worker’s original status code).
Reasoning is controlled per request via chat_template_kwargs (enable_thinking: true|false); tool calling and function calling with JSON arguments are supported.
Both DGDs of a given SKU serve the same --served-model-name, so either trace can be replayed against either DGD by swapping TRACE_FILE.

Source

Source README: recipes/nemotron-3-super/README.md
Benchmark workflow: recipes/nemotron-3-super/perf/README.md and perf.yaml
B200 chat: vllm/agg-b200-chat/deploy.yaml
B200 agentic: vllm/agg-b200-agentic/deploy.yaml
H200 chat: vllm/agg-h200-chat/deploy.yaml
H200 agentic: vllm/agg-h200-agentic/deploy.yaml
Model cache setup: model-cache/ (model-cache.yaml, model-download.yaml, model-download-fp8.yaml)