> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Nemotron-3-Super

Each target below is a validated aggregated vLLM deployment of Nemotron-3-Super — NVIDIA's \~120B hybrid Mamba/Attention/MoE model (\~12B active) — with MTP speculative decoding (DL=3) and KV-aware routing; the B200 agentic target measured 1388.4 system output tok/s per GPU on its trace. B200 serves the NVFP4 checkpoint, H200 the FP8 checkpoint. Pick your GPU and workload; every command on this page updates to match.

<p>
  Choose your deployment target
</p>

GPU

B200 Recommended

<input type="radio" id="recipe-sku-h200" name="recipe-sku" value="h200" />

H200

Workload

Chat

<input type="radio" id="recipe-usecase-agentic" name="recipe-usecase" value="agentic" />

Agentic

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

<b>Precision</b> NVFP4 + FP8 KV cache

<b>GPUs</b> 4x B200 per worker, TP4 + EP (2 replicas)

<b>All2All</b> DeepEP high-throughput

<b>Spec decode</b> MTP, DL=3

<b>Workload</b> Chat 8K/1K, 70% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

<b>Precision</b> NVFP4 + FP8 KV cache

<b>GPUs</b> 4x B200 per worker, TP4 + EP (2 replicas)

<b>All2All</b> DeepEP low-latency

<b>Spec decode</b> MTP, DL=3

<b>Workload</b> Agentic 64K/400, 90% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8

<b>Precision</b> FP8 + FP8 KV cache

<b>GPUs</b> 4x H200 per worker, TP4 + EP (2 replicas)

<b>All2All</b> FlashInfer NVLink one-sided

<b>Spec decode</b> MTP, DL=3

<b>Workload</b> Chat 8K/1K, 70% KV reuse

<b>Checkpoint</b> nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8

<b>Precision</b> FP8 + FP8 KV cache

<b>GPUs</b> 4x H200 per worker, TP4 + EP (2 replicas)

<b>All2All</b> DeepEP high-throughput

<b>Spec decode</b> MTP, DL=3

<b>Workload</b> Agentic 64K/400, 90% KV reuse

## Prerequisites

* A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with `nvidia.com/v1beta1` served) and **8x B200** available (two 4-GPU worker replicas).
* A Hugging Face token with access to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`.
* **Namespace labeled for KAI** — without `kai.scheduler/enabled=true`, pods sit `SchedulingGated` indefinitely because KAI's `pod-grouper` filters by namespace label.

- A Kubernetes cluster with the Dynamo Platform installed (DGD CRDs registered with `nvidia.com/v1beta1` served) and **8x H200** available (two 4-GPU worker replicas).
- A Hugging Face token with access to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8`.
- **Namespace labeled for KAI** — without `kai.scheduler/enabled=true`, pods sit `SchedulingGated` indefinitely because KAI's `pod-grouper` filters by namespace label.

Create the namespace with the KAI label and the token secret:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl label namespace ${NAMESPACE} kai.scheduler/enabled=true

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE}
```

Edit namespace, storage class, image tags, and other cluster-specific settings in the manifests before applying them.

## Deploy

Prepare the model cache and download the checkpoint for your SKU:

```bash
# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}

# 2. Download. B200 uses the NVFP4 checkpoint (~80 GB).
kubectl apply -f recipes/nemotron-3-super/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=1800s
```

```bash
# 1. Storage — edit storageClassName in model-cache.yaml first (kubectl get storageclass).
kubectl apply -f recipes/nemotron-3-super/model-cache/model-cache.yaml -n ${NAMESPACE}

# 2. Download. H200 uses the FP8 checkpoint (~120 GB).
kubectl apply -f recipes/nemotron-3-super/model-cache/model-download-fp8.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download-fp8 -n ${NAMESPACE} --timeout=3600s
```

Downloading both checkpoints lands \~200 GB on the PVC — the default size in `model-cache.yaml` — so bump `storage:` first if you want both.

Then deploy. First-time boot per worker takes about 6–9 minutes (image pull + vLLM engine init + Inductor + CUDA graph capture up to size 512):

```bash
kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-chat/deploy.yaml -n ${NAMESPACE}
kubectl get dgd nemotron-3-super-b200-chat -n ${NAMESPACE} -w
```

```bash
kubectl apply -f recipes/nemotron-3-super/vllm/agg-b200-agentic/deploy.yaml -n ${NAMESPACE}
kubectl get dgd nemotron-3-super-b200-agentic -n ${NAMESPACE} -w
```

```bash
kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-chat/deploy.yaml -n ${NAMESPACE}
kubectl get dgd nemotron-3-super-h200-chat -n ${NAMESPACE} -w
```

```bash
kubectl apply -f recipes/nemotron-3-super/vllm/agg-h200-agentic/deploy.yaml -n ${NAMESPACE}
kubectl get dgd nemotron-3-super-h200-agentic -n ${NAMESPACE} -w
```

## Smoke Test

Send a test request to verify the deployment serves traffic:

```bash
kubectl port-forward svc/nemotron-3-super-b200-chat-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/nemotron-3-super-b200-agentic-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/nemotron-3-super-h200-chat-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
kubectl port-forward svc/nemotron-3-super-h200-agentic-frontend 8000:8000 -n ${NAMESPACE}
```

```bash
MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL_ID}\",
       \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
       \"max_tokens\":64,
       \"chat_template_kwargs\":{\"enable_thinking\":false}}"
```

```bash
MODEL_ID=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8

curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${MODEL_ID}\",
       \"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],
       \"max_tokens\":64,
       \"chat_template_kwargs\":{\"enable_thinking\":false}}"
```

## Benchmark

A single AIPerf trace-replay Job ([`perf/perf.yaml`](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/perf/perf.yaml)) covers every target — only `ENDPOINT`, `TRACE_FILE`, `TARGET_MODEL`, and `CONCURRENCY` change in its env block (`TARGET_MODEL` is the NVFP4 id for B200, FP8 for H200). First stage the bundled traces from [`recipes/nemotron-3-super/perf/traces/`](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-super/perf/traces) onto the `model-cache` PVC:

```bash
kubectl run pvc-helper -n ${NAMESPACE} \
  --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"helper","image":"busybox:1.36","command":["sleep","3600"],"volumeMounts":[{"name":"model-cache","mountPath":"/model-cache"}]}],"volumes":[{"name":"model-cache","persistentVolumeClaim":{"claimName":"model-cache"}}]}}' \
  --command -- sleep 3600

kubectl cp recipes/nemotron-3-super/perf/traces ${NAMESPACE}/pvc-helper:/model-cache/
```

Set `ENDPOINT` to `nemotron-3-super-b200-chat-frontend:8000` (the Job default) with the NVFP4 `TARGET_MODEL` and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://nemotron-3-super-b200-chat-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 128 --random-seed 42 \
  --export-http-trace
```

Set `ENDPOINT` to `nemotron-3-super-b200-agentic-frontend:8000` with the NVFP4 `TARGET_MODEL` and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://nemotron-3-super-b200-agentic-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 192 --random-seed 42 \
  --export-http-trace
```

Set `ENDPOINT` to `nemotron-3-super-h200-chat-frontend:8000` with the FP8 `TARGET_MODEL` and the chat trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/8k_1k_70kv_chat_new_noschedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://nemotron-3-super-h200-chat-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 64 --random-seed 42 \
  --export-http-trace
```

Set `ENDPOINT` to `nemotron-3-super-h200-agentic-frontend:8000` with the FP8 `TARGET_MODEL` and the agentic trace, then apply. The Job wraps this AIPerf Mooncake trace replay:

```bash
aiperf profile \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tokenizer nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tokenizer-trust-remote-code \
  --input-file /model-cache/traces/64k_400_90kv_agent_new_noschedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --prompt-input-tokens-block-size 512 \
  --url http://nemotron-3-super-h200-agentic-frontend:8000 \
  --streaming --use-server-token-count \
  --extra-inputs ignore_eos:true \
  --concurrency 128 --random-seed 42 \
  --export-http-trace
```

```bash
kubectl apply -f recipes/nemotron-3-super/perf/perf.yaml -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l job-name=nemotron-3-super-bench -f
kubectl wait --for=condition=Complete job/nemotron-3-super-bench -n ${NAMESPACE} --timeout=7200s
```

Artifacts land on the PVC under `/model-cache/perf/<epoch>_nemotron-3-super-bench/`. 15% and 30% trace subsets are provided for shorter runs. For concurrency sweeps, delete the worker pods between runs so residual KV/prefix-cache state does not skew results (Dynamo workers are PodClique pods — `kubectl rollout restart deployment` is a silent no-op) — see the [benchmark README](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/perf/README.md) for the full workflow, artifact layout, and tunable environment variables.

## Expected Performance

Each target is tuned for its workload shape:

| Workload | Median ISL | Median OSL | KV cache hit rate |
| -------- | ---------: | ---------: | ----------------: |
| Chat     |         8K |         1K |               70% |
| Agentic  |        64K |        400 |               90% |

Measured results below replay the **15% trace subsets** (`*_short_15perc.jsonl`) with two worker replicas per deployment; your selected target's row is highlighted:

<table>
  <thead>
    <tr><th>Recipe</th><th>SKU</th><th>Worker replicas</th><th>Concurrency</th><th>User output tok/s</th><th>System output tok/s/GPU</th></tr>
  </thead>

  <tbody>
    <tr data-sku="b200" data-usecase="chat">
      <td>Chat (15% subset)</td>

      <td>B200</td>

      <td>2</td>

      <td>128</td>

      <td>61.25</td>

      <td>844.5</td>
    </tr>

    <tr data-sku="b200" data-usecase="agentic">
      <td>Agentic (15% subset)</td>

      <td>B200</td>

      <td>2</td>

      <td>192</td>

      <td>63.16</td>

      <td>1388.4</td>
    </tr>

    <tr data-sku="h200" data-usecase="chat">
      <td>Chat (15% subset)</td>

      <td>H200</td>

      <td>2</td>

      <td>64</td>

      <td>56.07</td>

      <td>404.6</td>
    </tr>

    <tr data-sku="h200" data-usecase="agentic">
      <td>Agentic (15% subset)</td>

      <td>H200</td>

      <td>2</td>

      <td>128</td>

      <td>62.94</td>

      <td>851.0</td>
    </tr>
  </tbody>
</table>

## Compare All Targets

All four targets run aggregated vLLM 0.21.0 (runtime image `vllm-runtime:1.3.0-nemotron-super-dev.1`) with two TP4 + EP worker replicas, MTP speculative decoding (DL=3), and KV-aware routing. They differ in checkpoint, kernel backends, and the trace they are benchmarked against:

|                       | B200 chat              | H200 chat                   | B200 agentic       | H200 agentic           |
| --------------------- | ---------------------- | --------------------------- | ------------------ | ---------------------- |
| **GPUs**              | 8x B200 (2x TP4)       | 8x H200 (2x TP4)            | 8x B200 (2x TP4)   | 8x H200 (2x TP4)       |
| **Precision**         | NVFP4 + FP8 KV         | FP8 + FP8 KV                | NVFP4 + FP8 KV     | FP8 + FP8 KV           |
| **MoE backend**       | FLASHINFER\_TRTLLM     | FLASHINFER\_CUTLASS         | FLASHINFER\_TRTLLM | FLASHINFER\_CUTLASS    |
| **Attention backend** | FLASH\_ATTN            | FLASH\_ATTN                 | FLASH\_ATTN        | FLASH\_ATTN            |
| **All2All backend**   | DeepEP high-throughput | FlashInfer NVLink one-sided | DeepEP low-latency | DeepEP high-throughput |
| **Workload**          | Chat trace             | Chat trace                  | Agentic trace      | Agentic trace          |

## Notes

* This is a Day-0 recipe on a dedicated dev runtime image (`vllm-runtime:1.3.0-nemotron-super-dev.1`); it is functional and benchmarked but not yet promoted to a release runtime image.
* The namespace must carry the `kai.scheduler/enabled=true` label before deploying; without it, pods stay `SchedulingGated` indefinitely.
* B200 chat and agentic ship with MTP spec dec ON by default (DL=3, `moe_backend=triton`, stripped `compilation-config`, `MAX_NUM_BATCHED_TOKENS=65536`). To turn MTP off, remove the `- --speculative-config=$(SPECULATIVE_CONFIG)` line from worker args; with the freed memory headroom you can bump `MAX_NUM_BATCHED_TOKENS` to `"131072"` and switch `COMPILATION_CONFIG` to the `compilation-config-fused` ConfigMap key for better throughput.
* For a fixed-AL synthetic MTP run on H200, point the `SPECULATIVE_CONFIG` env at the `speculative-config-synthetic` ConfigMap key before deploying.
* Known issue: some 400 HTTP errors raised by the workers on invalid inputs surface as 500 through the Dynamo frontend (the proxy does not always preserve the worker's original status code).
* Reasoning is controlled per request via `chat_template_kwargs` (`enable_thinking: true|false`); tool calling and function calling with JSON arguments are supported.
* Both DGDs of a given SKU serve the same `--served-model-name`, so either trace can be replayed against either DGD by swapping `TRACE_FILE`.

## Source

* Source README: [recipes/nemotron-3-super/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/README.md)
* Benchmark workflow: [recipes/nemotron-3-super/perf/README.md](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/perf/README.md) and [perf.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/perf/perf.yaml)
* B200 chat: [vllm/agg-b200-chat/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/vllm/agg-b200-chat/deploy.yaml)
* B200 agentic: [vllm/agg-b200-agentic/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/vllm/agg-b200-agentic/deploy.yaml)
* H200 chat: [vllm/agg-h200-chat/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/vllm/agg-h200-chat/deploy.yaml)
* H200 agentic: [vllm/agg-h200-agentic/deploy.yaml](https://github.com/ai-dynamo/dynamo/blob/main/recipes/nemotron-3-super/vllm/agg-h200-agentic/deploy.yaml)
* Model cache setup: [model-cache/](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-super/model-cache) (`model-cache.yaml`, `model-download.yaml`, `model-download-fp8.yaml`)
Recipe	SKU	Worker replicas	Concurrency	User output tok/s	System output tok/s/GPU
Chat (15% subset)	B200	2	128	61.25	844.5
Agentic (15% subset)	B200	2	192	63.16	1388.4
Chat (15% subset)	H200	2	64	56.07	404.6
Agentic (15% subset)	H200	2	128	62.94	851.0