GPT-OSS-120B
GPT-OSS-120B
Serve openai/gpt-oss-120b with Dynamo and TensorRT-LLM on Blackwell.
Two validated TensorRT-LLM targets cover the two traffic shapes this model is most deployed for: aggregated expert-parallel (EP4, attention-DP) serving for short-prompt, high-concurrency traffic, and a prefill/decode split for long-context generation. They are deployment targets for different workloads, not an agg-vs-disagg comparison. Pick your target; every command on this page updates to match.
Choose your deployment target
Prerequisites
- A Kubernetes cluster with the Dynamo platform installed and 4x GB200 available on ARM64 nodes — the aggregated target will not run on x86 Hopper/Ampere hardware.
- A Hugging Face token with access to
openai/gpt-oss-120b.
- A Kubernetes cluster with the Dynamo platform installed and 5x GB200 or B200 available (1 prefill + 4 decode GPUs).
- A Hugging Face token with access to
openai/gpt-oss-120b.
Create the namespace and token secret:
Update storageClassName in model-cache/model-cache.yaml and the container image tag in deploy.yaml to match your Dynamo release before deploying. Also edit namespace, node selectors, and cluster-specific placement.
Deploy
Prepare the model cache (shared by both targets):
Then deploy:
Model loading takes roughly 15-30 minutes depending on storage speed:
Smoke Test
Send a test request to verify the deployment serves traffic:
Benchmark
Each target ships a perf.yaml Kubernetes Job that waits for the model to come up, then runs AIPerf with the target’s traffic shape and a request count of 10x total concurrency.
Aggregated traffic shape: ISL 128 / OSL 1000 at 900 per GPU x 4 GPUs = 3,600 total concurrency. The Job wraps this AIPerf run:
Disaggregated traffic shape: ISL 8192 / OSL 1024 at 1,536 total concurrency. The Job wraps this AIPerf run:
Compare All Targets
Notes
- The aggregated target requires ARM64 (GB200) nodes; the disaggregated target accepts GB200 or B200.
- Do not read the two targets as an aggregated-vs-disaggregated benchmark; their traffic shapes differ by design.
- The disaggregated deployment uses 5 GPUs (1x TP1 prefill + 1x TP4 decode), while its
perf.yamlcomputes total concurrency from a 6-GPU count (256 x 6 = 1,536); adjustDEPLOYMENT_GPU_COUNTif you want strict per-GPU normalization. - Disaggregated engine configs differ per role: prefill runs TP1 with
max_batch_size=64and the overlap scheduler disabled; decode runs TP4 withmax_batch_size=1280and the overlap scheduler enabled. KV transfer uses the UCX-based cache transceiver (max_tokens_in_buffer=9216). - The disaggregated target uses
W4A8_MXFP4_MXFP8quantization via theOVERRIDE_QUANT_ALGOenvironment variable.
Source
- Source README: recipes/gpt-oss-120b/README.md
- Disaggregated README: recipes/gpt-oss-120b/trtllm/disagg/README.md
- Aggregated: deploy.yaml and perf.yaml
- Disaggregated: deploy.yaml and perf.yaml