Llama-3.3-70B Topology Benchmark | NVIDIA Dynamo Documentation

Three vLLM topologies (aggregated, single-node disaggregated, and multi-node disaggregated) intentionally use different GPU counts: 4, 8, and 16x H100/H200. Concurrency is therefore scaled at 16 per GPU, and results should be read as total throughput and TPS/GPU together. More GPUs trivially raise total throughput, so TPS/GPU is the apples-to-apples lens. All three topologies are also deployable recipe targets, so this benchmark doubles as a sizing guide.

Benchmark setup

Model	RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
GPUs	4 / 8 / 16x H100/H200 (varies by configuration)
Runtime	vLLM
Workload	Synthetic 8192 ISL / 1024 OSL, 16 concurrency per GPU, request count = 10x concurrency
Metrics	Output TPS and TPS/GPU, plus TTFT and ITL
Held constant	Model, vLLM runtime, H100/H200 hardware family, ISL=8192, OSL=1024 (stddev 0, forced via min/max tokens), and 16 concurrency per GPU

Compared Configurations

Role	Configuration	Deploy	Benchmark
Baseline	vLLM aggregated: 4x H100/H200, single node, TP4, concurrency 64	deploy.yaml	perf.yaml
Comparison	vLLM disaggregated single-node: 8x H100/H200, P/D separation on one node, concurrency 128	deploy.yaml	perf.yaml
Comparison	vLLM disaggregated multi-node: 16x H100/H200, 2 nodes x 8 GPUs, concurrency 256	deploy.yaml	perf.yaml
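
The concurrency column follows directly from the workload rule in the setup table (16 per GPU, request count = 10x concurrency). As a minimal sketch of that arithmetic, the following shell helpers derive the load parameters for each GPU count. The function and variable names are hypothetical and are not part of the recipe; each configuration's checked-in perf.yaml remains the authoritative source for the real values.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the recipe): derive load parameters
# from the GPU count, using the constants stated in the benchmark setup.
CONCURRENCY_PER_GPU=16   # "16 concurrency per GPU"
REQUESTS_PER_SLOT=10     # "request count = 10x concurrency"

# Total concurrency for a given GPU count ($1).
total_concurrency() {
  echo $(( $1 * CONCURRENCY_PER_GPU ))
}

# Total request count for a given GPU count ($1).
request_count() {
  echo $(( $(total_concurrency "$1") * REQUESTS_PER_SLOT ))
}

# Print the parameters for the three compared configurations.
for gpus in 4 8 16; do
  echo "gpus=${gpus} concurrency=$(total_concurrency "$gpus") requests=$(request_count "$gpus")"
done
```

For 4, 8, and 16 GPUs this yields concurrency 64, 128, and 256, matching the table above.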

Reproduce

Each configuration's perf.yaml computes total concurrency as 16 x GPU count and wraps an AIPerf run like the following. The checked-in perf.yaml is authoritative: it also sets --random-seed, ignore_eos, the tokenizer, and dataset-entry flags.

$ aiperf profile --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
>   --endpoint-type chat --endpoint /v1/chat/completions \
>   --url http://<frontend>:8000 --streaming \
>   --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
>   --output-tokens-mean 1024 --output-tokens-stddev 0 \
>   --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 \
>   --concurrency <16*gpu_count> --request-count <10*concurrency> \
>   --warmup-request-count <concurrency>

The frontend services are llama3-70b-agg-frontend, llama3-70b-disagg-sn-frontend, and llama3-70b-disagg-mn-frontend. Deploy one configuration at a time:

$ export NAMESPACE=your-namespace

$ # One-time prep: storage + model download (update storageClassName in model-cache.yaml first)
$ kubectl apply -f recipes/llama-3-70b/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

$ # Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
$ kubectl apply -f recipes/llama-3-70b/vllm/<configuration>/deploy.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/llama-3-70b/vllm/<configuration>/perf.yaml -n ${NAMESPACE}

Notes

The source does not publish result numbers; run all three configurations on your hardware and compare total output TPS alongside TPS/GPU, since GPU counts differ per configuration.
The model uses FP8 dynamic quantization applied at runtime; the download takes roughly 15-30 minutes.
The agg and disagg-single-node configurations also ship optional GAIE (Gateway API Inference Extension) manifests under their gaie/ subfolders.
Source: recipes/llama-3-70b
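
Because the first note asks for total output TPS to be read alongside TPS/GPU, a small normalization helper can be handy once your own runs finish. This is a sketch only: the function name is hypothetical, and the input in the usage line is a placeholder to show the calling convention, not a measured result.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the recipe): normalize a measured
# total output TPS by the number of GPUs in that configuration.
#   $1 = total output tokens per second reported for the run
#   $2 = GPU count of the configuration (4, 8, or 16 in this benchmark)
tps_per_gpu() {
  awk -v tps="$1" -v gpus="$2" 'BEGIN { printf "%.1f\n", tps / gpus }'
}

# Placeholder input only; substitute the output TPS from your own run.
tps_per_gpu 2000 4
```

Run it once per configuration with that configuration's GPU count, then compare the per-GPU figures across the three topologies.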

Related Recipe

All three configurations are deployable targets on the Llama-3.3-70B recipe page; none is a benchmark-only control.
