Llama-3.3-70B Topology Benchmark

How do aggregated, single-node disaggregated, and multi-node disaggregated vLLM topologies compare when normalized by GPU?

The three vLLM topologies (aggregated, single-node disaggregated, and multi-node disaggregated) intentionally use different GPU counts: 4, 8, and 16 H100/H200 GPUs. Concurrency is therefore scaled at 16 per GPU, and results should be read as total output throughput together with TPS/GPU. More GPUs trivially raise total throughput, so TPS/GPU is the like-for-like comparison. All three topologies are also deployable recipe targets, so this benchmark doubles as a sizing guide.

Benchmark setup

Model: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
GPUs: 4 / 8 / 16x H100/H200 (varies by configuration)
Runtime: vLLM
Workload: Synthetic 8192 ISL / 1024 OSL, 16 concurrency per GPU, request count = 10x concurrency
Metrics: Output TPS and TPS/GPU, plus TTFT and ITL
Held constant: Model, vLLM runtime, H100/H200 hardware family, ISL=8192, OSL=1024 (stddev 0, forced via min/max tokens), and 16 concurrency per GPU

Compared configurations

Role | Configuration | Deploy | Benchmark
Baseline | vLLM aggregated: 4x H100/H200, single node, TP4, concurrency 64 | deploy.yaml | perf.yaml
Comparison | vLLM disaggregated single-node: 8x H100/H200, P/D separation on one node, concurrency 128 | deploy.yaml | perf.yaml
Comparison | vLLM disaggregated multi-node: 16x H100/H200, 2 nodes x 8 GPUs, concurrency 256 | deploy.yaml | perf.yaml

Reproduce

Each configuration's perf.yaml computes total concurrency as 16 x GPU count and wraps an AIPerf run like the one below. The checked-in perf.yaml is authoritative: it also sets --random-seed, ignore_eos, the tokenizer, and dataset-entry flags.

$ aiperf profile --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
    --endpoint-type chat --endpoint /v1/chat/completions \
    --url http://<frontend>:8000 --streaming \
    --synthetic-input-tokens-mean 8192 --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 1024 --output-tokens-stddev 0 \
    --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 \
    --concurrency <16*gpu_count> --request-count <10*concurrency> \
    --warmup-request-count <concurrency>
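
The placeholder values in the command above follow directly from the GPU count. As a minimal sketch (the function and variable names are hypothetical and are not taken from the checked-in perf.yaml), the arithmetic for the three configurations looks like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the AIPerf load parameters from a GPU count.
# The 16-per-GPU and 10x multipliers come from the benchmark setup;
# all names here are illustrative and not part of the recipe.
set -euo pipefail

CONCURRENCY_PER_GPU=16
REQUESTS_PER_CONCURRENCY=10

load_params() {
  local gpu_count="$1"
  local concurrency=$(( gpu_count * CONCURRENCY_PER_GPU ))
  local request_count=$(( concurrency * REQUESTS_PER_CONCURRENCY ))
  # The warmup request count equals the concurrency in the command above.
  echo "${concurrency} ${request_count} ${concurrency}"
}

# Print the parameters for the 4-, 8-, and 16-GPU configurations.
for gpus in 4 8 16; do
  read -r concurrency requests warmup <<< "$(load_params "$gpus")"
  echo "gpus=${gpus} concurrency=${concurrency} request-count=${requests} warmup-request-count=${warmup}"
done
```

For 4, 8, and 16 GPUs this gives concurrency 64, 128, and 256, matching the compared configurations, with request counts of 640, 1280, and 2560.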

The frontend services are llama3-70b-agg-frontend, llama3-70b-disagg-sn-frontend, and llama3-70b-disagg-mn-frontend. Deploy one configuration at a time:

$ export NAMESPACE=your-namespace

# One-time prep: storage + model download (update storageClassName in model-cache.yaml first)
$ kubectl apply -f recipes/llama-3-70b/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s

# Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
$ kubectl apply -f recipes/llama-3-70b/vllm/<configuration>/deploy.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/llama-3-70b/vllm/<configuration>/perf.yaml -n ${NAMESPACE}

Notes

  • The source does not publish result numbers; run all three configurations on your hardware and compare total output TPS alongside TPS/GPU, since GPU counts differ per configuration.
  • The model uses FP8 dynamic quantization applied at runtime; the download takes roughly 15-30 minutes.
  • The agg and disagg-single-node configurations also ship optional GAIE (Gateway API Inference Extension) manifests under their gaie/ subfolders.
  • Source: recipes/llama-3-70b
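
Because the GPU counts differ, each measured total output TPS has to be divided by that configuration's GPU count before comparing. A minimal sketch of that normalization follows; the function name and the MEASURED_* variables are hypothetical, and you would fill them with the output TPS from your own AIPerf runs:

```shell
#!/usr/bin/env bash
# Hypothetical helper: normalize a measured total output TPS by GPU count.
# Not part of the recipe; inputs are whatever your own runs report.
set -euo pipefail

tps_per_gpu() {
  local total_tps="$1" gpu_count="$2"
  # awk handles the floating-point division that bash arithmetic lacks.
  awk -v t="$total_tps" -v g="$gpu_count" 'BEGIN {
    if (g <= 0) { print "gpu_count must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", t / g
  }'
}

# Usage (values are placeholders for your own measurements):
# tps_per_gpu "$MEASURED_TPS_AGG" 4
# tps_per_gpu "$MEASURED_TPS_DISAGG_SN" 8
# tps_per_gpu "$MEASURED_TPS_DISAGG_MN" 16
```

Reading the three normalized values next to the raw totals shows whether a larger topology is actually more efficient per GPU or only faster in aggregate.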

All three configurations are deployable targets on the Llama-3.3-70B recipe page; none is a benchmark-only control.