# Profiler


The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.

## Feature Matrix

| Feature                     | vLLM | SGLang | TensorRT-LLM |
|-----------------------------|------|--------|--------------|
| Dense Model Profiling       | ✅   | ✅     | ✅           |
| MoE Model Profiling         | 🚧   | ✅     | 🚧           |
| AI Configurator (Offline)   | ❌   | ❌     | ✅           |
| Online Profiling (AIPerf)   | ✅   | ✅     | ✅           |
| Interactive WebUI           | ✅   | ✅     | ✅           |
| Runtime Profiling Endpoints | ✅   | ✅     | ✅           |

## Quick Start

### Prerequisites

- Dynamo platform installed (see the Installation Guide)
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required by the SLA Planner; see the quick check below)

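You can sanity-check the cluster-side prerequisites before submitting a profiling request. A minimal sketch; the `monitoring` namespace is an assumption and depends on how kube-prometheus-stack was installed:

```bash
# Confirm at least one node advertises NVIDIA GPU resources
kubectl describe nodes | grep "nvidia.com/gpu"

# Confirm kube-prometheus-stack is running (the "monitoring" namespace is an assumption)
kubectl get pods -n monitoring
```
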
### Using a DGDR (Recommended)

The recommended way to profile models is through DynamoGraphDeploymentRequests (DGDRs), which automate the entire profiling and deployment workflow:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000    # Average input sequence length
        osl: 150     # Average output sequence length
        ttft: 200.0  # Target Time To First Token (ms)
        itl: 20.0    # Target Inter-Token Latency (ms)

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"

  autoApply: true
```
```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
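
Once applied, the operator runs profiling and, because `autoApply: true` is set, deploys the resulting configuration. You can watch the request itself for progress; the resource name below matches the example manifest above:

```bash
# Watch the profiling request move through its phases
kubectl get dynamographdeploymentrequest my-model-profiling -n $NAMESPACE -w

# Inspect status details and events if profiling stalls
kubectl describe dynamographdeploymentrequest my-model-profiling -n $NAMESPACE
```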

### Using AI Configurator (Fast Offline Profiling)

For TensorRT-LLM, use AI Configurator to estimate a configuration offline in roughly 30 seconds, with no GPUs required:

```yaml
profilingConfig:
  config:
    sweep:
      useAiConfigurator: true
      aicSystem: h200_sxm
      aicHfId: Qwen/Qwen3-32B
      aicBackendVersion: "0.20.0"
```
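
These `sweep` fields sit under `spec.profilingConfig.config` alongside the `sla` block from the Quick Start, so a complete AI Configurator request (in a DGDR whose backend is TensorRT-LLM) combines both. A sketch, with the SLA targets carried over from the example above:

```yaml
profilingConfig:
  config:
    sla:
      isl: 3000
      osl: 150
      ttft: 200.0
      itl: 20.0
    sweep:
      useAiConfigurator: true
      aicSystem: h200_sxm
      aicHfId: Qwen/Qwen3-32B
      aicBackendVersion: "0.20.0"
```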

### Direct Script Usage (Advanced)

For advanced scenarios, run the profiler directly:

```bash
python -m benchmarks.profiler.profile_sla \
  --backend vllm \
  --config path/to/disagg.yaml \
  --model meta-llama/Llama-3-8B \
  --ttft 200 --itl 15 \
  --isl 3000 --osl 150
```

## Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `sla.isl` | - | Average input sequence length (tokens) |
| `sla.osl` | - | Average output sequence length (tokens) |
| `sla.ttft` | - | Target Time To First Token (milliseconds) |
| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
| `hardware.maxNumGpusPerEngine` | `8` | Maximum GPUs per engine |
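
For example, to restrict the parallelism sweep on a small cluster, set the `hardware` limits alongside the SLA targets. A sketch, assuming the `hardware` block nests under `profilingConfig.config` in the same way as `sla` and `sweep`:

```yaml
profilingConfig:
  config:
    sla:
      isl: 3000
      osl: 150
      ttft: 200.0
      itl: 20.0
    hardware:
      minNumGpusPerEngine: 1
      maxNumGpusPerEngine: 4
```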

## Profiling Methods

| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |

## Output

The profiler generates:

1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings

Example recommendations:

```text
Suggested prefill TP: 4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP: 4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
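
With `autoApply: true`, the generated DGD is applied to the cluster automatically; otherwise you can apply the generated manifest yourself. Either way, you can inspect the resulting deployment with kubectl, assuming the generated DGD is a DynamoGraphDeployment resource in the same namespace:

```bash
# List DynamoGraphDeployments produced by the profiling run
kubectl get dynamographdeployments -n $NAMESPACE
```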

## Next Steps

| Document | Description |
|----------|-------------|
| Profiler Guide | Configuration, methods, and troubleshooting |
| Profiler Examples | Complete DGDR YAMLs, WebUI, script examples |
| SLA Planner Guide | End-to-end deployment workflow |
| SLA Planner Architecture | How the Planner uses profiling data |