Profiler Examples


Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.

DGDR Examples

Dense Model: AIPerf on Real Engines

Standard online profiling with real GPU measurements:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0

      hardware:
        minNumGpusPerEngine: 1
        maxNumGpusPerEngine: 8

      sweep:
        useAiConfigurator: false

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"

  autoApply: true
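
Once the manifest is saved, it can be submitted and monitored with standard kubectl commands; with autoApply: true the profiled configuration is applied automatically when the sweep finishes. The filename and the lowercase resource name below are assumptions about your setup:

$# Submit the profiling request
$kubectl apply -f vllm-dense-online.yaml -n $NAMESPACE
$
$# Watch the request while profiling runs
$kubectl get dynamographdeploymentrequest vllm-dense-online -n $NAMESPACE -w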

Dense Model: AI Configurator Simulation

Fast offline profiling (~30 seconds, TensorRT-LLM only):

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0

      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"

  autoApply: true
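
Because the sweep is simulated offline, results arrive quickly; with autoApply: true the operator applies the selected configuration as a DynamoGraphDeployment, which can be listed afterwards (the lowercase resource name is assumed to follow the CRD kind):

$kubectl get dynamographdeployments -n $NAMESPACE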

MoE Model

Multi-node MoE profiling with SGLang:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0

      hardware:
        numGpusPerNode: 8
        maxNumGpusPerEngine: 32

      engine:
        isMoeModel: true

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true

Using Existing DGD Config (ConfigMap)

Reference a custom DGD configuration via ConfigMap:

$# Create ConfigMap from your DGD config file
$kubectl create configmap deepseek-r1-config \
> --from-file=/path/to/your/disagg.yaml \
> --namespace $NAMESPACE \
> --dry-run=client -o yaml | kubectl apply -f -

Then reference the ConfigMap from the DGDR:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true
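
An optional sanity check confirms that the ConfigMap actually carries the referenced key before the DGDR is applied:

$# Print the disagg.yaml entry stored in the ConfigMap
$kubectl get configmap deepseek-r1-config -n $NAMESPACE \
>   -o jsonpath='{.data.disagg\.yaml}' | head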

Interactive WebUI

Launch an interactive configuration selection interface:

$python -m benchmarks.profiler.profile_sla \
> --backend trtllm \
> --config path/to/disagg.yaml \
> --pick-with-webui \
> --use-ai-configurator \
> --model Qwen/Qwen3-32B-FP8 \
> --aic-system h200_sxm \
> --ttft 200 --itl 15

The WebUI launches on port 8000 by default (configurable with --webui-port).

Features

  • Interactive Charts: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
  • Pareto-Optimal Analysis: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
  • DGD Config Preview: Click “Show Config” on any row to view the corresponding DynamoGraphDeployment YAML
  • GPU Cost Estimation: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
  • SLA Visualization: Red dashed lines indicate your TTFT and ITL targets

Selection Methods

  1. GPU Hours Table (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
  2. Individual Selection: Click one row in the Prefill table AND one row in the Decode table to manually choose each

Example DGD Config Output

When you click “Show Config”, you see a DynamoGraphDeployment configuration:

# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4

Once you select a configuration, the full DGD custom resource is saved as config_with_planner.yaml.
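
The saved file is a regular Kubernetes manifest, so deploying it is a plain kubectl apply (the namespace flag is an assumption about your environment):

$kubectl apply -f config_with_planner.yaml -n $NAMESPACE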

Direct Script Examples

Basic Profiling

$python -m benchmarks.profiler.profile_sla \
> --backend vllm \
> --config path/to/disagg.yaml \
> --model meta-llama/Llama-3-8B \
> --ttft 200 --itl 15 \
> --isl 3000 --osl 150

With GPU Constraints

$python -m benchmarks.profiler.profile_sla \
> --backend sglang \
> --config examples/backends/sglang/deploy/disagg.yaml \
> --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
> --ttft 200 --itl 15 \
> --isl 3000 --osl 150 \
> --min-num-gpus 2 \
> --max-num-gpus 8

AI Configurator (Offline)

$python -m benchmarks.profiler.profile_sla \
> --backend trtllm \
> --config path/to/disagg.yaml \
> --use-ai-configurator \
> --model Qwen/Qwen3-32B-FP8 \
> --aic-system h200_sxm \
> --ttft 200 --itl 15 \
> --isl 4000 --osl 500

SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

$# Start profiling
$curl -X POST http://localhost:9090/engine/start_profile \
> -H "Content-Type: application/json" \
> -d '{"output_dir": "/tmp/profiler_output"}'
$
$# Run inference requests to generate profiling data...
$
$# Stop profiling
$curl -X POST http://localhost:9090/engine/stop_profile

A test script is provided at examples/backends/sglang/test_sglang_profile.py:

$python examples/backends/sglang/test_sglang_profile.py

View traces using Chrome’s chrome://tracing, Perfetto UI, or TensorBoard.
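
For TensorBoard in particular, the traces written to the output directory can be loaded with the commands below, assuming the torch-tb-profiler plugin matches the trace format SGLang emits:

$pip install torch-tb-profiler
$tensorboard --logdir /tmp/profiler_output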