Profiler Examples


Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.

DGDR Examples

Dense Model: AIPerf on Real Engines

Standard online profiling with real GPU measurements:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0

      hardware:
        minNumGpusPerEngine: 1
        maxNumGpusPerEngine: 8

      sweep:
        useAiConfigurator: false

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"

  autoApply: true
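
Once the manifest is saved, it can be submitted and monitored with standard kubectl commands; with autoApply: true the profiled configuration is applied automatically when the sweep finishes. The filename and the lowercase resource name below are assumptions about your setup:

$# Submit the profiling request
$kubectl apply -f vllm-dense-online.yaml -n $NAMESPACE
$
$# Watch the request while profiling runs
$kubectl get dynamographdeploymentrequest vllm-dense-online -n $NAMESPACE -w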

Dense Model: AI Configurator Simulation

Fast offline profiling (~30 seconds, TensorRT-LLM only):

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0

      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"

  autoApply: true
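
Because the sweep is simulated offline, results arrive quickly; with autoApply: true the operator applies the selected configuration as a DynamoGraphDeployment, which can be listed afterwards (the lowercase resource name is assumed to follow the CRD kind):

$kubectl get dynamographdeployments -n $NAMESPACE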

MoE Model

Multi-node MoE profiling with SGLang:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0

      hardware:
        numGpusPerNode: 8
        maxNumGpusPerEngine: 32

      engine:
        isMoeModel: true

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true

Using Existing DGD Config (ConfigMap)

Reference a custom DGD configuration via ConfigMap:

$# Create ConfigMap from your DGD config file
$kubectl create configmap deepseek-r1-config \
> --from-file=/path/to/your/disagg.yaml \
> --namespace $NAMESPACE \
> --dry-run=client -o yaml | kubectl apply -f -

Then reference the ConfigMap from the DGDR:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"

  autoApply: true
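
An optional sanity check confirms that the ConfigMap actually carries the referenced key before the DGDR is applied:

$# Print the disagg.yaml entry stored in the ConfigMap
$kubectl get configmap deepseek-r1-config -n $NAMESPACE \
>   -o jsonpath='{.data.disagg\.yaml}' | head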

Interactive WebUI

Launch an interactive configuration selection interface:

$python -m benchmarks.profiler.profile_sla \
> --backend trtllm \
> --config path/to/disagg.yaml \
> --pick-with-webui \
> --use-ai-configurator \
> --model Qwen/Qwen3-32B-FP8 \
> --aic-system h200_sxm \
> --ttft 200 --itl 15

The WebUI launches on port 8000 by default (configurable with --webui-port).

Features

  • Interactive Charts: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
  • Pareto-Optimal Analysis: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
  • DGD Config Preview: Click “Show Config” on any row to view the corresponding DynamoGraphDeployment YAML
  • GPU Cost Estimation: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
  • SLA Visualization: Red dashed lines indicate your TTFT and ITL targets

Selection Methods

  1. GPU Hours Table (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
  2. Individual Selection: Click one row in the Prefill table AND one row in the Decode table to manually choose each

Example DGD Config Output

When you click “Show Config”, you see a DynamoGraphDeployment configuration:

# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4

Once you select a configuration, the full DGD custom resource is saved as config_with_planner.yaml.
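
The saved file is a regular Kubernetes manifest, so deploying it is a plain kubectl apply (the namespace flag is an assumption about your environment):

$kubectl apply -f config_with_planner.yaml -n $NAMESPACE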

Direct Script Examples

Basic Profiling

$python -m benchmarks.profiler.profile_sla \
> --backend vllm \
> --config path/to/disagg.yaml \
> --model meta-llama/Llama-3-8B \
> --ttft 200 --itl 15 \
> --isl 3000 --osl 150

With GPU Constraints

$python -m benchmarks.profiler.profile_sla \
> --backend sglang \
> --config examples/backends/sglang/deploy/disagg.yaml \
> --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
> --ttft 200 --itl 15 \
> --isl 3000 --osl 150 \
> --min-num-gpus 2 \
> --max-num-gpus 8

AI Configurator (Offline)

$python -m benchmarks.profiler.profile_sla \
> --backend trtllm \
> --config path/to/disagg.yaml \
> --use-ai-configurator \
> --model Qwen/Qwen3-32B-FP8 \
> --aic-system h200_sxm \
> --ttft 200 --itl 15 \
> --isl 4000 --osl 500

SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

$# Start profiling
$curl -X POST http://localhost:9090/engine/start_profile \
> -H "Content-Type: application/json" \
> -d '{"output_dir": "/tmp/profiler_output"}'
$
$# Run inference requests to generate profiling data...
$
$# Stop profiling
$curl -X POST http://localhost:9090/engine/stop_profile

A test script is provided at examples/backends/sglang/test_sglang_profile.py:

$python examples/backends/sglang/test_sglang_profile.py

View traces using Chrome’s chrome://tracing, Perfetto UI, or TensorBoard.
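
For TensorBoard in particular, the traces written to the output directory can be loaded with the commands below, assuming the torch-tb-profiler plugin matches the trace format SGLang emits:

$pip install torch-tb-profiler
$tensorboard --logdir /tmp/profiler_output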