Planner

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

Feature Matrix

| Category | Feature | Status |
|---|---|---|
| Backend | Local (bare metal) | Deprecated |
| Backend | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| LLM Framework | TensorRT-LLM | Supported |
| LLM Framework | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| Serving Type | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| Scaling Mode | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| Load Predictors | Prophet | Supported |
| Load Predictors | Kalman filter | Supported |
| Load Predictors | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| Connectors | VirtualConnector (external environments) | Supported |

Quick Start

Deploy with DGDR (Recommended)

The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest (DGDR):

```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

This automatically profiles your model and deploys with the SLA planner. See SLA Planner Guide for the full workflow.
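
To follow progress, you can watch the DGDR resource while profiling and deployment run. The commands below are a sketch: the resource name is a placeholder, and the CRD's plural/short names may differ in your cluster.

```bash
# Watch the DGDR status while profiling and deployment proceed
kubectl get dynamographdeploymentrequests -n $NAMESPACE -w

# Inspect events and conditions if progress stalls (name is a placeholder)
kubectl describe dynamographdeploymentrequest <dgdr-name> -n $NAMESPACE
```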

Deploy with DGD (Manual)

For manual control, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
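
Once applied, the planner runs as its own pod next to the frontend and prefill/decode workers. A simple sanity check is to watch pods in the namespace and confirm worker replicas change after a few adjustment intervals; the pod name below is a placeholder.

```bash
# Watch prefill/decode worker pods being added or removed by the planner
kubectl get pods -n $NAMESPACE -w

# Tail planner logs to see its scaling decisions (pod name is a placeholder)
kubectl logs <planner-pod> -n $NAMESPACE -f
```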

Documentation

| Document | Description |
|---|---|
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA Planner Guide | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| SLA-based Planner | Scaling algorithm, correction factors, load prediction details |
| Load-based Planner | Legacy load-based scaling (deprecated) |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

Configuration Reference

Key Arguments

| Argument | Default | Description |
|---|---|---|
| --namespace | $DYN_NAMESPACE or dynamo | Dynamo logical namespace |
| --backend | vllm | Backend framework (vllm, sglang, trtllm) |
| --environment | kubernetes | Deployment environment |
| --adjustment-interval | 180 | Seconds between scaling decisions |
| --ttft | 500.0 | Target Time To First Token (ms) |
| --itl | 50.0 | Target Inter-Token Latency (ms) |
| --isl | 3000 | Expected average input sequence length |
| --osl | 150 | Expected average output sequence length |
| --load-predictor | arima | Prediction model (arima, prophet, kalman, constant) |
| --max-gpu-budget | 8 | Maximum GPUs across all workers |
| --min-endpoint | 1 | Minimum replicas per worker type |
| --decode-engine-num-gpu | 1 | GPUs per decode engine |
| --prefill-engine-num-gpu | 1 | GPUs per prefill engine |
| --no-operation | false | Observation mode (no actual scaling) |
| --no-correction | false | Disable correction factors |
| --profile-results-dir | profiling_results | Path to profiling data (NPZ/JSON) |
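
In Kubernetes deployments these flags are normally passed as container arguments on the planner component of the DGD. The fragment below is an illustrative sketch only: the component name and surrounding fields are assumptions, so copy the exact structure from your backend's disagg_planner.yaml template rather than from here.

```yaml
# Illustrative planner component fragment (field layout assumed; see your
# backend's disagg_planner.yaml for the authoritative structure).
Planner:
  replicas: 1
  extraPodSpec:
    mainContainer:
      args:
        - --environment=kubernetes
        - --backend=vllm
        - --adjustment-interval=180
        - --ttft=500
        - --itl=50
        - --load-predictor=arima
        - --max-gpu-budget=8
```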

Environment Variables

| Variable | Default | Description |
|---|---|---|
| DYN_NAMESPACE | dynamo | Dynamo logical namespace |
| DYN_PARENT_DGD_K8S_NAME | (required) | Parent DGD K8s resource name |
| PROMETHEUS_ENDPOINT | http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090 | Prometheus URL |
| PLANNER_PROMETHEUS_PORT | 0 (disabled) | Port for planner’s own Prometheus metrics |
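
These variables are set on the planner container as well. Below is a minimal sketch of the env section; the surrounding container spec is assumed, and DYN_PARENT_DGD_K8S_NAME is typically injected by the Dynamo operator rather than set by hand.

```yaml
# Illustrative env entries for the planner container (structure assumed)
env:
  - name: DYN_NAMESPACE
    value: dynamo
  - name: PROMETHEUS_ENDPOINT
    value: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
  - name: PLANNER_PROMETHEUS_PORT
    value: "9085"  # any free port; the default 0 disables the planner's own metrics
```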

Monitoring

Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:

  • Worker counts and GPU usage over time
  • Observed TTFT, ITL, request rate, sequence lengths
  • Predicted load and recommended replica counts
  • Correction factors (actual vs. expected performance)
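
To open the dashboard locally you can port-forward the Grafana service; the service name below assumes a kube-prometheus-stack install released under the name prometheus, so adjust it for your cluster.

```bash
# Port-forward Grafana (service name assumes the kube-prometheus-stack chart)
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# Then open http://localhost:3000 and search for the planner dashboard
```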

Prometheus Metrics

The planner queries the frontend’s /metrics endpoint via Prometheus. Required metrics:

  • Request count and duration
  • TTFT and ITL distributions
  • Input/output sequence lengths
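
A quick way to confirm these series are being exposed before wiring up Prometheus is to port-forward the frontend and read /metrics directly; the service name and port below are placeholders for your deployment.

```bash
# Port-forward the Dynamo frontend (service name and port are placeholders)
kubectl port-forward svc/<frontend-service> -n $NAMESPACE 8000:8000

# Look for request, TTFT/ITL, and sequence-length series the planner consumes
curl -s localhost:8000/metrics | grep -iE 'ttft|itl|request|sequence' | head
```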