Planner

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

Feature Matrix

| Category | Feature | Status |
|---|---|---|
| Backend | Local (bare metal) | Deprecated |
| Backend | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| LLM Framework | TensorRT-LLM | Supported |
| LLM Framework | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| Serving Type | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| Scaling Mode | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| Load Predictors | Prophet | Supported |
| Load Predictors | Kalman filter | Supported |
| Load Predictors | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| Connectors | VirtualConnector (external environments) | Supported |

Quick Start

Deploy with DGDR (Recommended)

The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest (DGDR):

```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

This automatically profiles your model and deploys with the SLA planner. See SLA Planner Guide for the full workflow.
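
To follow progress, you can watch the DGDR resource while profiling and deployment run. The commands below are a sketch: the resource name is a placeholder, and the CRD's plural/short names may differ in your cluster.

```bash
# Watch the DGDR status while profiling and deployment proceed
kubectl get dynamographdeploymentrequests -n $NAMESPACE -w

# Inspect events and conditions if progress stalls (name is a placeholder)
kubectl describe dynamographdeploymentrequest <dgdr-name> -n $NAMESPACE
```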

Deploy with DGD (Manual)

For manual control, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
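
Once applied, the planner runs as its own pod next to the frontend and prefill/decode workers. A simple sanity check is to watch pods in the namespace and confirm worker replicas change after a few adjustment intervals; the pod name below is a placeholder.

```bash
# Watch prefill/decode worker pods being added or removed by the planner
kubectl get pods -n $NAMESPACE -w

# Tail planner logs to see its scaling decisions (pod name is a placeholder)
kubectl logs <planner-pod> -n $NAMESPACE -f
```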

Documentation

| Document | Description |
|---|---|
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA Planner Guide | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| SLA-based Planner | Scaling algorithm, correction factors, load prediction details |
| Load-based Planner | Legacy load-based scaling (deprecated) |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

Configuration Reference

Key Arguments

| Argument | Default | Description |
|---|---|---|
| --namespace | $DYN_NAMESPACE or dynamo | Dynamo logical namespace |
| --backend | vllm | Backend framework (vllm, sglang, trtllm) |
| --environment | kubernetes | Deployment environment |
| --adjustment-interval | 180 | Seconds between scaling decisions |
| --ttft | 500.0 | Target Time To First Token (ms) |
| --itl | 50.0 | Target Inter-Token Latency (ms) |
| --isl | 3000 | Expected average input sequence length |
| --osl | 150 | Expected average output sequence length |
| --load-predictor | arima | Prediction model (arima, prophet, kalman, constant) |
| --max-gpu-budget | 8 | Maximum GPUs across all workers |
| --min-endpoint | 1 | Minimum replicas per worker type |
| --decode-engine-num-gpu | 1 | GPUs per decode engine |
| --prefill-engine-num-gpu | 1 | GPUs per prefill engine |
| --no-operation | false | Observation mode (no actual scaling) |
| --no-correction | false | Disable correction factors |
| --profile-results-dir | profiling_results | Path to profiling data (NPZ/JSON) |
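
In Kubernetes deployments these flags are normally passed as container arguments on the planner component of the DGD. The fragment below is an illustrative sketch only: the component name and surrounding fields are assumptions, so copy the exact structure from your backend's disagg_planner.yaml template rather than from here.

```yaml
# Illustrative planner component fragment (field layout assumed; see your
# backend's disagg_planner.yaml for the authoritative structure).
Planner:
  replicas: 1
  extraPodSpec:
    mainContainer:
      args:
        - --environment=kubernetes
        - --backend=vllm
        - --adjustment-interval=180
        - --ttft=500
        - --itl=50
        - --load-predictor=arima
        - --max-gpu-budget=8
```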

Environment Variables

| Variable | Default | Description |
|---|---|---|
| DYN_NAMESPACE | dynamo | Dynamo logical namespace |
| DYN_PARENT_DGD_K8S_NAME | (required) | Parent DGD K8s resource name |
| PROMETHEUS_ENDPOINT | http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090 | Prometheus URL |
| PLANNER_PROMETHEUS_PORT | 0 (disabled) | Port for planner’s own Prometheus metrics |
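
These variables are set on the planner container as well. Below is a minimal sketch of the env section; the surrounding container spec is assumed, and DYN_PARENT_DGD_K8S_NAME is typically injected by the Dynamo operator rather than set by hand.

```yaml
# Illustrative env entries for the planner container (structure assumed)
env:
  - name: DYN_NAMESPACE
    value: dynamo
  - name: PROMETHEUS_ENDPOINT
    value: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
  - name: PLANNER_PROMETHEUS_PORT
    value: "9085"  # any free port; the default 0 disables the planner's own metrics
```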

Monitoring

Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:

  • Worker counts and GPU usage over time
  • Observed TTFT, ITL, request rate, sequence lengths
  • Predicted load and recommended replica counts
  • Correction factors (actual vs. expected performance)
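
To open the dashboard locally you can port-forward the Grafana service; the service name below assumes a kube-prometheus-stack install released under the name prometheus, so adjust it for your cluster.

```bash
# Port-forward Grafana (service name assumes the kube-prometheus-stack chart)
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# Then open http://localhost:3000 and search for the planner dashboard
```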

Prometheus Metrics

The planner queries the frontend’s /metrics endpoint via Prometheus. Required metrics:

  • Request count and duration
  • TTFT and ITL distributions
  • Input/output sequence lengths
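
A quick way to confirm these series are being exposed before wiring up Prometheus is to port-forward the frontend and read /metrics directly; the service name and port below are placeholders for your deployment.

```bash
# Port-forward the Dynamo frontend (service name and port are placeholders)
kubectl port-forward svc/<frontend-service> -n $NAMESPACE 8000:8000

# Look for request, TTFT/ITL, and sequence-length series the planner consumes
curl -s localhost:8000/metrics | grep -iE 'ttft|itl|request|sequence' | head
```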