Planner

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

The SLA Planner supports two scaling modes:

  • Throughput-based scaling: Uses pre-deployment profiling data and traffic prediction to compute the number of replicas needed to meet TTFT and ITL SLA targets. This is the primary scaling mode for production deployments.
  • Load-based scaling (Experimental): Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.
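The online-regression idea behind load-based scaling can be illustrated with a minimal sketch: fit observed latency against a per-worker load metric over a sliding window, then invert the fit to find the load level that saturates the SLA. All names and the closed-form fit below are illustrative assumptions, not the planner's actual implementation.

```python
# Illustrative sketch of SLA-aware load-based scaling via online linear
# regression (names and logic are hypothetical, not Dynamo's implementation).
from collections import deque


class OnlineLatencyModel:
    """Fit latency = slope * load + intercept over a sliding window."""

    def __init__(self, window: int = 50):
        self.samples = deque(maxlen=window)  # (load, latency_ms) pairs

    def observe(self, load: float, latency_ms: float) -> None:
        self.samples.append((load, latency_ms))

    def fit(self):
        n = len(self.samples)
        xs = [s[0] for s in self.samples]
        ys = [s[1] for s in self.samples]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = cov / var_x
        return slope, mean_y - slope * mean_x

    def max_load_for_sla(self, sla_ms: float) -> float:
        """Largest per-worker load that keeps predicted latency under the SLA."""
        slope, intercept = self.fit()
        return (sla_ms - intercept) / slope


model = OnlineLatencyModel(window=50)
for load, latency in [(100, 20.0), (200, 30.0), (300, 40.0)]:
    model.observe(load, latency)
# These samples fit latency = 0.1 * load + 10, so a 60 ms SLA is hit at load 500
print(model.max_load_for_sla(60.0))  # → 500.0
```

Once the SLA-saturating load per worker is known, the replica count follows from dividing total observed load by that per-worker capacity.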

When both modes are enabled, throughput-based scaling provides a lower bound on replicas (long-term capacity planning) while load-based scaling handles real-time adjustments (burst response).
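The floor-plus-burst interaction described above amounts to taking the maximum of the two recommendations and capping it by the GPU budget. A minimal sketch, assuming a hypothetical helper (this is not the planner's actual code):

```python
# Illustrative combination of the two modes (hypothetical helper): the
# throughput-based recommendation acts as a floor, load-based scaling may
# raise the count above it, and the GPU budget caps the result.
def combine_replicas(throughput_floor: int,
                     loadbased_recommendation: int,
                     min_endpoint: int = 1,
                     max_gpu_budget: int = 8,
                     gpus_per_engine: int = 1) -> int:
    replicas = max(throughput_floor, loadbased_recommendation, min_endpoint)
    return min(replicas, max_gpu_budget // gpus_per_engine)


print(combine_replicas(throughput_floor=2, loadbased_recommendation=5))   # → 5
print(combine_replicas(throughput_floor=2, loadbased_recommendation=12))  # → 8
```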

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

Feature Matrix

| Feature | Throughput-Based | Load-Based (Experimental) |
| --- | --- | --- |
| **Deployment** | | |
| Disaggregated | Supported | Supported |
| Aggregated | Unsupported | Supported |
| **LLM Framework** | | |
| vLLM | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| SGLang | Supported | Supported |
| Requires Profiling Data | Yes | No |
| Load Predictors | ARIMA, Prophet, Kalman, Constant | N/A |
| **Connectors** | | |
| KubernetesConnector | Supported | Supported |
| VirtualConnector | Supported | Supported |

When to Use Which Mode

  • Throughput-based scaling should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
  • Load-based scaling should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
  • Both modes together: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer --adjustment-interval for throughput-based scaling.
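As a sketch, enabling both modes in the planner service's args might look like the following. The flag names come from the Configuration Reference below; the 300-second interval is an illustrative choice, not a recommended value:

```yaml
args:
  - --enable-throughput-scaling
  - --enable-loadbased-scaling
  - --adjustment-interval=300          # longer interval for the throughput floor
  - --loadbased-adjustment-interval=5  # fast reaction to bursts
```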

Quick Start

Prerequisites

For throughput-based scaling, pre-deployment profiling is required (see the Profiling Guide).

Throughput-Based Scaling (with DGDR)

The fastest path to a throughput-based planner deployment is through a DynamoGraphDeploymentRequest, which automatically profiles your model:

```bash
kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

See Planner Guide for the full workflow.

Load-Based Scaling (without profiling)

To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DGD:

```yaml
args:
  - --enable-loadbased-scaling
  - --disable-throughput-scaling
  - --loadbased-adjustment-interval=5
```

The planner will auto-discover the frontend metrics endpoint from the DGD. See disagg_planner_load.yaml for a complete example.

Manual DGD Deployment

For manual control with throughput-based scaling, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```

Documentation

| Document | Description |
| --- | --- |
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

Configuration Reference

Key Arguments

| Argument | Default | Description |
| --- | --- | --- |
| **Common** | | |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| **Throughput-based scaling** | | |
| `--enable-throughput-scaling` | `true` | Enable throughput-based scaling |
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--no-correction` | `false` | Disable correction factors |
| **Load-based scaling (Experimental)** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
| `--loadbased-router-metrics-url` | auto-discovered | URL to the router's `/metrics` endpoint |
| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
| `--loadbased-learning-window` | `50` | Sliding window size for the regression model |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0 = never, 100 = aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for the planner's own Prometheus metrics |

Monitoring

Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:

  • Worker counts and GPU usage over time
  • Observed TTFT, ITL, request rate, sequence lengths
  • Predicted load and recommended replica counts
  • Correction factors (actual vs. expected performance)
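The correction-factor panel compares observed latency against the profiled expectation; conceptually it is a ratio used to deflate the capacity that profiling attributed to each worker. A minimal illustration (a hypothetical formula, not the planner's actual one):

```python
# Hypothetical illustration of a correction factor: observed vs. expected
# latency, used to deflate the throughput a worker is assumed to sustain.
def corrected_capacity(profiled_throughput: float,
                       expected_ttft_ms: float,
                       observed_ttft_ms: float) -> float:
    correction = observed_ttft_ms / expected_ttft_ms  # >1 means slower than profiled
    return profiled_throughput / correction


# Profiling predicted 1000 tok/s at 400 ms TTFT, but we observe 500 ms:
print(corrected_capacity(1000.0, 400.0, 500.0))  # → 800.0
```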

Prometheus Metrics

Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:

  • Request count and duration
  • TTFT and ITL distributions
  • Input/output sequence lengths
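Pulling such metrics boils down to PromQL queries against Prometheus's standard `/api/v1/query` HTTP endpoint. A self-contained sketch, where the metric name `dynamo_frontend_requests_total` is an assumption for illustration (the query API itself is standard Prometheus):

```python
# Sketch of pulling a traffic metric from Prometheus over its HTTP API.
# The metric name `dynamo_frontend_requests_total` is hypothetical; the
# /api/v1/query endpoint is standard Prometheus.
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_endpoint: str, promql: str) -> str:
    return (prometheus_endpoint.rstrip("/") + "/api/v1/query?"
            + urllib.parse.urlencode({"query": promql}))


def query_request_rate(prometheus_endpoint: str) -> float:
    url = build_query_url(prometheus_endpoint,
                          "sum(rate(dynamo_frontend_requests_total[1m]))")
    with urllib.request.urlopen(url) as resp:  # network call to Prometheus
        body = json.load(resp)
    # Instant-query results carry [timestamp, value] pairs; value is a string.
    return float(body["data"]["result"][0]["value"][1])


print(build_query_url("http://prometheus:9090", "up"))
# → http://prometheus:9090/api/v1/query?query=up
```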

Load-based scaling pulls per-engine status directly from the frontend’s /metrics endpoint:

  • Active prefill tokens per worker
  • Active decode blocks per worker
  • Last observed TTFT, ITL, and ISL per worker
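These per-worker gauges arrive in the standard Prometheus text exposition format, so scraping them amounts to parsing labeled samples. A minimal sketch, where the metric and label names (`worker_active_prefill_tokens`, `worker_id`, etc.) are hypothetical stand-ins, not the frontend's actual names:

```python
# Sketch of parsing per-worker gauges from a Prometheus-format /metrics
# endpoint. Metric and label names here are hypothetical.
import re

SAMPLE = """\
worker_active_prefill_tokens{worker_id="w0"} 1024
worker_active_prefill_tokens{worker_id="w1"} 2048
worker_active_kv_blocks{worker_id="w0"} 300
"""

PATTERN = re.compile(r'^(\w+)\{worker_id="([^"]+)"\}\s+([0-9.eE+-]+)$',
                     re.MULTILINE)


def parse_worker_gauges(text: str) -> dict:
    """Return {metric_name: {worker_id: value}} from exposition-format text."""
    out: dict = {}
    for metric, worker, value in PATTERN.findall(text):
        out.setdefault(metric, {})[worker] = float(value)
    return out


print(parse_worker_gauges(SAMPLE)["worker_active_prefill_tokens"]["w1"])  # → 2048.0
```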