Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
The SLA Planner supports two scaling modes:
- Throughput-based scaling: Uses pre-deployment profiling data and traffic prediction to compute the number of replicas needed to meet TTFT and ITL SLA targets. This is the primary scaling mode for production deployments.
- Load-based scaling (Experimental): Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.
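To illustrate the load-based idea, the sketch below fits an online least-squares line of observed latency against per-worker load and uses it to predict latency at a candidate load. This is a minimal illustration, not the planner's actual implementation; the class and all names in it are invented for this example.

```python
class OnlineLinearModel:
    """Online least-squares fit of latency vs. per-worker load.

    Hypothetical sketch: keeps running sums so each (load, latency)
    observation updates the fit in O(1) without storing history.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, load: float, latency_ms: float) -> None:
        # Accumulate sufficient statistics for simple linear regression.
        self.n += 1
        self.sx += load
        self.sy += latency_ms
        self.sxx += load * load
        self.sxy += load * latency_ms

    def predict(self, load: float) -> float:
        # Closed-form slope/intercept from the running sums.
        denom = self.n * self.sxx - self.sx * self.sx
        if self.n < 2 or denom == 0.0:
            return self.sy / self.n if self.n else 0.0
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept + slope * load


model = OnlineLinearModel()
for load, ttft in [(0.0, 10.0), (10.0, 30.0)]:
    model.update(load, ttft)
predicted = model.predict(5.0)  # -> 20.0 (linear fit through the two samples)
```

A planner using such a model can invert the fitted line to estimate the highest per-worker load that still meets the TTFT/ITL target, then scale replicas accordingly.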
When both modes are enabled, throughput-based scaling provides a lower bound on replicas (long-term capacity planning) while load-based scaling handles real-time adjustments (burst response).
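In effect, combining the two modes amounts to taking the maximum of the two recommendations, clamped to any replica cap. A hypothetical sketch (the function and parameter names are ours, not Dynamo's):

```python
def target_replicas(throughput_floor: int, load_based: int, max_replicas: int) -> int:
    """Combine the two scaling modes.

    The throughput-based estimate acts as a lower bound (long-term
    capacity); the load-based estimate can raise the count above that
    floor during bursts, up to a cluster-imposed cap.
    """
    return min(max(throughput_floor, load_based), max_replicas)
```

For example, with a throughput floor of 3 replicas, a load-based recommendation of 5 yields 5, while a recommendation of 1 still yields the floor of 3.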
New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.
Feature Matrix
When to Use Which Mode
- Throughput-based scaling should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
- Load-based scaling should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
- Both modes together: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
Quick Start
Prerequisites
- Dynamo platform installed on Kubernetes (Installation Guide)
- kube-prometheus-stack installed (Metrics Setup)
For throughput-based scaling, pre-deployment profiling is also required (Profiling Guide).
Throughput-Based Scaling (with DGDR)
The fastest path to a throughput-based planner deployment is through a DynamoGraphDeploymentRequest, which automatically profiles your model:
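A DGDR manifest might look roughly like the following. This is an illustrative sketch only; the field names under `spec` are assumptions, so consult the Planner Guide for the actual schema.

```yaml
# Illustrative sketch -- field names under spec are assumptions,
# not the authoritative DGDR schema.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-sla        # hypothetical name
spec:
  model: my-org/my-model    # hypothetical model reference
  slaTargets:               # hypothetical field names
    ttft: 200ms
    itl: 20ms
```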
See Planner Guide for the full workflow.
Load-Based Scaling (without profiling)
To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DGD:
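A sketch of what the planner service section of a DGD might contain is shown below. Apart from `--adjustment-interval`, which appears earlier in this document, the flag names here are assumptions; `disagg_planner_load.yaml` is the authoritative example.

```yaml
# Sketch only -- flag names other than --adjustment-interval are
# assumptions; see disagg_planner_load.yaml for the real arguments.
services:
  Planner:
    extraPodSpec:
      mainContainer:
        args:
          - --load-based-scaling=true   # hypothetical flag
          - --adjustment-interval=60
```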
The planner will auto-discover the frontend metrics endpoint from the DGD. See disagg_planner_load.yaml for a complete example.
Manual DGD Deployment
For manual control with throughput-based scaling, use the disaggregated planner templates:
Documentation
Configuration Reference
Key Arguments
Environment Variables
Monitoring
Grafana Dashboard
Deploy the planner dashboard:
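With the common kube-prometheus-stack sidecar setup, deployment typically means creating a labeled ConfigMap from the dashboard JSON. The file path below is hypothetical; substitute the dashboard JSON shipped with your Dynamo checkout.

```shell
# Hypothetical path -- locate the planner dashboard JSON in your checkout.
kubectl create configmap planner-dashboard \
  --from-file=planner.json=deploy/grafana-dashboards/planner.json \
  -n monitoring
# The grafana_dashboard label lets the Grafana sidecar auto-load it.
kubectl label configmap planner-dashboard grafana_dashboard="1" -n monitoring
```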
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
Prometheus Metrics
Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
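As an illustration, a query over such metrics might compute a tail-latency figure like the following. The metric name is an assumption for illustration, not Dynamo's actual metric name.

```promql
# Hypothetical metric name -- check your Prometheus server for the real one.
histogram_quantile(0.95, sum by (le) (rate(ttft_seconds_bucket[5m])))
```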
Load-based scaling pulls per-engine status directly from the frontend’s /metrics endpoint:
- Active prefill tokens per worker
- Active decode blocks per worker
- Last observed TTFT, ITL, and ISL per worker
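Because the frontend exposes these values in the Prometheus text exposition format, a consumer can read per-worker load with a small parser. The sketch below handles only the simple `name{labels} value` line shape, and the metric names in the sample payload are hypothetical, not the frontend's actual names.

```python
def parse_prometheus_text(payload: str) -> dict:
    """Parse a tiny subset of the Prometheus text format into
    {'name{labels}': value}. Skips comments, HELP/TYPE lines, and blanks.
    """
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space: everything before it is the metric
        # name plus its label set, everything after is the sample value.
        name_part, _, value = line.rpartition(" ")
        metrics[name_part] = float(value)
    return metrics


# Hypothetical sample payload -- metric names are illustrative only.
sample = """\
# HELP worker_active_prefill_tokens Active prefill tokens per worker
worker_active_prefill_tokens{worker="prefill-0"} 4096
worker_active_kv_blocks{worker="decode-0"} 1287
"""
loads = parse_prometheus_text(sample)
```

A load-based scaler would poll the frontend's `/metrics` endpoint on each adjustment interval and feed the per-worker values into its scaling decision.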