Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.
Feature Matrix
Quick Start
Prerequisites
- Dynamo platform installed on Kubernetes (Installation Guide)
- kube-prometheus-stack installed (Metrics Setup)
- Pre-deployment profiling completed (Profiling Guide)
Deploy with DGDR (Recommended)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
A DGDR automatically profiles your model and deploys it with the SLA planner. See the SLA Planner Guide for the full workflow.
Deploy with DGD (Manual)
For manual control, use the disaggregated planner templates:
Documentation
Configuration Reference
Key Arguments
Environment Variables
Monitoring
Grafana Dashboard
Deploy the planner dashboard:
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT (time to first token), ITL (inter-token latency), request rate, and sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
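To make the last two dashboard panels concrete, the following minimal Python sketch shows one way a correction factor and a predicted load could combine into a replica recommendation. It is an illustration only: the function names, the simple capacity model, and the numbers are assumptions, not the planner's actual algorithm.

```python
import math


def correction_factor(observed_latency_ms: float, expected_latency_ms: float) -> float:
    """Ratio of observed to expected latency.

    A value above 1.0 means workers are slower than the profile predicted.
    (Hypothetical helper for illustration.)
    """
    if expected_latency_ms <= 0:
        raise ValueError("expected_latency_ms must be positive")
    return observed_latency_ms / expected_latency_ms


def recommended_replicas(
    predicted_load: float,
    capacity_per_replica: float,
    factor: float = 1.0,
    min_replicas: int = 1,
) -> int:
    """Replicas needed to serve predicted_load.

    Assumes each replica's profiled capacity shrinks in proportion to the
    correction factor. Load and capacity must use the same unit
    (for example, requests per second).
    """
    if capacity_per_replica <= 0 or factor <= 0:
        raise ValueError("capacity_per_replica and factor must be positive")
    effective_capacity = capacity_per_replica / factor
    return max(min_replicas, math.ceil(predicted_load / effective_capacity))


# Example: workers run 25% slower than profiled, so each one handles
# 8 req/s instead of 10 req/s, and 50 req/s needs 7 replicas.
factor = correction_factor(observed_latency_ms=125.0, expected_latency_ms=100.0)
replicas = recommended_replicas(predicted_load=50.0, capacity_per_replica=10.0, factor=factor)
```

Rounding up and enforcing a minimum replica count keeps the sketch conservative: it never recommends fewer workers than the predicted load requires under the assumed capacity model.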
Prometheus Metrics
The planner reads frontend metrics through Prometheus, which scrapes the frontend’s /metrics endpoint. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
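As a rough sketch of how such metrics can be read back from Prometheus, the Python below builds a percentile query over a histogram and parses an instant-query response from the standard Prometheus HTTP API (`/api/v1/query`). The metric name `example_ttft_seconds` and the Prometheus address are placeholders, not the names the planner actually uses; check your frontend's /metrics output for the real ones.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def quantile_query(metric: str, quantile: float = 0.95, window: str = "1m") -> str:
    """PromQL for a latency percentile computed from a histogram metric."""
    return f"histogram_quantile({quantile}, sum(rate({metric}_bucket[{window}])) by (le))"


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_value(response_body: str) -> Optional[float]:
    """Extract the first sample value from an instant-query JSON response.

    Returns None when the query matched no series.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    return float(result[0]["value"][1]) if result else None


# Placeholder metric name and Prometheus address (assumptions for illustration).
promql = quantile_query("example_ttft_seconds")
url = instant_query_url("http://prometheus:9090", promql)

# A response shaped like the Prometheus API's instant-vector result.
sample = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"0.25"]}]}}'
p95_ttft_seconds = first_value(sample)
```

Fetching `url` with any HTTP client and passing the response body to `first_value` yields the current 95th-percentile value, which is a quick way to confirm that the required metrics are being scraped before enabling the planner.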