Tier 3 design documentation for contributors and architects. For user-facing docs, see docs/components/planner/.
The Planner is Dynamo’s autoscaling controller. It supports two scaling modes: throughput-based (using profiling data and traffic prediction) and load-based (using real-time engine metrics and online regression). This document covers the internal architecture, algorithms, and design trade-offs for both modes.
Every adjustment_interval seconds, the planner queries Prometheus for:
The Prometheus query targets the Frontend’s /metrics endpoint, which exposes histograms and counters.
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
These factors account for hard-to-model effects such as:
The correction factors are applied as multipliers to the next scaling decision. Setting --no-correction disables this for debugging or when cold-start artifacts dominate.
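As an illustration of how a multiplicative correction factor can behave, here is a minimal sketch. It assumes the factor is the plain ratio of observed to predicted latency; that formula, the `CorrectionFactor` class, and the `enabled` switch are hypothetical names used for illustration, not the planner's actual code.

```python
from dataclasses import dataclass


@dataclass
class CorrectionFactor:
    """Hypothetical helper: ratio of observed to predicted latency."""

    value: float = 1.0  # 1.0 means profiling data matches observed behavior

    def update(self, observed_ms: float, predicted_ms: float) -> None:
        # Assumption: the factor is simply observed / predicted.
        if predicted_ms > 0:
            self.value = observed_ms / predicted_ms

    def apply(self, predicted_ms: float, enabled: bool = True) -> float:
        # enabled=False mimics the effect of running with --no-correction.
        return predicted_ms * self.value if enabled else predicted_ms


factor = CorrectionFactor()
factor.update(observed_ms=240.0, predicted_ms=200.0)
corrected = factor.apply(200.0)
uncorrected = factor.apply(200.0, enabled=False)
```

With the correction disabled, the profiling-based prediction passes through unchanged, which is the behavior you would want when cold-start measurements would otherwise skew the ratio.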
The planner forecasts three values for the next interval:
next_num_req: Number of requests
next_isl: Average input sequence length
next_osl: Average output sequence length
Four predictor implementations are available:
All predictors support warm-starting from trace files (--load-predictor-warmup-trace).
Prefill replicas:
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
Decode replicas:
The planner calls connector.set_component_replicas() with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
Design decisions:
DYN_PARENT_DGD_K8S_NAME to find its parent DGD (injected by operator)
subComponentType field (prefill/decode), with fallback to legacy component names
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via VirtualConnectorCoordinator (Rust binding). External systems use VirtualConnectorClient to poll decisions and report completion.
Scaling decision flow:
(num_prefill, num_decode, decision_id) to runtime
client.wait()
client.complete(decision)
scaled_decision_id >= decision_id and proceeds
Timeout: If scaling isn’t acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
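The acknowledgment check in this flow can be sketched as follows. This is a simplified, in-memory model: `InMemoryRuntime`, `ScalingDecision`, and `wait_for_ack` are hypothetical stand-ins and do not reflect the real VirtualConnectorCoordinator or VirtualConnectorClient APIs. Only the `scaled_decision_id >= decision_id` comparison and the timeout behavior are taken from the description above.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingDecision:
    num_prefill: int
    num_decode: int
    decision_id: int


class InMemoryRuntime:
    """Hypothetical stand-in for the distributed runtime store."""

    def __init__(self) -> None:
        self.latest: Optional[ScalingDecision] = None
        self.scaled_decision_id = -1

    def write(self, decision: ScalingDecision) -> None:
        # Planner side: publish the newest decision.
        self.latest = decision

    def complete(self, decision: ScalingDecision) -> None:
        # External system side: report that scaling finished.
        self.scaled_decision_id = decision.decision_id


def wait_for_ack(runtime: InMemoryRuntime, decision_id: int,
                 timeout_s: float = 1800.0, poll_s: float = 0.01) -> bool:
    """Return True once the decision is acknowledged, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if runtime.scaled_decision_id >= decision_id:
            return True
        time.sleep(poll_s)
    return runtime.scaled_decision_id >= decision_id


runtime = InMemoryRuntime()
runtime.write(ScalingDecision(num_prefill=2, num_decode=4, decision_id=7))
timed_out = not wait_for_ack(runtime, 7, timeout_s=0.05)  # nobody acked yet
runtime.complete(runtime.latest)
acked = wait_for_ack(runtime, 7, timeout_s=0.05)
```

Comparing with `>=` rather than `==` means a late acknowledgment of a newer decision also unblocks the planner, so a skipped intermediate decision cannot stall it.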
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
Two interpolators are maintained:
The interpolators use the profiling sweep granularity to determine precision. Finer granularity requires more profiling samples but yields more accurate interpolation.
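To make the granularity trade-off concrete, here is a minimal sketch of piecewise-linear interpolation over a profiling sweep. The sample points are invented for illustration, and the one-dimensional ISL -> TTFT mapping is a simplification; the planner's interpolators and NPZ layout may differ.

```python
from bisect import bisect_right
from typing import Sequence


def interpolate(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Piecewise-linear interpolation, clamped to the sampled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, x)  # first sample strictly greater than x
    lo = hi - 1
    frac = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


# Hypothetical sweep: input sequence length -> measured TTFT in ms.
isl_samples = [1000.0, 2000.0, 4000.0, 8000.0]
ttft_samples_ms = [50.0, 110.0, 250.0, 600.0]

estimate_ms = interpolate(3000.0, isl_samples, ttft_samples_ms)
below_range_ms = interpolate(500.0, isl_samples, ttft_samples_ms)
```

Each extra sample shortens the segments over which the curve is assumed linear, which is why a finer sweep improves accuracy at the cost of more profiling runs.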
The planner currently waits 30 seconds (INIT_PLANNER_START_DELAY in components/src/dynamo/planner/__main__.py) as a temporary workaround while other components (frontend, workers) register and stabilize; see Known Limitations for the planned readiness-probing replacement.
After the delay:
--environment)
If adjustment_interval is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. The default of 180s is conservative; workloads with fast model loading can use shorter intervals.
The --no-correction flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
Higher values of prefillInterpolationGranularity and decodeInterpolationGranularity in the profiling sweep produce more accurate interpolation but increase profiling time linearly. The default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
--kalman-min-points observations. During warm-up, the planner uses the constant predictor as fallback.
The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router.
Each engine emits per-iteration ForwardPassMetrics via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via FpmEventSubscriber with automatic engine discovery and MDC-based lifecycle tracking. Key fields used:
Each tick, the scaling state machine fills TickDiagnostics with intermediate decision data (estimated latencies, predicted load, per-engine RPS, and decision reasons) via internal _diag_* fields. The adapter layer reads this from PlannerEffects.diagnostics and:
dynamo_planner_estimated_ttft_ms and related estimates)
dynamo_planner_load_scaling_decision)
DiagnosticsRecorder, which accumulates per-tick snapshots and emits Plotly-based HTML reports on a schedule
Per-engine FPM queue depths from _collect_fpm() are exported as labeled Prometheus gauges.
Three specialized regression models (fpm_regression.py):
sum_prefill_tokens -> wall_time. Estimates TTFT by simulating chunked prefill scheduling (chunks of max_num_batched_tokens).
sum_decode_kv_tokens -> wall_time. Estimates ITL for total decode load (scheduled + queued + avg decode length).
(sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill).
When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor.
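The prefill model (sum_prefill_tokens -> wall_time) and its chunked-prefill simulation can be sketched roughly as below. This is not the code in fpm_regression.py: it uses a batch least-squares fit rather than online regression, the sample data is invented, and `fit_line` and `estimate_ttft_ms` are hypothetical names.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_ttft_ms(queued_prefill_tokens: int, max_num_batched_tokens: int,
                     slope: float, intercept: float) -> float:
    """Sum predicted forward-pass times over successive prefill chunks."""
    total_ms = 0.0
    remaining = queued_prefill_tokens
    while remaining > 0:
        chunk = min(remaining, max_num_batched_tokens)
        total_ms += slope * chunk + intercept  # one forward pass per chunk
        remaining -= chunk
    return total_ms


# Hypothetical observations: prefill tokens per pass -> wall time in ms.
tokens = [1000.0, 2000.0, 4000.0]
wall_ms = [12.0, 22.0, 42.0]
slope, intercept = fit_line(tokens, wall_ms)

ttft_ms = estimate_ttft_ms(5000, 2048, slope, intercept)
```

Because each chunk pays the fixed per-pass intercept, the estimate grows with the number of chunks as well as with the token count, which a single linear evaluation on the total would miss.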
In aggregated mode (--mode agg), engines handle both prefill and decode via chunked prefill. The planner maintains both TTFT and ITL regression models but uses per-worker time-averaged metrics (not instantaneous) for regression training to smooth out chunked prefill noise. Scale up if either prefill or decode signals overload; scale down only if both signal underload.
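The asymmetric rule for aggregated mode can be written as a small decision function. The `Signal` enum and the +1/0/-1 return convention are assumptions made for this sketch; only the "either overloaded scales up, both underloaded scales down" logic comes from the text above.

```python
from enum import Enum


class Signal(Enum):
    OVERLOAD = "overload"
    OK = "ok"
    UNDERLOAD = "underload"


def aggregated_decision(prefill: Signal, decode: Signal) -> int:
    """Return +1 to scale up, -1 to scale down, 0 to hold."""
    # Scale up if either phase reports overload.
    if Signal.OVERLOAD in (prefill, decode):
        return 1
    # Scale down only when both phases report underload.
    if prefill is Signal.UNDERLOAD and decode is Signal.UNDERLOAD:
        return -1
    return 0


up = aggregated_decision(Signal.OVERLOAD, Signal.UNDERLOAD)
hold = aggregated_decision(Signal.UNDERLOAD, Signal.OK)
down = aggregated_decision(Signal.UNDERLOAD, Signal.UNDERLOAD)
```

The asymmetry biases the planner toward protecting the SLA: a single overloaded phase is enough to add capacity, while removing capacity requires agreement from both.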
If adjustment_interval < time to scale, scaling decisions can pile up. The planner logs warnings but doesn’t queue.