Autoscaling
This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
Example DGD
All examples in this guide use the following DGD:
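The full manifest lives in the repository; a condensed sketch of its shape (worker image, model, and resource settings omitted, and the `apiVersion` assumed from the operator's API group — confirm against your installed CRDs):

```yaml
apiVersion: nvidia.com/v1alpha1        # assumed; verify with kubectl api-resources
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      # frontend container, port, and resource settings omitted
    decode:
      # sglang worker image, model, and GPU resources omitted
```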
Key identifiers:
- DGD name: `sglang-agg`
- Namespace: `default`
- Services: `Frontend`, `decode`
- `dynamo_namespace` label: `default-sglang-agg` (used for metric filtering)
Overview
Dynamo provides flexible autoscaling through the DynamoGraphDeploymentScalingAdapter (DGDSA) resource. When you deploy a DGD, the operator automatically creates one adapter per service (unless explicitly disabled). These adapters implement the Kubernetes Scale subresource, enabling integration with standard autoscalers such as the Horizontal Pod Autoscaler (HPA), KEDA, and the Dynamo Planner.
> ⚠️ **Deprecation Notice**: The `spec.services[X].autoscaling` field in DGD is deprecated and ignored. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you’ll see a warning. Remove the field to silence the warning.
Architecture
How it works:
- You deploy a DGD with services (Frontend, decode)
- The operator auto-creates one DGDSA per service
- Autoscalers (KEDA, HPA, Planner) target the adapters via the `/scale` subresource
- The adapter controller syncs replica changes to the DGD
- DGD controller reconciles the underlying pods
Viewing Scaling Adapters
After deploying the sglang-agg DGD, verify the auto-created adapters:
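For example (the full resource name is used here; short names, if registered, also work):

```bash
kubectl get dynamographdeploymentscalingadapters -n default
```

You should see one adapter per service, named after the DGD and service (this guide later refers to the decode adapter as `sglang-agg-decode`).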
Replica Ownership Model
When DGDSA is enabled (the default), it becomes the source of truth for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
How It Works
- **DGDSA owns replicas**: Autoscalers (HPA, KEDA, Planner) update the DGDSA’s `spec.replicas`
- **DGDSA syncs to DGD**: The DGDSA controller writes the replica count to the DGD’s service
- **Direct DGD edits blocked**: A validating webhook prevents users from directly editing `spec.services[X].replicas` in the DGD
- **Controllers allowed**: Only authorized controllers (operator, Planner) can modify DGD replicas
Manual Scaling with DGDSA Enabled
When DGDSA is enabled, use kubectl scale on the adapter (not the DGD):
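For the decode service of the example DGD, that looks like:

```bash
kubectl scale dynamographdeploymentscalingadapter sglang-agg-decode -n default --replicas=3
```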
Enabling DGDSA for a Service
The operator creates a DGDSA for each service by default. If the adapter was explicitly disabled for a service (allowing direct replica management via the DGD), re-enable it to use HPA, KEDA, or Planner:
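A sketch of the per-service setting (the `scalingAdapter.enabled` field is referenced elsewhere in this guide; confirm the exact schema against your operator version):

```yaml
spec:
  services:
    decode:
      scalingAdapter:
        enabled: true
```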
When to enable DGDSA:
- You want to use HPA, KEDA, or Planner for autoscaling
- You want a clear separation between “desired scale” (adapter) and “deployment config” (DGD)
- You want protection against accidental direct replica edits
When to disable DGDSA:
- You want simple, manual replica management
- You don’t need autoscaling for that service
- You prefer direct DGD edits over adapter-based scaling
Autoscaling with Dynamo Planner
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
When to use Planner:
- You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT < 500ms)
How Planner works:
Planner is deployed as a service component within your DGD. It:
- Queries Prometheus for frontend metrics (request rate, latency, etc.)
- Uses profiling data to predict optimal replica counts
- Scales prefill/decode workers to meet SLA targets
Deployment:
The recommended way to deploy Planner is via DynamoGraphDeploymentRequest (DGDR). See the SLA Planner Quick Start for complete instructions.
Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the SLA Planner documentation.
Autoscaling with Kubernetes HPA
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ native autoscaling solution.
When to use HPA:
- You have simple, predictable scaling requirements
- You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling
For custom metrics (like TTFT or queue depth), consider using KEDA instead - it’s simpler to configure.
Basic HPA (CPU-based)
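A minimal CPU-based HPA targeting the decode adapter might look like the following; the adapter's `apiVersion` is an assumption (check with `kubectl api-resources`), and the thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-cpu
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1   # assumed; verify with kubectl api-resources
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```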
HPA with Dynamo Metrics
Dynamo exports several metrics useful for autoscaling. These are available at the /metrics endpoint on each frontend pod.
See also: For a complete list of all Dynamo metrics, see the Metrics Reference. For Prometheus and Grafana setup, see the Prometheus and Grafana Setup Guide.
Available Dynamo Metrics
Metric Labels
Dynamo metrics include labels for filtering; the most important one here is `dynamo_namespace`. When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.
Example: Scale Decode Service Based on TTFT
Using HPA with Prometheus Adapter requires configuring external metrics.
Step 1: Configure Prometheus Adapter
Add this to your Helm values file (e.g., prometheus-adapter-values.yaml):
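A sketch of an external-metrics rule for p95 TTFT. The histogram name comes from this guide; the exposed name `dynamo_ttft_p95_seconds` and the 5m rate window are illustrative choices:

```yaml
rules:
  external:
  - seriesQuery: '{__name__="dynamo_frontend_time_to_first_token_seconds_bucket"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "dynamo_ttft_p95_seconds"
    metricsQuery: |
      histogram_quantile(0.95,
        sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
        by (dynamo_namespace, le))
```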
Step 2: Install Prometheus Adapter
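For instance, using the community Helm chart:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter-values.yaml
```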
Step 3: Verify the metric is available
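Assuming your adapter rule exposes a p95 TTFT metric named `dynamo_ttft_p95_seconds` (a placeholder; use whatever name your values file defines):

```bash
# List all external metrics the adapter serves
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq '.resources[].name'

# Query the TTFT metric in the default namespace
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/dynamo_ttft_p95_seconds" | jq .
```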
Step 4: Create the HPA
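A sketch of the external-metric HPA. The metric name `dynamo_ttft_p95_seconds` is a placeholder for whatever your Prometheus Adapter rule exposes, and the DGDSA `apiVersion` is assumed:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-ttft
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1   # assumed; verify with kubectl api-resources
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_ttft_p95_seconds
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
      target:
        type: Value
        value: "500m"   # 0.5 seconds (p95 TTFT target)
```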
How it works:
- Frontend pods export the `dynamo_frontend_time_to_first_token_seconds` histogram
- Prometheus Adapter calculates p95 TTFT per `dynamo_namespace`
- HPA monitors this metric filtered by `dynamo_namespace: "default-sglang-agg"`
- When TTFT p95 > 500ms, HPA scales up the `sglang-agg-decode` adapter
- The adapter controller syncs the replica count to the DGD’s `decode` service
- More decode workers are created, reducing TTFT
Example: Scale Based on Queue Depth
Add this rule to your prometheus-adapter-values.yaml (alongside the TTFT rule):
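A sketch of such a rule; `dynamo_frontend_queued_requests` is a placeholder for the actual queue-depth gauge (see the Metrics Reference for the exact name):

```yaml
rules:
  external:
  - seriesQuery: '{__name__="dynamo_frontend_queued_requests"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      as: "dynamo_queued_requests"
    metricsQuery: |
      sum(<<.Series>>{<<.LabelMatchers>>}) by (dynamo_namespace)
```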
Then create the HPA:
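For example, scaling up when the queue grows past roughly 10 requests per replica; `dynamo_queued_requests` is a placeholder for the name your adapter rule exposes, and the DGDSA `apiVersion` is assumed:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1   # assumed; verify with kubectl api-resources
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_queued_requests
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
      target:
        type: AverageValue
        averageValue: "10"
```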
Autoscaling with KEDA (Recommended)
KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
Advantages over HPA + Prometheus Adapter:
- No Prometheus Adapter configuration needed
- PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
- Easy to update - just `kubectl apply` the ScaledObject
- Can scale to zero when idle
- Supports multiple triggers per object
When to use KEDA:
- You want simpler configuration (no Prometheus Adapter to manage)
- You need event-driven scaling (e.g., queue depth, Kafka, etc.)
- You want to scale to zero when idle
Installing KEDA
If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
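A typical Helm-based install:

```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```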
Example: Scale Decode Based on TTFT
Using the sglang-agg DGD from examples/backends/sglang/deploy/agg.yaml:
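A sketch of a TTFT-based ScaledObject. The Prometheus `serverAddress` assumes a typical kube-prometheus-stack install, and the DGDSA `apiVersion` is likewise an assumption:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-ttft
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1   # assumed; verify with kubectl api-resources
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090  # assumed
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le))
      threshold: "0.5"   # scale up when p95 TTFT exceeds 0.5s
```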
Apply it:
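Save the ScaledObject to a file (the filename here is arbitrary) and apply it:

```bash
kubectl apply -f sglang-agg-decode-ttft.yaml
```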
Verify KEDA Scaling
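The resource names below are illustrative; substitute your own ScaledObject name:

```bash
kubectl get scaledobject -n default
# KEDA manages an HPA named keda-hpa-<scaledobject-name> under the hood
kubectl get hpa -n default
kubectl describe scaledobject sglang-agg-decode-ttft -n default
```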
Example: Scale Based on Queue Depth
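A sketch using a queue-depth trigger; `dynamo_frontend_queued_requests` is a placeholder for the actual gauge name (see the Metrics Reference), and `serverAddress` and the DGDSA `apiVersion` are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1   # assumed
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090  # assumed
      query: sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
      threshold: "10"
```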
How KEDA Works
KEDA creates and manages an HPA under the hood:
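You can see the generated HPA (named `keda-hpa-<scaledobject-name>`) directly:

```bash
kubectl get hpa -n default
```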
Mixed Autoscaling
For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
Manual Scaling
With DGDSA Enabled (Default)
When DGDSA is enabled (the default), scale via the adapter:
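For example, to set the decode service to 3 replicas:

```bash
kubectl scale dynamographdeploymentscalingadapter sglang-agg-decode -n default --replicas=3
```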
Verify the scaling:
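For example:

```bash
kubectl get dynamographdeploymentscalingadapter sglang-agg-decode -n default
kubectl get pods -n default   # decode pods should converge to the new count
```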
If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
With DGDSA Disabled
If you’ve disabled the scaling adapter for a service, edit the DGD directly:
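For example:

```bash
kubectl edit dynamographdeployment sglang-agg -n default
# then change the replicas field of the service you want to scale
```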
Or edit the YAML directly (allowed when the scaling adapter is disabled for the service):
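A sketch of the relevant fragment:

```yaml
spec:
  services:
    decode:
      replicas: 3
      # scaling adapter disabled for this service, so direct edits are allowed
```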
Best Practices
1. Choose One Autoscaler Per Service
Avoid configuring multiple autoscalers (e.g., both KEDA and a standalone HPA) for the same service; they will fight over the adapter’s replica count.
2. Use Appropriate Metrics
3. Configure Stabilization Windows
Prevent thrashing with appropriate stabilization:
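For an HPA (KEDA exposes the same settings via `spec.advanced.horizontalPodAutoscalerConfig.behavior`), a conservative sketch with illustrative values:

```yaml
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 50        # remove at most half the replicas per minute
        periodSeconds: 60
```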
4. Set Sensible Min/Max Replicas
Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
- Scaling to zero (unless intentional)
- Unbounded scaling that exhausts cluster resources
Troubleshooting
Adapters Not Created
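If the operator did not create adapters for your DGD, start with (operator namespace and deployment name vary by install):

```bash
# Are any adapters present?
kubectl get dynamographdeploymentscalingadapters -n default

# Check the operator logs for adapter reconciliation errors
kubectl logs -n <operator-namespace> deploy/<operator-deployment> | grep -i scalingadapter
```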
Scaling Not Working
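Useful first checks:

```bash
kubectl describe dynamographdeploymentscalingadapter sglang-agg-decode -n default
kubectl describe hpa -n default
kubectl get events -n default --sort-by=.lastTimestamp
```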
Metrics Not Available
If HPA/KEDA shows `<unknown>` for metrics:
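A few checks, assuming the frontend Service is named `sglang-agg-frontend` and serves metrics on port 8000 (adjust both to your deployment):

```bash
# Is the frontend actually exporting the metric?
kubectl port-forward svc/sglang-agg-frontend 8000:8000 -n default &
curl -s localhost:8000/metrics | grep dynamo_frontend

# Is the external metrics API serving anything?
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
```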
Rapid Scaling Up and Down
If you see unstable scaling:
- Check if multiple autoscalers are targeting the same adapter
- Increase `cooldownPeriod` in the KEDA ScaledObject
- Increase `stabilizationWindowSeconds` in HPA behavior