Autoscaling

This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the sglang-agg example from examples/backends/sglang/deploy/agg.yaml.

Example DGD

All examples in this guide use the following DGD:

# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      dynamoNamespace: sglang-agg
      componentType: frontend
      replicas: 1

    decode:
      dynamoNamespace: sglang-agg
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"

Key identifiers:

  • DGD name: sglang-agg
  • Namespace: default
  • Services: Frontend, decode
  • dynamo_namespace label: default-sglang-agg (used for metric filtering)

Overview

Dynamo provides flexible autoscaling through the DynamoGraphDeploymentScalingAdapter (DGDSA) resource. When you deploy a DGD, the operator automatically creates one adapter per service (unless explicitly disabled). These adapters implement the Kubernetes Scale subresource, enabling integration with:

| Autoscaler | Description | Best For |
|---|---|---|
| KEDA | Event-driven autoscaling (recommended) | Most use cases |
| Kubernetes HPA | Native horizontal pod autoscaling | Simple CPU/memory-based scaling |
| Dynamo Planner | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
| Custom Controllers | Any scale-subresource-compatible controller | Custom requirements |

⚠️ Deprecation Notice: The spec.services[X].autoscaling field in DGD is deprecated and ignored. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with autoscaling configured, you’ll see a warning. Remove the field to silence the warning.

Architecture

┌──────────────────────────────────┐ ┌─────────────────────────────────────┐
│ DynamoGraphDeployment │ │ Scaling Adapters (auto-created) │
│ "sglang-agg" │ │ (one per service) │
├──────────────────────────────────┤ ├─────────────────────────────────────┤
│ │ │ │
│ spec.services: │ │ ┌─────────────────────────────┐ │ ┌──────────────────┐
│ │ │ │ sglang-agg-frontend │◄───┼──────│ Autoscalers │
│ ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1 │ │ │ │
│ │ Frontend: 1 replica │ │ │ └─────────────────────────────┘ │ │ • KEDA │
│ └────────────────────────┘ │ │ │ │ • HPA │
│ │ │ ┌─────────────────────────────┐ │ │ • Planner │
│ ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode │◄───┼──────│ • Custom │
│ │ decode: 1 replica │ │ │ │ spec.replicas: 1 │ │ │ │
│ └────────────────────────┘ │ │ └─────────────────────────────┘ │ └──────────────────┘
│ │ │ │
└──────────────────────────────────┘ └─────────────────────────────────────┘

How it works:

  1. You deploy a DGD with services (Frontend, decode)
  2. The operator auto-creates one DGDSA per service
  3. Autoscalers (KEDA, HPA, Planner) target the adapters via /scale subresource
  4. Adapter controller syncs replica changes to the DGD
  5. DGD controller reconciles the underlying pods
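
To see steps 4 and 5 in action, you can scale an adapter and confirm that the change propagates into the DGD. A minimal check against the sglang-agg example (the jsonpath follows the spec layout shown above, and the dgdsa/dgd short names are the ones used throughout this guide):

$# Scale the decode adapter to 2 replicas
$kubectl scale dgdsa sglang-agg-decode -n default --replicas=2
$
$# Confirm the adapter controller synced the count into the DGD
$kubectl get dgd sglang-agg -n default -o jsonpath='{.spec.services.decode.replicas}'
$# Expected: 2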

Viewing Scaling Adapters

After deploying the sglang-agg DGD, verify the auto-created adapters:

$kubectl get dgdsa -n default
$
$# Example output:
$# NAME                  DGD          SERVICE    REPLICAS   AGE
$# sglang-agg-frontend   sglang-agg   Frontend   1          5m
$# sglang-agg-decode     sglang-agg   decode     1          5m

Replica Ownership Model

When DGDSA is enabled (the default), it becomes the source of truth for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.

How It Works

  1. DGDSA owns replicas: Autoscalers (HPA, KEDA, Planner) update the DGDSA’s spec.replicas
  2. DGDSA syncs to DGD: The DGDSA controller writes the replica count to the DGD’s service
  3. Direct DGD edits blocked: A validating webhook prevents users from directly editing spec.services[X].replicas in the DGD
  4. Controllers allowed: Only authorized controllers (operator, Planner) can modify DGD replicas

Manual Scaling with DGDSA Enabled

When DGDSA is enabled, use kubectl scale on the adapter (not the DGD):

$# ✅ Correct - scale via DGDSA
$kubectl scale dgdsa sglang-agg-decode --replicas=3
$
$# ❌ Blocked - direct DGD edit rejected by webhook
$kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
$# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
$# use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead

Enabling DGDSA for a Service

By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2          # ← No DGDSA by default, direct edits allowed

    decode:
      replicas: 1
      scalingAdapter:
        enabled: true      # ← DGDSA created, managed via adapter

When to enable DGDSA:

  • You want to use HPA, KEDA, or Planner for autoscaling
  • You want a clear separation between “desired scale” (adapter) and “deployment config” (DGD)
  • You want protection against accidental direct replica edits

When to keep DGDSA disabled (default):

  • You want simple, manual replica management
  • You don’t need autoscaling for that service
  • You prefer direct DGD edits over adapter-based scaling

Autoscaling with Dynamo Planner

The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.

When to use Planner:

  • You want LLM-optimized autoscaling out of the box
  • You need coordinated scaling across prefill/decode services
  • You want SLA-driven scaling (e.g., target TTFT < 500ms)

How Planner works:

Planner is deployed as a service component within your DGD. It:

  1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
  2. Uses profiling data to predict optimal replica counts
  3. Scales prefill/decode workers to meet SLA targets
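
Once Planner is running, one way to observe its decisions is to watch the scaling adapters it drives; replica counts change as load changes:

$# Watch Planner adjust replica counts through the scaling adapters
$kubectl get dgdsa -n default -w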

Deployment:

The recommended way to deploy Planner is via DynamoGraphDeploymentRequest (DGDR). See the SLA Planner Quick Start for complete instructions.

Example configurations with Planner:

  • examples/backends/vllm/deploy/disagg_planner.yaml
  • examples/backends/sglang/deploy/disagg_planner.yaml
  • examples/backends/trtllm/deploy/disagg_planner.yaml

For more details, see the SLA Planner documentation.

Autoscaling with Kubernetes HPA

The Horizontal Pod Autoscaler (HPA) is Kubernetes’ native autoscaling solution.

When to use HPA:

  • You have simple, predictable scaling requirements
  • You want to use standard Kubernetes tooling
  • You need CPU or memory-based scaling

For custom metrics (like TTFT or queue depth), consider using KEDA instead - it’s simpler to configure.

Basic HPA (CPU-based)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0

HPA with Dynamo Metrics

Dynamo exports several metrics useful for autoscaling. These are available at the /metrics endpoint on each frontend pod.

See also: For a complete list of all Dynamo metrics, see the Metrics Reference. For Prometheus and Grafana setup, see the Prometheus and Grafana Setup Guide.

Available Dynamo Metrics

| Metric | Type | Description | Good for scaling |
|---|---|---|---|
| dynamo_frontend_queued_requests | Gauge | Requests waiting in HTTP queue | ✅ Workers |
| dynamo_frontend_inflight_requests | Gauge | Concurrent requests to engine | ✅ All services |
| dynamo_frontend_time_to_first_token_seconds | Histogram | TTFT latency | ✅ Workers |
| dynamo_frontend_inter_token_latency_seconds | Histogram | ITL latency | ✅ Decode |
| dynamo_frontend_request_duration_seconds | Histogram | Total request duration | ⚠️ General |
| kvstats_gpu_cache_usage_percent | Gauge | GPU KV cache usage (0-1) | ✅ Decode |

Metric Labels

Dynamo metrics include these labels for filtering:

| Label | Description | Example |
|---|---|---|
| dynamo_namespace | Unique DGD identifier ({k8s-namespace}-{dynamoNamespace}) | default-sglang-agg |
| model | Model being served | Qwen/Qwen3-0.6B |

When you have multiple DGDs in the same namespace, use dynamo_namespace to filter metrics for a specific DGD.
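
For example, to check queued requests for only this DGD, you can port-forward Prometheus and query its HTTP API with that label filter (the service name matches the kube-prometheus-stack install used elsewhere in this guide; adjust it for your cluster):

$kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
$
$# Queued requests for the default-sglang-agg DGD only
$curl -G http://localhost:9090/api/v1/query \
>   --data-urlencode 'query=sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})'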

Example: Scale Decode Service Based on TTFT

Using HPA with Prometheus Adapter requires configuring external metrics.

Step 1: Configure Prometheus Adapter

Add this to your Helm values file (e.g., prometheus-adapter-values.yaml):

# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090

rules:
  external:
    # TTFT p95 from frontend - used to scale decode
    - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dynamo_ttft_p95_seconds"
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
          by (le, namespace, dynamo_namespace)
        )

Step 2: Install Prometheus Adapter

$helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$helm repo update
$
$helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
> -n monitoring --create-namespace \
> -f prometheus-adapter-values.yaml

Step 3: Verify the metric is available

$kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/dynamo_ttft_p95_seconds" | jq

Step 4: Create the HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode            # ← DGD name + service name (lowercase)
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_ttft_p95_seconds
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"   # ← {namespace}-{dynamoNamespace}
      target:
        type: Value
        value: "500m"                  # Scale up when TTFT p95 > 500ms
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # Wait 1 min before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
      - type: Pods
        value: 2
        periodSeconds: 30

How it works:

  1. Frontend pods export dynamo_frontend_time_to_first_token_seconds histogram
  2. Prometheus Adapter calculates p95 TTFT per dynamo_namespace
  3. HPA monitors this metric filtered by dynamo_namespace: "default-sglang-agg"
  4. When TTFT p95 > 500ms, HPA scales up the sglang-agg-decode adapter
  5. Adapter controller syncs the replica count to the DGD’s decode service
  6. More decode workers are created, reducing TTFT
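
To watch this loop in practice, you can follow the HPA's observed value and the adapter's replica count side by side (an optional check, using the names defined above):

$# Watch the HPA's observed TTFT and desired replicas
$kubectl get hpa sglang-agg-decode-hpa -n default -w
$
$# In another terminal, watch the adapter follow along
$kubectl get dgdsa sglang-agg-decode -n default -w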

Example: Scale Based on Queue Depth

Add this rule to your prometheus-adapter-values.yaml (alongside the TTFT rule):

# Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_queued_requests{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "dynamo_queued_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)

Then create the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: dynamo_queued_requests
        selector:
          matchLabels:
            dynamo_namespace: "default-sglang-agg"
      target:
        type: Value
        value: "10"   # Scale up when queue > 10 requests

Autoscaling with KEDA

KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.

Advantages over HPA + Prometheus Adapter:

  • No Prometheus Adapter configuration needed
  • PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
  • Easy to update - just kubectl apply the ScaledObject
  • Can scale to zero when idle
  • Supports multiple triggers per object

When to use KEDA:

  • You want simpler configuration (no Prometheus Adapter to manage)
  • You need event-driven scaling (e.g., queue depth, Kafka, etc.)
  • You want to scale to zero when idle

Installing KEDA

$# Add KEDA Helm repo
$helm repo add kedacore https://kedacore.github.io/charts
$helm repo update
$
$# Install KEDA
$helm install keda kedacore/keda \
> --namespace keda \
> --create-namespace
$
$# Verify installation
$kubectl get pods -n keda

If you have Prometheus Adapter installed, either uninstall it first (helm uninstall prometheus-adapter -n monitoring) or install KEDA with --set metricsServer.enabled=false to avoid API conflicts.
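
Only one backend can register the external.metrics.k8s.io API group at a time, which is why the two conflict. A quick way to check which backend currently owns it:

$kubectl get apiservice v1beta1.external.metrics.k8s.io \
>   -o jsonpath='{.spec.service.namespace}/{.spec.service.name}'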

Example: Scale Decode Based on TTFT

Using the sglang-agg DGD from examples/backends/sglang/deploy/agg.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15    # Check metrics every 15 seconds
  cooldownPeriod: 60     # Wait 60s before scaling down
  triggers:
  - type: prometheus
    metadata:
      # Update this URL to match your Prometheus service
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_ttft_p95
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5"             # Scale up when TTFT p95 > 500ms (0.5 seconds)
      activationThreshold: "0.1"   # Start scaling when TTFT > 100ms

Apply it:

$kubectl apply -f sglang-agg-decode-scaler.yaml

Verify KEDA Scaling

$# Check ScaledObject status
$kubectl get scaledobject -n default
$
$# KEDA creates an HPA under the hood - you can see it
$kubectl get hpa -n default
$
$# Example output:
$# NAME                                REFERENCE                                                TARGETS    MINPODS   MAXPODS   REPLICAS
$# keda-hpa-sglang-agg-decode-scaler   DynamoGraphDeploymentScalingAdapter/sglang-agg-decode   45m/500m   1         10        1
$
$# Get detailed status
$kubectl describe scaledobject sglang-agg-decode-scaler -n default

Example: Scale Based on Queue Depth

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: dynamo_queued_requests
      query: |
        sum(dynamo_frontend_queued_requests{dynamo_namespace="default-sglang-agg"})
      threshold: "10"   # Scale up when queue > 10 requests

How KEDA Works

KEDA creates and manages an HPA under the hood:

┌──────────────────────────────────────────────────────────────────────┐
│ You create: ScaledObject │
│ - scaleTargetRef: sglang-agg-decode │
│ - triggers: prometheus query │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ KEDA Operator automatically creates: HPA │
│ - name: keda-hpa-sglang-agg-decode-scaler │
│ - scaleTargetRef: sglang-agg-decode │
│ - metrics: External (from KEDA metrics server) │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeploymentScalingAdapter: sglang-agg-decode │
│ - spec.replicas: updated by HPA │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeployment: sglang-agg │
│ - spec.services.decode.replicas: synced from adapter │
└──────────────────────────────────────────────────────────────────────┘

Mixed Autoscaling

For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:

---
# HPA for Frontend (CPU-based)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

---
# KEDA for Decode (TTFT-based)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      query: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
          by (le)
        )
      threshold: "0.5"

Manual Scaling

With DGDSA Enabled (Default)

When DGDSA is enabled (the default), scale via the adapter:

$kubectl scale dgdsa sglang-agg-decode -n default --replicas=3

Verify the scaling:

$kubectl get dgdsa sglang-agg-decode -n default
$
$# Output:
$# NAME                DGD          SERVICE   REPLICAS   AGE
$# sglang-agg-decode   sglang-agg   decode    3          10m

If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
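
If you need to hold a manual replica count while a KEDA ScaledObject is managing the adapter, KEDA's pause annotation can temporarily stop reconciliation (supported in recent KEDA versions; verify against the version you installed):

$# Pause autoscaling and pin the adapter at 3 replicas
$kubectl annotate scaledobject sglang-agg-decode-scaler -n default \
>   autoscaling.keda.sh/paused-replicas="3"
$
$# Resume autoscaling later by removing the annotation
$kubectl annotate scaledobject sglang-agg-decode-scaler -n default \
>   autoscaling.keda.sh/paused-replicas-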

With DGDSA Disabled

If you’ve disabled the scaling adapter for a service, edit the DGD directly:

$kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'

Or edit the YAML directly; without scalingAdapter.enabled: true, direct replica edits are allowed:

spec:
  services:
    decode:
      replicas: 3
      # No scalingAdapter.enabled means replicas can be edited directly

Best Practices

1. Choose One Autoscaler Per Service

Avoid configuring multiple autoscalers for the same service:

| Configuration | Status |
|---|---|
| HPA for frontend, Planner for prefill/decode | ✅ Good |
| KEDA for all services | ✅ Good |
| Planner only (default) | ✅ Good |
| HPA + Planner both targeting decode | ❌ Bad - they will fight |
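
A quick way to spot conflicts is to list every autoscaler and the adapter it targets (a sketch using custom columns; KEDA-created HPAs appear with the keda-hpa- prefix):

$# HPAs and their scale targets
$kubectl get hpa -n default \
>   -o custom-columns=NAME:.metadata.name,TARGET:.spec.scaleTargetRef.name
$
$# KEDA ScaledObjects and their scale targets
$kubectl get scaledobject -n default \
>   -o custom-columns=NAME:.metadata.name,TARGET:.spec.scaleTargetRef.name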

2. Use Appropriate Metrics

| Service Type | Recommended Metrics | Dynamo Metric |
|---|---|---|
| Frontend | CPU utilization, request rate | dynamo_frontend_requests_total |
| Prefill | Queue depth, TTFT | dynamo_frontend_queued_requests, dynamo_frontend_time_to_first_token_seconds |
| Decode | KV cache utilization, ITL | kvstats_gpu_cache_usage_percent, dynamo_frontend_inter_token_latency_seconds |

3. Configure Stabilization Windows

Prevent thrashing with appropriate stabilization:

# HPA
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
  scaleUp:
    stabilizationWindowSeconds: 0     # Scale up immediately

# KEDA
spec:
  cooldownPeriod: 300

4. Set Sensible Min/Max Replicas

Always configure minimum and maximum replicas in your HPA/KEDA to prevent:

  • Scaling to zero (unless intentional)
  • Unbounded scaling that exhausts cluster resources

Troubleshooting

Adapters Not Created

$# Check DGD status
$kubectl describe dgd sglang-agg -n default
$
$# Check operator logs
$kubectl logs -n dynamo-system deployment/dynamo-operator

Scaling Not Working

$# Check adapter status
$kubectl describe dgdsa sglang-agg-decode -n default
$
$# Check HPA/KEDA status
$kubectl describe hpa sglang-agg-decode-hpa -n default
$kubectl describe scaledobject sglang-agg-decode-scaler -n default
$
$# Verify metrics are available in Kubernetes metrics API
$kubectl get --raw /apis/external.metrics.k8s.io/v1beta1

Metrics Not Available

If HPA/KEDA shows <unknown> for metrics:

$# Check if Dynamo metrics are being scraped
$kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
$curl http://localhost:8000/metrics | grep dynamo_frontend
$
$# Example output:
$# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2
$# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5
$
$# Verify Prometheus is scraping the metrics
$kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
$# Then query: dynamo_frontend_time_to_first_token_seconds_bucket
$
$# Check KEDA operator logs
$kubectl logs -n keda deployment/keda-operator

Rapid Scaling Up and Down

If you see unstable scaling:

  1. Check if multiple autoscalers are targeting the same adapter
  2. Increase cooldownPeriod in KEDA ScaledObject
  3. Increase stabilizationWindowSeconds in HPA behavior

References