Request Rejection
This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
Overview
Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents:
- Cascading failures from resource exhaustion
- Degraded latency for all requests
- Out-of-memory conditions on GPU workers
When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later.
Architecture
Configuration
Frontend Arguments
Configure busy thresholds when starting the frontend. --admission-control token-capacity is required to activate the thresholds; the default (none) leaves them disabled.
Dynamic Configuration via API
Thresholds can be adjusted at runtime via the /busy_threshold endpoint:
Set Thresholds
Get Current Thresholds
Response:
Busy Detection Logic
Workers are marked as “busy” based on a dual-threshold system. A worker is considered busy when either threshold is exceeded.
KV Cache Block Threshold
Monitors the percentage of KV cache blocks in use:
Example: With active_decode_blocks_threshold=0.85, a worker using 87% of its KV cache blocks is marked busy.
Prefill Token Threshold
Monitors the number of tokens currently being prefilled:
Example: With active_prefill_tokens_threshold=10000, a worker prefilling 12,000 tokens is marked busy.
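The two checks above can be sketched as a single predicate. This is a minimal illustration rather than Dynamo's implementation: the `RankLoad` structure and its field names are hypothetical, and treating "exceeded" as strictly greater than the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankLoad:
    """Load snapshot for one rank (hypothetical structure for illustration)."""
    active_decode_blocks: int
    total_kv_blocks: int
    active_prefill_tokens: int


def is_rank_busy(
    load: RankLoad,
    active_decode_blocks_threshold: Optional[float] = None,
    active_prefill_tokens_threshold: Optional[int] = None,
) -> bool:
    """Return True when either configured threshold is exceeded."""
    if active_decode_blocks_threshold is not None and load.total_kv_blocks > 0:
        # Fraction of KV cache blocks in use, compared against e.g. 0.85.
        utilization = load.active_decode_blocks / load.total_kv_blocks
        if utilization > active_decode_blocks_threshold:
            return True
    if active_prefill_tokens_threshold is not None:
        # Absolute count of tokens currently being prefilled.
        if load.active_prefill_tokens > active_prefill_tokens_threshold:
            return True
    # An unset threshold never marks the rank busy.
    return False
```

With the example values from this section, `is_rank_busy(RankLoad(87, 100, 0), 0.85, 10000)` and `is_rank_busy(RankLoad(50, 100, 12000), 0.85, 10000)` both report busy.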
Data-Parallel Rank Aggregation
For workers with multiple data-parallel (DP) ranks, the worker is marked busy only if ALL ranks are busy:
This prevents false positives when only some ranks are temporarily loaded.
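The all-ranks rule can be sketched as follows. The function name and the shape of the input (one busy flag per rank, keyed by worker id) are assumptions made for illustration, as is the choice to treat a worker with no reported ranks as not busy.

```python
from typing import Dict, List


def busy_workers(rank_busy: Dict[str, List[bool]]) -> List[str]:
    """Return the ids of workers whose data-parallel ranks are all busy.

    rank_busy maps a worker id to one busy flag per data-parallel rank.
    """
    return sorted(
        worker_id
        for worker_id, flags in rank_busy.items()
        # A worker with no reported ranks is treated as not busy (assumption).
        if flags and all(flags)
    )
```

For example, a worker with rank flags `[True, False]` stays eligible for routing, while one with `[True, True]` is placed on the busy list.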
Worker Load Monitoring
The KvWorkerMonitor runs as a background task that:
- Subscribes to KV cache metrics events from workers
- Maintains load state for each worker instance
- Recalculates busy instances when metrics change
- Updates the router with the current busy list
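The four responsibilities above might look like this in an event-driven sketch. It is not the KvWorkerMonitor itself: the class, the event fields, and the callback are hypothetical, only the KV cache threshold is modeled, and pushing an update only when the busy list changes is an assumption.

```python
from typing import Callable, Dict, List, Tuple


class LoadMonitorSketch:
    """Tracks per-rank KV cache utilization and publishes the busy list."""

    def __init__(
        self,
        blocks_threshold: float,
        on_busy_list: Callable[[List[str]], None],
    ) -> None:
        self.blocks_threshold = blocks_threshold
        self.on_busy_list = on_busy_list  # stands in for the router update
        # (worker_id, dp_rank) -> KV cache utilization in [0, 1]
        self.load: Dict[Tuple[str, int], float] = {}
        self.busy: List[str] = []

    def handle_event(self, worker_id: str, dp_rank: int, kv_utilization: float) -> None:
        """Process one metrics event: store it, recompute, notify on change."""
        self.load[(worker_id, dp_rank)] = kv_utilization
        per_worker: Dict[str, List[bool]] = {}
        for (wid, _rank), util in self.load.items():
            per_worker.setdefault(wid, []).append(util > self.blocks_threshold)
        # A worker is busy only when every one of its ranks is busy.
        busy = sorted(w for w, flags in per_worker.items() if all(flags))
        if busy != self.busy:
            self.busy = busy
            self.on_busy_list(busy)
```

In a real deployment the events would arrive from a subscription running in a background task; here `handle_event` is called directly so the recompute step is easy to follow.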
Metrics Collected
Workers publish these metrics for monitoring:
Rejection Behavior
Request Flow
- Request arrives at frontend
- Push router checks if busy threshold is configured
- If configured, router retrieves list of free (non-busy) instances
- If no free instances exist (but instances are registered):
  - Request is rejected with PipelineError::ServiceOverloaded
  - HTTP 503 response is returned to client
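The flow above can be sketched as a small selection function. The names are hypothetical: `ServiceOverloaded` stands in for the rejection that the frontend surfaces as HTTP 503, and picking the first free instance is a placeholder for whatever routing policy is actually in use.

```python
from typing import List, Optional, Set


class ServiceOverloaded(Exception):
    """Stand-in for the rejection the frontend surfaces as HTTP 503."""


def select_instance(
    instances: List[str],
    busy: Set[str],
    threshold_configured: bool,
) -> Optional[str]:
    """Pick an instance, or raise when every registered instance is busy."""
    if not instances:
        # No registered instances: not a load-shedding case, handled elsewhere.
        return None
    if not threshold_configured:
        # Without a busy threshold, the busy list is never consulted.
        return instances[0]
    free = [i for i in instances if i not in busy]
    if not free:
        raise ServiceOverloaded("all workers are busy")
    return free[0]  # placeholder for the real routing policy
```

Note that rejection requires registered instances: an empty instance list is a different failure from overload and is deliberately not mapped to `ServiceOverloaded` here.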
Error Response
When requests are rejected, clients receive:
Client Retry Strategy
Clients should implement exponential backoff when receiving 503 responses:
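A minimal client-side sketch of exponential backoff with jitter is shown here. The function is transport-agnostic: `send` is a hypothetical callable that performs one request and returns its HTTP status code, and the default attempt count and delays are illustrative values, not recommendations from Dynamo.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-503 status or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        # Delay ceiling doubles on each retry: base, 2*base, 4*base, ...
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        # Full jitter spreads retries out so clients do not retry in lockstep.
        sleep(random.uniform(0, ceiling))
        status = send()
    return status
```

The `sleep` parameter exists so the delay can be replaced in tests. Jitter matters here because many clients rejected at the same moment would otherwise retry at the same moment.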
Monitoring
Prometheus Metrics
Track rejection behavior with these metrics:
dynamo_frontend_model_rejection_total: Counter tracking the total number of requests rejected due to resource exhaustion
- Labels:
  - model: The model name being served
  - endpoint: The API endpoint that received the request (e.g., chat_completions, completions, embeddings)
- This metric is incremented when the router returns a ResourceExhausted error because all workers are busy. The rejected request is surfaced to the client as an HTTP 503 response.
Example metrics output:
Endpoint: Available on the frontend HTTP service at /metrics.
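As a sketch of how the counter could be consumed, the snippet below sums rejections per model from text in the Prometheus exposition format. The metric and label names come from this section; the sample text, its values, and the model name `example-model` are made up for illustration and are not real output.

```python
import re
from collections import defaultdict
from typing import Dict

# Illustrative sample only; real /metrics output will differ.
SAMPLE = '''\
# TYPE dynamo_frontend_model_rejection_total counter
dynamo_frontend_model_rejection_total{model="example-model",endpoint="chat_completions"} 3
dynamo_frontend_model_rejection_total{model="example-model",endpoint="completions"} 1
'''

_LINE = re.compile(
    r'^dynamo_frontend_model_rejection_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def rejections_by_model(metrics_text: str) -> Dict[str, float]:
    """Sum the rejection counter across endpoints for each model label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # skips comments and unrelated metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        totals[labels.get("model", "")] += float(match.group("value"))
    return dict(totals)
```

In practice a Prometheus query over the scraped counter is the more usual way to watch rejection rates; a parser like this is mainly useful for quick checks against the endpoint.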
Tuning Thresholds
Conservative Settings (Latency-Focused)
For applications prioritizing low latency:
- Rejects earlier, before workers become fully loaded
- Maintains lower queue depths
- Better tail latencies
Aggressive Settings (Throughput-Focused)
For applications prioritizing throughput:
- Allows higher worker utilization
- May increase latency variability
- Better overall throughput
Disabled (No Rejection)
To disable request rejection entirely:
Without thresholds configured, all requests are accepted regardless of worker load.
Best Practices
1. Start Conservative, Then Tune
Begin with conservative thresholds and increase based on observed behavior:
2. Monitor Before Enabling
Observe worker load patterns before setting thresholds:
3. Use Both Thresholds for Disaggregated Serving
In disaggregated deployments:
- Use active_prefill_tokens_threshold for prefill workers
- Use active_decode_blocks_threshold for decode workers
4. Coordinate with Autoscaling
If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling:
Worker-Side Request Admission
In addition to the frontend’s metric-driven busy detection above, a worker can enforce a hard concurrency cap directly at its request-plane ingress. This is disabled by default — when neither knob is set, the worker behaves exactly as before (a large pool plus a large overflow queue, no rejection).
Knobs
When --engine-request-limit is set, the worker accepts a request directly into
the engine while a slot is free; once all N engine slots are busy, further
requests go into the small overflow queue of size Q; when the engine and
the queue are both full the worker rejects the request with
Server overloaded: worker at capacity. The frontend maps this rejection to
ResourceExhausted → HTTP 503, and temporarily marks the worker overloaded
so it is skipped on the next routing decision (cleared automatically on the next
metric recompute). The effective hard cap is N + Q in-flight requests per
worker. The overflow channel is sized to Q-1 because the single dispatcher
holds one request in transit between the queue and the engine; this makes the
cap exact for Q ≥ 2 (at Q = 1 the channel floors at 1, so the queued
peak is 2 — hence the Q ≥ 2 requirement).
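The admission rule can be modeled with two counters. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, and it tracks only the N + Q accounting, not the Q-1 channel sizing or the dispatcher's in-transit request described above.

```python
class AdmissionSketch:
    """Models a worker with N engine slots and an overflow queue of size Q."""

    def __init__(self, engine_limit: int, queue_size: int) -> None:
        self.engine_limit = engine_limit  # N
        self.queue_size = queue_size      # Q
        self.in_engine = 0
        self.queued = 0

    def admit(self) -> str:
        """Return where the request went: engine, queued, or rejected."""
        if self.in_engine < self.engine_limit:
            self.in_engine += 1
            return "engine"
        if self.queued < self.queue_size:
            self.queued += 1
            return "queued"
        # Engine and queue are both full: the worker is at capacity.
        return "rejected"

    def complete(self) -> None:
        """One engine request finished; promote a queued request if any."""
        if self.in_engine == 0:
            return
        self.in_engine -= 1
        if self.queued > 0:
            self.queued -= 1
            self.in_engine += 1
```

With N = 2 and Q = 2, the first four calls to `admit()` succeed and the fifth is rejected, which matches the N + Q cap on in-flight requests per worker.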
Metrics
Related Documentation
- Health Checks - Worker health monitoring
- Metrics - Available Prometheus metrics
- Request Migration - Handling failed requests