Graceful Shutdown
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
Overview
Graceful shutdown in Dynamo ensures that:
- Routing stops quickly - Endpoints are unregistered from discovery first
- In-flight requests can finish - Workers keep serving during a short grace period
- Endpoints drain - After the grace period, endpoints are invalidated and optionally wait for in-flight work
- Resources are cleaned up - Engines, connections, and temporary files are released
- Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior
Signal Handling
All Dynamo components handle Unix signals such as SIGTERM for graceful shutdown.
Implementation
Each component registers signal handlers at startup:
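The registration can be sketched as follows. This is an illustrative asyncio-based sketch, not Dynamo's exact API: the `graceful_shutdown` coroutine body is a stand-in, and the sketch delivers SIGTERM to itself to show the flow end to end.

```python
import asyncio
import os
import signal

async def graceful_shutdown(runtime):
    # Stand-in for Dynamo's shutdown sequence (hypothetical).
    runtime["shutdown"] = True

async def main(runtime):
    loop = asyncio.get_running_loop()
    shutdown_event = asyncio.Event()
    # Register handlers for SIGTERM (sent by Kubernetes) and SIGINT (Ctrl-C).
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, shutdown_event.set)
    # Simulate Kubernetes delivering SIGTERM to this process.
    loop.call_later(0.05, os.kill, os.getpid(), signal.SIGTERM)
    await shutdown_event.wait()
    await graceful_shutdown(runtime)

runtime = {"shutdown": False}
asyncio.run(main(runtime))
```

Registering the handler on the event loop (rather than via `signal.signal`) keeps signal handling cooperative with the running asyncio tasks.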
The graceful_shutdown() function:
- Logs the shutdown signal
- Unregisters all endpoints from discovery
- Waits for a configurable grace period (DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s)
- Calls runtime.shutdown() to invalidate endpoints and stop accepting new requests
- Waits for in-flight requests (based on graceful_shutdown per endpoint)
- Returns to allow cleanup to proceed
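The steps above can be sketched as a coroutine. The `runtime` method names here are illustrative stand-ins for Dynamo's actual API; only the environment variable name comes from the documentation.

```python
import asyncio
import logging
import os

GRACE_PERIOD = float(os.environ.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", "5"))

async def graceful_shutdown(runtime, grace_period=GRACE_PERIOD):
    # 1. Log the shutdown signal.
    logging.info("shutdown signal received; unregistering endpoints")
    # 2. Unregister endpoints from discovery so routing stops quickly
    #    (hypothetical method name).
    await runtime.unregister_endpoints()
    # 3. Keep serving in-flight requests during the grace period.
    await asyncio.sleep(grace_period)
    # 4. Invalidate endpoints and stop accepting new requests; this also
    #    waits for in-flight work on endpoints served with graceful_shutdown=True.
    await runtime.shutdown()
```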
Endpoint Draining
After the grace period, runtime.shutdown() invalidates endpoints so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.
Configuration
When registering an endpoint, the graceful_shutdown parameter controls draining behavior:
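The two draining behaviors can be modeled with a toy class. This is not Dynamo's actual endpoint type; it only illustrates the semantics of the flag.

```python
import asyncio

class Endpoint:
    """Toy model of endpoint draining semantics; not Dynamo's actual API."""
    def __init__(self, graceful_shutdown: bool):
        self.graceful_shutdown = graceful_shutdown
        self.in_flight = []  # tasks for requests currently being served

    async def drain(self):
        # Once the endpoint is invalidated, new requests are rejected.
        # Whether we wait for in-flight work depends on the flag:
        if self.graceful_shutdown:
            await asyncio.gather(*self.in_flight)   # wait for completion
        else:
            for task in self.in_flight:             # cut streams short
                task.cancel()
```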
Component-Specific Behavior
Migration Integration
Backend workers always use graceful_shutdown=True, meaning they wait for in-flight requests to complete before the engine is stopped. Request migration is configured at the frontend level via --migration-limit:
- When migration is enabled at the frontend, disconnected streams from failed workers are automatically retried on healthy workers
- Workers don’t need to know about migration configuration - they simply complete their work or signal incomplete streams
- See Request Migration Architecture for details on how migration works
Resource Cleanup
After endpoint draining, components clean up their resources in finally blocks:
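The pattern looks like the following sketch (the `runtime` and `handler` method names are illustrative):

```python
import asyncio

async def serve(runtime, handler):
    try:
        await runtime.serve_until_shutdown()
    finally:
        # Runs whether shutdown was graceful or triggered by an error,
        # so engines, connections, and temp files are always released.
        handler.cleanup()
```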
vLLM Worker Cleanup
The handler’s cleanup() method:
- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources
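A minimal sketch of those two steps, assuming a handler that owns a scratch directory and an engine reference (class and attribute names are illustrative, not vLLM's or Dynamo's exact API):

```python
import shutil
import tempfile
from pathlib import Path

class VllmHandler:
    """Sketch of a vLLM worker handler's cleanup; details are illustrative."""
    def __init__(self):
        # Scratch space, e.g. for downloaded LoRA adapters.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="lora-"))
        self.engine = object()  # placeholder for the engine handle

    def cleanup(self):
        # Remove temporary directories (LoRA adapters, etc.).
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        # Release engine resources by dropping the reference.
        self.engine = None
```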
SGLang Worker Cleanup
TensorRT-LLM Worker Cleanup
Error-Initiated Shutdown
Workers can initiate graceful shutdown when fatal errors occur:
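One common way to do this, shown here as a sketch rather than Dynamo's exact mechanism, is to reuse the signal path: on a fatal error the worker sends SIGTERM to its own process, so the same handler drains endpoints and cleans up.

```python
import logging
import os
import signal

def on_fatal_error(exc: BaseException) -> None:
    # Reuse the normal shutdown path: deliver SIGTERM to our own process
    # so the registered signal handler drains endpoints and cleans up.
    logging.error("fatal error: %s; initiating graceful shutdown", exc)
    os.kill(os.getpid(), signal.SIGTERM)
```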
Engine Health Monitoring (vLLM)
The VllmEngineMonitor continuously checks engine health:
Configuration:
- HEALTH_CHECK_INTERVAL: 2 seconds between checks
- ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown
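The monitoring loop might look like the following sketch. The constants match the configuration above; the engine and shutdown interfaces are illustrative stand-ins for the actual VllmEngineMonitor implementation.

```python
import asyncio

HEALTH_CHECK_INTERVAL = 2.0     # seconds between health checks
ENGINE_SHUTDOWN_TIMEOUT = 30.0  # max seconds to wait for engine shutdown

async def monitor(engine, shutdown, interval=HEALTH_CHECK_INTERVAL):
    # Poll engine health until a check fails.
    while await engine.check_health():
        await asyncio.sleep(interval)
    # Health check failed: stop the engine, bounded by the shutdown timeout.
    await asyncio.wait_for(shutdown(), timeout=ENGINE_SHUTDOWN_TIMEOUT)
```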
Fatal Error Handling (TensorRT-LLM)
Kubernetes Integration
Pod Termination Flow
- Kubernetes sends SIGTERM to the pod
- Dynamo initiates graceful shutdown
- Pod has terminationGracePeriodSeconds to complete (default: 30s)
- If the pod has not terminated in time, Kubernetes sends SIGKILL
Recommended Configuration
For production deployments, configure adequate termination grace period:
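For example, a pod spec fragment might raise the grace period well above the 30s default (the container name and image below are placeholders):

```yaml
apiVersion: v1
kind: Pod
spec:
  # Allow time for: shutdown grace period + longest expected request + cleanup.
  terminationGracePeriodSeconds: 120
  containers:
    - name: worker
      image: my-dynamo-worker:latest  # placeholder image
```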
Health Check Integration
Kubernetes uses health endpoints to determine pod readiness:
- During shutdown: Endpoints become unavailable
- Readiness probe fails: Traffic stops routing to the pod
- Graceful draining: Existing requests complete
Best Practices
1. Set Appropriate Grace Periods
Match terminationGracePeriodSeconds to your expected request completion time:
- Short requests (< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
2. Enable Request Migration
Enable migration at the frontend to allow request recovery when workers shut down:
This allows the frontend to automatically retry disconnected streams on healthy workers.
3. Monitor Shutdown Metrics
Track shutdown behavior via logs:
4. Handle Cleanup Errors
Ensure cleanup methods handle errors gracefully:
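For example, each cleanup step can be wrapped so that one failure does not prevent the rest (the handler shape here is illustrative):

```python
import logging
import shutil

class Handler:
    def __init__(self, temp_dir):
        self.temp_dir = temp_dir
        self.engine = object()  # placeholder for the engine handle

    def cleanup(self):
        # Wrap risky steps so one failure doesn't skip the rest.
        try:
            shutil.rmtree(self.temp_dir)
        except OSError as exc:
            logging.warning("failed to remove %s: %s", self.temp_dir, exc)
        finally:
            self.engine = None  # drop the engine reference regardless
```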
Related Documentation
- Request Migration - How requests migrate during shutdown
- Request Cancellation - Canceling in-flight requests
- Health Checks - Liveness and readiness probes