Graceful Shutdown
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
Overview
Graceful shutdown in Dynamo ensures that:
- **No new requests are accepted** - Endpoints are immediately invalidated
- **In-flight requests complete** - Existing requests finish processing (configurable)
- **Resources are cleaned up** - Engines, connections, and temporary files are released
- **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
Signal Handling
All Dynamo components handle Unix termination signals (for example, `SIGTERM` sent by Kubernetes) to shut down gracefully.
Implementation
Each component registers signal handlers at startup:
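The sketch below illustrates the pattern. The `setup_signal_handlers` helper, the import layout, and the shape of the `runtime` object are illustrative assumptions rather than the exact Dynamo API; only `runtime.shutdown()` comes from the behavior described in this document.

```python
import asyncio
import logging
import signal


def setup_signal_handlers(runtime) -> None:
    """Register SIGINT/SIGTERM handlers that trigger graceful shutdown."""
    loop = asyncio.get_running_loop()

    def _on_signal(sig: signal.Signals) -> None:
        # The handler must not block; schedule the shutdown coroutine instead.
        asyncio.create_task(graceful_shutdown(runtime, sig))

    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, _on_signal, sig)


async def graceful_shutdown(runtime, sig: signal.Signals) -> None:
    # Log the shutdown signal.
    logging.info("Received %s, starting graceful shutdown", sig.name)
    # Invalidate endpoints so no new requests are accepted. Whether in-flight
    # requests are awaited depends on how each endpoint was served (see
    # "Endpoint Draining" below). Returning lets the caller's cleanup proceed.
    runtime.shutdown()
```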
The `graceful_shutdown()` function:
- Logs the shutdown signal
- Calls `runtime.shutdown()` to invalidate endpoints
- Waits for in-flight requests (based on configuration)
- Returns to allow cleanup to proceed
Endpoint Draining
When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter passed when serving the endpoint.
Configuration
When registering an endpoint, the graceful_shutdown parameter controls draining behavior:
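A minimal registration sketch follows. The namespace, component, endpoint, and handler names, and the exact shape of the `serve_endpoint` call, are assumptions for illustration; only the `graceful_shutdown` flag semantics come from the behavior described above.

```python
async def register_endpoint(runtime, handler):
    # Names below are placeholders for your namespace/component/endpoint.
    endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate")

    await endpoint.serve_endpoint(
        handler.generate,
        # True: wait for in-flight requests to finish before shutdown completes.
        # False: return as soon as the endpoint is invalidated.
        graceful_shutdown=True,
    )
```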
Component-Specific Behavior
Decode Worker Migration Integration
Decode workers use conditional draining based on whether request migration is supported, as shown in the sketch after this list.

When `migration_limit > 0`:
- Worker shuts down immediately (`graceful_shutdown=False`)
- In-flight requests are migrated to healthy workers
- No request loss occurs

When `migration_limit <= 0`:
- Worker waits for in-flight requests (`graceful_shutdown=True`)
- Migration is not available
- Requests complete on the shutting-down worker
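The sketch below shows how a decode worker might wire this up, reusing the illustrative registration call from the previous section; `config.migration_limit` is assumed to carry the worker's migration setting.

```python
async def serve_decode_worker(runtime, handler, config):
    # Placeholder namespace/component/endpoint names.
    endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")

    # With migration enabled (migration_limit > 0), skip draining: the worker
    # exits immediately and in-flight requests migrate to healthy workers.
    # Without migration, drain so requests complete on this worker.
    drain_in_flight = config.migration_limit <= 0

    await endpoint.serve_endpoint(
        handler.generate,
        graceful_shutdown=drain_in_flight,
    )
```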
Resource Cleanup
After endpoint draining, components clean up their resources in finally blocks:
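Structurally, that looks like the sketch below, where `serve_endpoints` stands in for whatever registration call the component uses.

```python
async def worker(runtime, handler, config):
    try:
        # Returns once shutdown has been requested and draining has finished.
        await serve_endpoints(runtime, handler, config)
    finally:
        # Always release resources, even if serving failed.
        handler.cleanup()
```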
vLLM Worker Cleanup
The handler’s cleanup() method:
- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources
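An illustrative handler with such a `cleanup()` method might look like the following; the attribute names and the engine-release step are assumptions, and the exact teardown depends on the engine API.

```python
import shutil


class Handler:
    """Illustrative subset of a vLLM worker handler."""

    def __init__(self, engine, temp_dirs):
        self.engine = engine        # engine client
        self.temp_dirs = temp_dirs  # e.g. extracted LoRA adapter directories

    def cleanup(self):
        # Remove temporary directories (LoRA adapters, etc.).
        for path in self.temp_dirs:
            shutil.rmtree(path, ignore_errors=True)
        # Drop the engine reference so its resources can be released;
        # engine-specific shutdown calls would go here.
        self.engine = None
```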
SGLang Worker Cleanup
TensorRT-LLM Worker Cleanup
Error-Initiated Shutdown
Workers can initiate graceful shutdown when fatal errors occur:
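One way to do this is to route fatal errors into the same shutdown path used for signals, as in the sketch below; `engine.run()` is a stand-in for the component's long-running engine task.

```python
import logging


async def run_engine(runtime, engine):
    try:
        await engine.run()  # stand-in for the long-running engine task
    except Exception:
        logging.exception("Fatal engine error; initiating graceful shutdown")
        # Reuse the signal path: invalidate endpoints and begin draining.
        runtime.shutdown()
        raise
```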
Engine Health Monitoring (vLLM)
The `VllmEngineMonitor` continuously checks engine health and initiates shutdown when a health check fails:
Configuration:
- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds maximum for engine shutdown
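A simplified version of such a monitor loop is sketched below. Here `check_health()` is an assumed coroutine that raises when the engine is unhealthy; the real `VllmEngineMonitor` may differ in detail.

```python
import asyncio
import logging

HEALTH_CHECK_INTERVAL = 2.0      # seconds between health checks
ENGINE_SHUTDOWN_TIMEOUT = 30.0   # upper bound on engine teardown after a failure


async def monitor_engine(runtime, engine):
    while True:
        try:
            await engine.check_health()  # assumed health-check coroutine
        except Exception:
            logging.exception("Engine health check failed; shutting down")
            # Stop accepting requests; the engine teardown that follows should
            # be bounded by ENGINE_SHUTDOWN_TIMEOUT.
            runtime.shutdown()
            return
        await asyncio.sleep(HEALTH_CHECK_INTERVAL)
```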
Fatal Error Handling (TensorRT-LLM)
Kubernetes Integration
Pod Termination Flow
- Kubernetes sends `SIGTERM` to the pod
- Dynamo initiates graceful shutdown
- The pod has `terminationGracePeriodSeconds` to complete (default: 30s)
- If the pod has not terminated, Kubernetes sends `SIGKILL`
Recommended Configuration
For production deployments, configure an adequate termination grace period:
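For example, as part of a pod spec (`terminationGracePeriodSeconds` is a standard Kubernetes field; the pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dynamo-worker                    # placeholder name
spec:
  terminationGracePeriodSeconds: 120     # default is 30s; long generations need more
  containers:
    - name: worker
      image: example.com/dynamo-worker:latest   # placeholder image
```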
Health Check Integration
Kubernetes uses health endpoints to determine pod readiness:
- During shutdown: Endpoints become unavailable
- Readiness probe fails: Traffic stops routing to the pod
- Graceful draining: Existing requests complete
Best Practices
1. Set Appropriate Grace Periods
Match `terminationGracePeriodSeconds` to your expected request completion time:
- Short requests (< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
2. Enable Request Migration for Decode Workers
If using disaggregated serving, enable migration for decode workers by configuring a positive `migration_limit`.
This allows immediate shutdown while preserving request state.
3. Monitor Shutdown Metrics
Track shutdown behavior via component logs: when shutdown begins, how long draining takes, and whether cleanup completes without errors.
4. Handle Cleanup Errors
Ensure cleanup methods handle errors gracefully:
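A sketch of that defensive pattern, reusing the illustrative handler attributes (`temp_dirs`, `engine`) from the vLLM cleanup example above:

```python
import logging
import shutil


def _safe(step_name, fn):
    """Run one cleanup step, logging failures instead of propagating them."""
    try:
        fn()
    except Exception:
        logging.exception("Cleanup step %r failed during shutdown", step_name)


def cleanup(handler):
    # Each step is isolated so one failure does not block the others.
    for path in getattr(handler, "temp_dirs", []):
        _safe(f"remove {path}", lambda p=path: shutil.rmtree(p))
    _safe("release engine", lambda: setattr(handler, "engine", None))
```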
Related Documentation
- Request Migration - How requests migrate during shutdown
- Request Cancellation - Canceling in-flight requests
- Health Checks - Liveness and readiness probes