Graceful Shutdown


This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

Overview

Graceful shutdown in Dynamo ensures that:

  1. Routing stops quickly - Endpoints are unregistered from discovery first
  2. In-flight requests can finish - Workers keep serving during a short grace period
  3. Endpoints drain - After the grace period, endpoints are invalidated and optionally wait for in-flight work
  4. Resources are cleaned up - Engines, connections, and temporary files are released
  5. Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior

Signal Handling

All Dynamo components handle Unix signals for graceful shutdown:

| Signal | Trigger | Behavior |
| --- | --- | --- |
| SIGTERM | Kubernetes pod termination | Graceful shutdown initiated |
| SIGINT | Ctrl+C / manual interrupt | Graceful shutdown initiated |

Implementation

Each component registers signal handlers at startup:

```python
def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime, endpoints))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)
```

The graceful_shutdown() function:

  1. Logs the shutdown signal
  2. Unregisters all endpoints from discovery
  3. Waits for a configurable grace period (DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s)
  4. Calls runtime.shutdown() to invalidate endpoints and stop accepting new requests
  5. Waits for in-flight requests (based on graceful_shutdown per endpoint)
  6. Returns to allow cleanup to proceed
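
The sequence above can be sketched as follows. This is a simplified illustration, not Dynamo's actual implementation; the endpoint and runtime methods (`unregister()`, `wait_drained()`) are hypothetical stand-ins for the real APIs.

```python
import asyncio
import logging
import os

async def graceful_shutdown(runtime, endpoints):
    # 1. Log the shutdown signal
    logging.info("Received shutdown signal, shutting down DistributedRuntime")
    # 2. Unregister endpoints from discovery so routers stop sending new work
    for ep in endpoints:
        ep.unregister()
    # 3. Grace period: in-flight requests keep being served
    grace = float(os.environ.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", "5"))
    await asyncio.sleep(grace)
    # 4. Invalidate endpoints; no new requests are accepted from here on
    runtime.shutdown()
    # 5. Wait for in-flight requests on endpoints that opted into draining
    await asyncio.gather(
        *(ep.wait_drained() for ep in endpoints if ep.graceful_shutdown)
    )
    # 6. Return; resource cleanup proceeds in the caller's finally block
```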

Endpoint Draining

After the grace period, runtime.shutdown() invalidates endpoints so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.

Configuration

When registering an endpoint, the graceful_shutdown parameter controls draining behavior:

```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)
```
| graceful_shutdown | Behavior |
| --- | --- |
| True | Wait for all in-flight requests to complete before returning |
| False | Return immediately without waiting for requests |

Component-Specific Behavior

| Component | Default Behavior | Rationale |
| --- | --- | --- |
| Frontend | N/A (HTTP server) | HTTP server handles its own shutdown |
| Prefill Workers | graceful_shutdown=True | Prefill operations must complete to avoid wasted computation |
| Decode Workers | graceful_shutdown=True | Decode operations should complete to avoid wasted computation |
| Router | graceful_shutdown=True | Ensure routing decisions complete |

Migration Integration

Backend workers always use graceful_shutdown=True, meaning they wait for in-flight requests to complete before the engine is stopped. Request migration is configured at the frontend level via --migration-limit:

  • When migration is enabled at the frontend, disconnected streams from failed workers are automatically retried on healthy workers
  • Workers don’t need to know about migration configuration - they simply complete their work or signal incomplete streams
  • See Request Migration Architecture for details on how migration works

Resource Cleanup

After endpoint draining, components clean up their resources in finally blocks:

vLLM Worker Cleanup

```python
finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()
```

The handler’s cleanup() method:

  • Removes temporary directories (LoRA adapters, etc.)
  • Releases engine resources

SGLang Worker Cleanup

```python
def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()
```

TensorRT-LLM Worker Cleanup

```python
async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None
```

Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

Engine Health Monitoring (vLLM)

The VllmEngineMonitor continuously checks engine health:

```python
async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)
```

Configuration:

  • HEALTH_CHECK_INTERVAL: 2 seconds between checks
  • ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown

Fatal Error Handling (TensorRT-LLM)

```python
async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)
```

Kubernetes Integration

Pod Termination Flow

  1. Kubernetes sends SIGTERM to the pod
  2. Dynamo initiates graceful shutdown
  3. Pod has terminationGracePeriodSeconds to complete (default: 30s)
  4. If not terminated, Kubernetes sends SIGKILL

For production deployments, configure adequate termination grace period:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining
```

Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

  • During shutdown: Endpoints become unavailable
  • Readiness probe fails: Traffic stops routing to the pod
  • Graceful draining: Existing requests complete

Best Practices

1. Set Appropriate Grace Periods

Match terminationGracePeriodSeconds to your expected request completion time:

  • Short requests (< 10s): 30s grace period
  • Long generation (> 30s): 120s+ grace period

2. Enable Request Migration

Enable migration at the frontend to allow request recovery when workers shut down:

```shell
python3 -m dynamo.frontend ... --migration-limit 3  # Allow up to 3 migration attempts
```

This allows the frontend to automatically retry disconnected streams on healthy workers.

3. Monitor Shutdown Metrics

Track shutdown behavior via logs:

```
INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```

4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

```python
def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources
```