Graceful Shutdown

This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

Overview

Graceful shutdown in Dynamo ensures that:

  1. No new requests are accepted - Endpoints are immediately invalidated
  2. In-flight requests complete - Existing requests finish processing (configurable)
  3. Resources are cleaned up - Engines, connections, and temporary files are released
  4. Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior

Signal Handling

All Dynamo components handle Unix signals for graceful shutdown:

Signal | Trigger | Behavior
------ | ------- | --------
SIGTERM | Kubernetes pod termination | Graceful shutdown initiated
SIGINT | Ctrl+C / manual interrupt | Graceful shutdown initiated

Implementation

Each component registers signal handlers at startup:

def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)

The graceful_shutdown() function:

  1. Logs the shutdown signal
  2. Calls runtime.shutdown() to invalidate endpoints
  3. Waits for in-flight requests (based on configuration)
  4. Returns to allow cleanup to proceed
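
A minimal sketch of such a coroutine, assuming a runtime object with a synchronous shutdown() method as in the handler snippet above (names and log text are illustrative, not the exact Dynamo implementation):

async def graceful_shutdown(runtime):
    # Log the shutdown signal so operators can see why the process is exiting
    logger.info("Received shutdown signal, shutting down DistributedRuntime")
    # Invalidate endpoints; in-flight requests are then drained according to
    # each endpoint's graceful_shutdown setting before cleanup proceeds
    runtime.shutdown()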

Endpoint Draining

When runtime.shutdown() is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.

Configuration

When registering an endpoint, the graceful_shutdown parameter controls draining behavior:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)

graceful_shutdown | Behavior
----------------- | --------
True | Wait for all in-flight requests to complete before returning
False | Return immediately without waiting for requests

Component-Specific Behavior

Component | Default Behavior | Rationale
--------- | ---------------- | ---------
Frontend | N/A (HTTP server) | HTTP server handles its own shutdown
Prefill Workers | graceful_shutdown=True | Prefill operations must complete to avoid wasted computation
Decode Workers | Conditional | If migration is enabled (migration_limit > 0), shut down immediately to allow migration; otherwise wait
Router | graceful_shutdown=True | Ensure routing decisions complete

Decode Worker Migration Integration

Decode workers use conditional draining based on whether request migration is supported:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
    ...
)

When migration_limit > 0:

  • Worker shuts down immediately (graceful_shutdown=False)
  • In-flight requests are migrated to healthy workers
  • No request loss occurs

When migration_limit <= 0:

  • Worker waits for in-flight requests (graceful_shutdown=True)
  • Migration is not available
  • Requests complete on the shutting-down worker

Resource Cleanup

After endpoint draining, components clean up their resources in finally blocks:

vLLM Worker Cleanup

finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()

The handler’s cleanup() method:

  • Removes temporary directories (LoRA adapters, etc.)
  • Releases engine resources
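
As a rough illustration of what such a method does (the attribute names here are hypothetical, not the actual handler fields):

import shutil

def cleanup(self):
    # Remove temporary directories (e.g. extracted LoRA adapters)
    for tmp_dir in self._temp_dirs:  # hypothetical list of temp dirs created at startup
        shutil.rmtree(tmp_dir, ignore_errors=True)
    self._temp_dirs.clear()

    # Release engine resources so GPU memory is freed before the process exits
    if self.engine_client is not None:  # hypothetical engine handle
        self.engine_client.shutdown()
        self.engine_client = None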

SGLang Worker Cleanup

def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()

TensorRT-LLM Worker Cleanup

async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None

Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

Engine Health Monitoring (vLLM)

The VllmEngineMonitor continuously checks engine health:

async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)

Configuration:

  • HEALTH_CHECK_INTERVAL: 2 seconds between checks
  • ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown
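
The shutdown timeout is simply an upper bound on how long engine teardown may take. A sketch of that pattern (not the actual monitor code; shutdown_coro stands in for whatever awaitable stops the engine):

import asyncio

ENGINE_SHUTDOWN_TIMEOUT = 30  # seconds

async def shutdown_with_timeout(shutdown_coro):
    # Bound engine shutdown so a hung engine cannot stall the process exit
    try:
        await asyncio.wait_for(shutdown_coro, timeout=ENGINE_SHUTDOWN_TIMEOUT)
    except asyncio.TimeoutError:
        logger.warning("Engine did not shut down within %ss", ENGINE_SHUTDOWN_TIMEOUT)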

Fatal Error Handling (TensorRT-LLM)

async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)

Kubernetes Integration

Pod Termination Flow

  1. Kubernetes sends SIGTERM to the pod
  2. Dynamo initiates graceful shutdown
  3. Pod has terminationGracePeriodSeconds to complete (default: 30s)
  4. If the pod has not exited by then, Kubernetes sends SIGKILL

For production deployments, configure adequate termination grace period:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining

Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

  • During shutdown: Endpoints become unavailable
  • Readiness probe fails: Traffic stops routing to the pod
  • Graceful draining: Existing requests complete

Best Practices

1. Set Appropriate Grace Periods

Match terminationGracePeriodSeconds to your expected request completion time:

  • Short requests (< 10s): 30s grace period
  • Long generation (> 30s): 120s+ grace period

2. Enable Request Migration for Decode Workers

If using disaggregated serving, enable migration for decode workers:

--migration-limit 3  # Allow up to 3 migration attempts

This allows immediate shutdown while preserving request state.
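
For illustration, this is how such a flag can feed the conditional draining shown earlier (a hypothetical argparse sketch, not the actual worker CLI):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--migration-limit", type=int, default=0,
                    help="Maximum number of migration attempts per request")
config = parser.parse_args()

# With migration available (limit > 0), shut down immediately and let requests migrate;
# otherwise wait for in-flight requests to finish on this worker.
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,
)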

3. Monitor Shutdown Metrics

Track shutdown behavior via logs:

INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker

4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources