Graceful Shutdown


This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

Overview

Graceful shutdown in Dynamo ensures that:

  1. Routing stops quickly - Endpoints are unregistered from discovery first
  2. In-flight requests can finish - Workers keep serving during a short grace period
  3. Endpoints drain - After the grace period, endpoints are invalidated and optionally wait for in-flight work
  4. Resources are cleaned up - Engines, connections, and temporary files are released
  5. Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior

Signal Handling

All Dynamo components handle Unix signals for graceful shutdown:

| Signal | Trigger | Behavior |
| --- | --- | --- |
| SIGTERM | Kubernetes pod termination | Graceful shutdown initiated |
| SIGINT | Ctrl+C / manual interrupt | Graceful shutdown initiated |

Implementation

Each component registers signal handlers at startup:

```python
def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime, endpoints))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)
```

The graceful_shutdown() function:

  1. Logs the shutdown signal
  2. Unregisters all endpoints from discovery
  3. Waits for a configurable grace period (DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS, default 5s)
  4. Calls runtime.shutdown() to invalidate endpoints and stop accepting new requests
  5. Waits for in-flight requests (based on graceful_shutdown per endpoint)
  6. Returns to allow cleanup to proceed
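
The sequence above can be sketched as follows. This is a simplified illustration, not Dynamo's actual implementation; the endpoint and runtime methods (`unregister()`, `wait_drained()`) are hypothetical stand-ins for the real APIs.

```python
import asyncio
import logging
import os

async def graceful_shutdown(runtime, endpoints):
    # 1. Log the shutdown signal
    logging.info("Received shutdown signal, shutting down DistributedRuntime")
    # 2. Unregister endpoints from discovery so routers stop sending new work
    for ep in endpoints:
        ep.unregister()
    # 3. Grace period: in-flight requests keep being served
    grace = float(os.environ.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", "5"))
    await asyncio.sleep(grace)
    # 4. Invalidate endpoints; no new requests are accepted from here on
    runtime.shutdown()
    # 5. Wait for in-flight requests on endpoints that opted into draining
    await asyncio.gather(
        *(ep.wait_drained() for ep in endpoints if ep.graceful_shutdown)
    )
    # 6. Return; resource cleanup proceeds in the caller's finally block
```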

Endpoint Draining

After the grace period, runtime.shutdown() invalidates endpoints so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.

Configuration

When registering an endpoint, the graceful_shutdown parameter controls draining behavior:

```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)
```
| graceful_shutdown | Behavior |
| --- | --- |
| True | Wait for all in-flight requests to complete before returning |
| False | Return immediately without waiting for requests |

Component-Specific Behavior

| Component | Default Behavior | Rationale |
| --- | --- | --- |
| Frontend | N/A (HTTP server) | HTTP server handles its own shutdown |
| Prefill Workers | graceful_shutdown=True | Prefill operations must complete to avoid wasted computation |
| Decode Workers | graceful_shutdown=True | Decode operations should complete to avoid wasted computation |
| Router | graceful_shutdown=True | Ensure routing decisions complete |

Migration Integration

Backend workers always use graceful_shutdown=True, meaning they wait for in-flight requests to complete before the engine is stopped. Request migration is configured at the frontend level via --migration-limit:

  • When migration is enabled at the frontend, disconnected streams from failed workers are automatically retried on healthy workers
  • Workers don’t need to know about migration configuration - they simply complete their work or signal incomplete streams
  • See Request Migration Architecture for details on how migration works

Resource Cleanup

After endpoint draining, components clean up their resources in finally blocks:

vLLM Worker Cleanup

```python
finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()
```

The handler’s cleanup() method:

  • Removes temporary directories (LoRA adapters, etc.)
  • Releases engine resources

SGLang Worker Cleanup

```python
def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()
```

TensorRT-LLM Worker Cleanup

```python
async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None
```

Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

Engine Health Monitoring (vLLM)

The VllmEngineMonitor continuously checks engine health:

```python
async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)
```

Configuration:

  • HEALTH_CHECK_INTERVAL: 2 seconds between checks
  • ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown

Fatal Error Handling (TensorRT-LLM)

```python
async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)
```

Kubernetes Integration

Pod Termination Flow

  1. Kubernetes sends SIGTERM to the pod
  2. Dynamo initiates graceful shutdown
  3. Pod has terminationGracePeriodSeconds to complete (default: 30s)
  4. If not terminated, Kubernetes sends SIGKILL

For production deployments, configure adequate termination grace period:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining
```

Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

  • During shutdown: Endpoints become unavailable
  • Readiness probe fails: Traffic stops routing to the pod
  • Graceful draining: Existing requests complete

Best Practices

1. Set Appropriate Grace Periods

Match terminationGracePeriodSeconds to your expected request completion time:

  • Short requests (< 10s): 30s grace period
  • Long generation (> 30s): 120s+ grace period

2. Enable Request Migration

Enable migration at the frontend to allow request recovery when workers shut down:

```shell
python3 -m dynamo.frontend ... --migration-limit 3  # Allow up to 3 migration attempts
```

This allows the frontend to automatically retry disconnected streams on healthy workers.

3. Monitor Shutdown Metrics

Track shutdown behavior via logs:

```
INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```

4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

```python
def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources
```