Graceful Shutdown

This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

Overview

Graceful shutdown in Dynamo ensures that:

  1. No new requests are accepted - Endpoints are immediately invalidated
  2. In-flight requests complete - Existing requests finish processing (configurable)
  3. Resources are cleaned up - Engines, connections, and temporary files are released
  4. Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior

Signal Handling

All Dynamo components handle Unix signals for graceful shutdown:

Signal | Trigger | Behavior
------ | ------- | --------
SIGTERM | Kubernetes pod termination | Graceful shutdown initiated
SIGINT | Ctrl+C / manual interrupt | Graceful shutdown initiated

Implementation

Each component registers signal handlers at startup:

def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)

The graceful_shutdown() function:

  1. Logs the shutdown signal
  2. Calls runtime.shutdown() to invalidate endpoints
  3. Waits for in-flight requests (based on configuration)
  4. Returns to allow cleanup to proceed
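
A minimal sketch of such a coroutine, assuming a runtime object with a synchronous shutdown() method as in the handler snippet above (names and log text are illustrative, not the exact Dynamo implementation):

async def graceful_shutdown(runtime):
    # Log the shutdown signal so operators can see why the process is exiting
    logger.info("Received shutdown signal, shutting down DistributedRuntime")
    # Invalidate endpoints; in-flight requests are then drained according to
    # each endpoint's graceful_shutdown setting before cleanup proceeds
    runtime.shutdown()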

Endpoint Draining

When runtime.shutdown() is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.

Configuration

When registering an endpoint, the graceful_shutdown parameter controls draining behavior:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)

graceful_shutdown | Behavior
----------------- | --------
True | Wait for all in-flight requests to complete before returning
False | Return immediately without waiting for requests

Component-Specific Behavior

Component | Default Behavior | Rationale
--------- | ---------------- | ---------
Frontend | N/A (HTTP server) | HTTP server handles its own shutdown
Prefill Workers | graceful_shutdown=True | Prefill operations must complete to avoid wasted computation
Decode Workers | Conditional | If migration is enabled (migration_limit > 0), shut down immediately to allow migration; otherwise wait
Router | graceful_shutdown=True | Ensure routing decisions complete

Decode Worker Migration Integration

Decode workers use conditional draining based on whether request migration is supported:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
    ...
)

When migration_limit > 0:

  • Worker shuts down immediately (graceful_shutdown=False)
  • In-flight requests are migrated to healthy workers
  • No request loss occurs

When migration_limit <= 0:

  • Worker waits for in-flight requests (graceful_shutdown=True)
  • Migration is not available
  • Requests complete on the shutting-down worker

Resource Cleanup

After endpoint draining, components clean up their resources in finally blocks:

vLLM Worker Cleanup

finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()

The handler’s cleanup() method:

  • Removes temporary directories (LoRA adapters, etc.)
  • Releases engine resources
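
As a rough illustration of what such a method does (the attribute names here are hypothetical, not the actual handler fields):

import shutil

def cleanup(self):
    # Remove temporary directories (e.g. extracted LoRA adapters)
    for tmp_dir in self._temp_dirs:  # hypothetical list of temp dirs created at startup
        shutil.rmtree(tmp_dir, ignore_errors=True)
    self._temp_dirs.clear()

    # Release engine resources so GPU memory is freed before the process exits
    if self.engine_client is not None:  # hypothetical engine handle
        self.engine_client.shutdown()
        self.engine_client = None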

SGLang Worker Cleanup

def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()

TensorRT-LLM Worker Cleanup

async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None

Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

Engine Health Monitoring (vLLM)

The VllmEngineMonitor continuously checks engine health:

async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)

Configuration:

  • HEALTH_CHECK_INTERVAL: 2 seconds between checks
  • ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown
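
The shutdown timeout is simply an upper bound on how long engine teardown may take. A sketch of that pattern (not the actual monitor code; shutdown_coro stands in for whatever awaitable stops the engine):

import asyncio

ENGINE_SHUTDOWN_TIMEOUT = 30  # seconds

async def shutdown_with_timeout(shutdown_coro):
    # Bound engine shutdown so a hung engine cannot stall the process exit
    try:
        await asyncio.wait_for(shutdown_coro, timeout=ENGINE_SHUTDOWN_TIMEOUT)
    except asyncio.TimeoutError:
        logger.warning("Engine did not shut down within %ss", ENGINE_SHUTDOWN_TIMEOUT)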

Fatal Error Handling (TensorRT-LLM)

async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)

Kubernetes Integration

Pod Termination Flow

  1. Kubernetes sends SIGTERM to the pod
  2. Dynamo initiates graceful shutdown
  3. Pod has terminationGracePeriodSeconds to complete (default: 30s)
  4. If the pod has not exited by then, Kubernetes sends SIGKILL

For production deployments, configure adequate termination grace period:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining

Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

  • During shutdown: Endpoints become unavailable
  • Readiness probe fails: Traffic stops routing to the pod
  • Graceful draining: Existing requests complete

Best Practices

1. Set Appropriate Grace Periods

Match terminationGracePeriodSeconds to your expected request completion time:

  • Short requests (< 10s): 30s grace period
  • Long generation (> 30s): 120s+ grace period

2. Enable Request Migration for Decode Workers

If using disaggregated serving, enable migration for decode workers:

--migration-limit 3  # Allow up to 3 migration attempts

This allows immediate shutdown while preserving request state.
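
For illustration, this is how such a flag can feed the conditional draining shown earlier (a hypothetical argparse sketch, not the actual worker CLI):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--migration-limit", type=int, default=0,
                    help="Maximum number of migration attempts per request")
config = parser.parse_args()

# With migration available (limit > 0), shut down immediately and let requests migrate;
# otherwise wait for in-flight requests to finish on this worker.
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,
)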

3. Monitor Shutdown Metrics

Track shutdown behavior via logs:

INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker

4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources