Reference Guide
Overview
The vLLM backend in Dynamo integrates vLLM engines into Dynamo’s distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM’s native KV cache events, NIXL-based transfer mechanisms, and metric reporting.
Dynamo vLLM uses vLLM’s native argument parser — all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.
Argument Reference
The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI.
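For example, assuming the worker is launched as the `dynamo.vllm` Python module (the entry point used in Dynamo's examples; adjust to your installation), the full grouped listing is printed with:

```shell
# List all supported arguments. The module path is taken from
# Dynamo's examples and may differ in your installation.
python -m dynamo.vllm --help
```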
The --help output is organized into the following groups:
- Dynamo Runtime Options — Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use `DYN_*` env vars.
- Dynamo vLLM Options — Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use `DYN_VLLM_*` env vars.
- vLLM Engine Options — All native vLLM arguments (`--model`, `--tensor-parallel-size`, `--kv-transfer-config`, `--kv-events-config`, `--enable-prefix-caching`, etc.). See the vLLM serve args documentation.
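As a sketch, a single launch can mix arguments from all three groups. The Dynamo-side flag and the model name below are illustrative assumptions; check `--help` for the exact spelling in your version:

```shell
# Illustrative only: one worker command mixing Dynamo and vLLM arguments.
# --is-prefill-worker is a Dynamo vLLM flag (name assumed; verify via --help);
# the remaining flags are native vLLM engine arguments passed through unchanged.
python -m dynamo.vllm \
  --is-prefill-worker \
  --model Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 2 \
  --enable-prefix-caching
```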
Prompt Embeddings
Dynamo supports vLLM prompt embeddings — pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.
- Enable with `--enable-prompt-embeds` (disabled by default)
- Embeddings are sent as base64-encoded PyTorch tensors via the `prompt_embeds` field in the Completions API
- NATS must be configured with a 15MB max payload for large embeddings (already set in default deployments)
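A request sketch, assuming a frontend at `localhost:8000` and a shell variable `EMBEDS_B64` that already holds a base64-encoded, `torch.save`-serialized tensor (the model name and address are placeholders):

```shell
# Sketch: pass pre-computed embeddings through the Completions API.
# EMBEDS_B64 must contain a base64-encoded PyTorch tensor; the model
# name and endpoint address below are placeholders.
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "my-model",
        "prompt_embeds": "'"${EMBEDS_B64}"'",
        "max_tokens": 32
      }'
```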
Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's built-in hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm.
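A minimal sketch of both options; the `--prefix-caching-hash-algo` flag exists in recent vLLM versions, but verify the name against your version's `--help`:

```shell
# Option 1: pin Python's built-in hash so prefix-cache hashes agree
# across every vLLM process.
export PYTHONHASHSEED=0

# Option 2 (version permitting): select a content-based hash instead,
# e.g. --prefix-caching-hash-algo sha256 on the engine command line.
```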
See the high-level notes on deterministic event IDs in Router Design.
Graceful Shutdown
vLLM workers use Dynamo’s graceful shutdown mechanism. When a SIGTERM or SIGINT is received:
- Discovery unregister: The worker is removed from service discovery so no new requests are routed to it
- Grace period: In-flight requests are allowed to complete (configurable via `DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS`, default 5s)
- Resource cleanup: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released
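For example, to extend the grace period beyond the 5-second default before starting a worker (the value here is illustrative):

```shell
# Give in-flight requests up to 30 seconds to finish on SIGTERM/SIGINT.
export DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS=30
```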
All vLLM endpoints use `graceful_shutdown=True`, meaning they wait for in-flight requests to finish before exiting. An internal `VllmEngineMonitor` also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.
For more details, see Graceful Shutdown.
Health Checks
Each worker type has a specialized health check payload that validates the full inference pipeline.
Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via the `DYN_HEALTH_CHECK_PAYLOAD` environment variable. See Health Checks for the broader health check architecture.
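For instance, a minimal override of the probe payload (the JSON shape here is illustrative; it must match the request format your worker type expects):

```shell
# Replace the default health-check request with a tiny generation probe.
export DYN_HEALTH_CHECK_PAYLOAD='{"prompt": "ping", "max_tokens": 1}'
```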
Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.
For more details, see the Request Cancellation Architecture documentation.
Request Migration
Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.
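Migration is typically bounded per request via a `--migration-limit` flag on the worker; the flag name and model below are assumptions, so confirm against `--help` and the Request Migration documentation:

```shell
# Allow a request to be migrated to a healthy worker at most 3 times
# before it fails. Flag name assumed; verify for your Dynamo version.
python -m dynamo.vllm --model Qwen/Qwen2.5-7B-Instruct --migration-limit 3
```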
See Also
- Examples: All deployment patterns with launch scripts
- vLLM README: Quick start and feature overview
- Observability: Metrics and monitoring setup
- Router Guide: KV-aware routing configuration
- Fault Tolerance: Request migration, cancellation, and graceful shutdown