---
title: Reference Guide
subtitle: 'Configuration, arguments, and operational details for the vLLM backend'
---

# Reference Guide

## Overview

The vLLM backend in Dynamo integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting.

Dynamo vLLM uses vLLM's native argument parser, so all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.

## Argument Reference

The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI:

```bash
python -m dynamo.vllm --help
```

The `--help` output is organized into the following groups:

- **Dynamo Runtime Options**: Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use `DYN_*` env vars.
- **Dynamo vLLM Options**: Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use `DYN_VLLM_*` env vars.
- **vLLM Engine Options**: All native vLLM arguments (`--model`, `--tensor-parallel-size`, `--kv-transfer-config`, `--kv-events-config`, `--enable-prefix-caching`, etc.). See the [vLLM serve args documentation](https://docs.vllm.ai/en/stable/configuration/serve_args.html).

### Prompt Embeddings

Dynamo supports [vLLM prompt embeddings](https://docs.vllm.ai/en/stable/features/prompt_embeds.html): pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.
- Enable with `--enable-prompt-embeds` (disabled by default)
- Embeddings are sent as base64-encoded PyTorch tensors via the `prompt_embeds` field in the Completions API
- NATS must be configured with a 15 MB max payload to accommodate large embeddings (already set in default deployments)

## Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's built-in hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm:

```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```

See the high-level notes in [Router Design](/dynamo/dev/design-docs/router-design#deterministic-event-ids) on deterministic event IDs.

## Graceful Shutdown

vLLM workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received:

1. **Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it
2. **Grace period**: In-flight requests are allowed to complete (configurable via `DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS`, default 5s)
3. **Resource cleanup**: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released

All vLLM endpoints use `graceful_shutdown=True`, meaning they wait for in-flight requests to finish before exiting. An internal `VllmEngineMonitor` also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.

For more details, see [Graceful Shutdown](/dynamo/dev/user-guides/fault-tolerance/graceful-shutdown).
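The three-step shutdown sequence above can be sketched with `asyncio`. This is a minimal illustration, not Dynamo's actual implementation: the function names (`deregister_from_discovery`, `graceful_shutdown`) and the in-process grace-period constant are hypothetical stand-ins for what the runtime does internally.

```python
import asyncio

GRACE_PERIOD_SECS = 5  # mirrors the DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS default

in_flight: set = set()  # tasks for requests currently being served


async def handle_request(delay: float) -> str:
    """Stand-in for serving one request (token generation)."""
    await asyncio.sleep(delay)
    return "done"


def deregister_from_discovery() -> None:
    # Step 1: remove the worker from service discovery so the router
    # stops sending it new requests (illustrative no-op here).
    pass


async def graceful_shutdown() -> None:
    deregister_from_discovery()
    # Step 2: give in-flight requests up to the grace period to finish.
    if in_flight:
        _done, pending = await asyncio.wait(in_flight, timeout=GRACE_PERIOD_SECS)
        # Step 3: cancel any stragglers, then release engine resources.
        for task in pending:
            task.cancel()


async def main() -> str:
    # Simulate one in-flight request, then a SIGTERM-triggered shutdown.
    task = asyncio.ensure_future(handle_request(0.01))
    in_flight.add(task)
    await graceful_shutdown()
    return task.result()


print(asyncio.run(main()))  # → done
```

In a real worker the shutdown coroutine would be wired to `SIGTERM`/`SIGINT` via the event loop's signal handling; the point of the sketch is the ordering: deregister first, then wait out the grace period, then clean up.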
## Health Checks

Each worker type has a specialized health check payload that validates the full inference pipeline:

| Worker Type | Health Check Strategy |
|-------------|-----------------------|
| Decode / Aggregated | Short generation request (`max_tokens=1`) using the model's BOS token |
| Prefill | Same payload structure as decode, adapted for the prefill request format |
| vLLM-Omni | Short generation request via AsyncOmni with the model's BOS token |

Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via the `DYN_HEALTH_CHECK_PAYLOAD` environment variable.

See [Health Checks](/dynamo/dev/user-guides/observability-local/health-checks) for the broader health check architecture.

## Request Cancellation

When a user cancels a request (for example, by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

For more details, see the [Request Cancellation Architecture](/dynamo/dev/user-guides/fault-tolerance/request-cancellation) documentation.

## Request Migration

Dynamo supports [request migration](/dynamo/dev/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests that were in flight on a failed worker can be automatically migrated to healthy workers mid-generation.

See the [Request Migration Architecture](/dynamo/dev/user-guides/fault-tolerance/request-migration) documentation for configuration details.
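Returning to request cancellation above: the key property is that a client disconnect propagates to the worker's generation loop, which then frees the request's resources. The sketch below illustrates that flow with `asyncio`'s cooperative cancellation; the function names and timings are illustrative and not part of Dynamo's API.

```python
import asyncio


async def generate_tokens() -> None:
    """Stand-in for a worker's generation loop.

    A frontend disconnect surfaces here as asyncio.CancelledError,
    which is the worker's cue to stop and release compute resources.
    """
    while True:
        await asyncio.sleep(0.001)  # one "decode step"


async def frontend_disconnects() -> bool:
    task = asyncio.ensure_future(generate_tokens())
    await asyncio.sleep(0.01)  # the client streams for a moment...
    task.cancel()              # ...then disconnects, cancelling the request
    try:
        await task
    except asyncio.CancelledError:
        # In a real worker this is where the engine would free the
        # request's KV cache blocks and scheduler slot.
        return True
    return False


print(asyncio.run(frontend_disconnects()))  # → True
```

In the disaggregated case the same signal has to reach both the prefill and decode workers handling the request, which is why the table above marks all four combinations as supported.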
## See Also

- **[Examples](/dynamo/dev/additional-resources/v-llm-details/examples)**: All deployment patterns with launch scripts
- **[vLLM README](/dynamo/dev/backends/v-llm)**: Quick start and feature overview
- **[Observability](/dynamo/dev/additional-resources/v-llm-details/observability)**: Metrics and monitoring setup
- **[Router Guide](/dynamo/dev/components/router/router-guide)**: KV-aware routing configuration
- **[Fault Tolerance](/dynamo/dev/user-guides/fault-tolerance)**: Request migration, cancellation, and graceful shutdown