Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in examples/backends/vllm. Color-coded flows indicate different types of operations.
Note: The “Processor” shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment; the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
🔵 Main Request Flow (Blue)
The primary user journey through the system:
- Discovery (S1): Client discovers the service endpoint
- Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
- Validate (S3): Frontend preprocesses the request (applies chat template, tokenizes) and validates it
- Route (S3): Frontend routes the validated request to appropriate Decode Worker
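The request in S2 is an ordinary OpenAI-compatible chat completion call against the Frontend on port 8000. A minimal sketch of the payload shape (the model name here is a placeholder assumption, not something Dynamo mandates):

```python
import json

# Sketch of the S2 request body accepted by the Frontend's
# OpenAI-compatible endpoint (port 8000). The model name is an
# illustrative placeholder.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Explain disaggregated inference."}],
    "max_tokens": 128,
    "stream": False,
}

# The Frontend would be reached at http://localhost:8000/v1/chat/completions;
# here we only serialize the payload to show its shape.
body = json.dumps(payload)
```

The Frontend then performs S3 itself (chat template + tokenization) before routing, so workers never see raw chat messages.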
🟠 Decision and Allocation Flow (Orange)
The system’s intelligent routing and resource allocation:
- Query (S4): Decode Worker queries for prefix cache hits to optimize processing
- Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill
- Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
- Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
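The S5/S6 steps can be sketched as a heuristic plus a small message type. The thresholds and the `RemotePrefillRequest` fields below are illustrative assumptions, not Dynamo's actual defaults or wire format:

```python
from dataclasses import dataclass


def needs_remote_prefill(prefill_len: int, queue_size: int,
                         length_threshold: int = 512,
                         max_queue: int = 8) -> bool:
    """Hypothetical S5 decision: long prefills go to dedicated
    PrefillWorkers unless their queue is already saturated.
    Thresholds are illustrative, not Dynamo defaults."""
    return prefill_len > length_threshold and queue_size < max_queue


@dataclass
class RemotePrefillRequest:
    # Sketch only: the real message carries at least a request id and
    # the KV block IDs pre-allocated in S5a, so the PrefillWorker knows
    # where to write the transferred cache (S10).
    request_id: str
    block_ids: list[int]


decision = needs_remote_prefill(prefill_len=2048, queue_size=2)
req = RemotePrefillRequest(request_id="req-42", block_ids=[0, 1, 2])
```

Passing the pre-allocated block IDs in the queued request is what later lets the PrefillWorker write directly into the Decode Worker's memory without a rendezvous step.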
🟢 Prefill Worker Flow (Green)
The dedicated prefill processing pipeline:
- NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
- Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
- Prefill (S9): Worker executes the prefill computation on the input tokens
- NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker’s pre-allocated blocks
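The pull-based distribution in S7 can be modeled in miniature: several workers compete for items on one shared queue, so each queued request is processed exactly once by whichever worker pulls it first. This is a toy stand-in for the NATS consumer group, not the actual NATS client code:

```python
import queue
import threading

# Toy model of S7: a pull-based consumer group. Each queued
# RemotePrefillRequest is claimed by exactly one PrefillWorker.
prefill_queue: "queue.Queue[str]" = queue.Queue()
for req_id in ("req-0", "req-1", "req-2", "req-3"):
    prefill_queue.put(req_id)

processed: dict[str, list[str]] = {"w0": [], "w1": []}


def prefill_worker(name: str) -> None:
    while True:
        try:
            req = prefill_queue.get_nowait()  # atomic claim, like a NATS pull
        except queue.Empty:
            return
        processed[name].append(req)  # stand-in for S8-S10


threads = [threading.Thread(target=prefill_worker, args=(w,)) for w in processed]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The important property, which JetStream consumer groups provide with durability guarantees on top, is that work is load-balanced without any request being prefilled twice.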
🟣 Completion Flow (Purple)
The response generation and delivery:
- Notify (S11): PrefillWorker sends completion notification to Decode Worker
- Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data
- Response (S13): The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client
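The S11→S12 ordering constraint (decode must not start until the KV transfer has landed) can be sketched with an event, with the NIXL write itself elided:

```python
import asyncio

# Toy model of S11-S12: the Decode Worker waits for the PrefillWorker's
# completion notification before decoding from its now-populated local
# KV cache. The NIXL transfer (S10) is reduced to a dict write.


async def prefill_worker(done: asyncio.Event, kv_cache: dict) -> None:
    kv_cache["blocks_filled"] = True  # stands in for the NIXL write
    done.set()                        # S11: completion notification


async def decode_worker(done: asyncio.Event, kv_cache: dict) -> str:
    await done.wait()                 # decode must not start early
    assert kv_cache["blocks_filled"]
    return "generated text"           # S12: decode from local KV cache


async def main() -> str:
    done, kv_cache = asyncio.Event(), {}
    _, result = await asyncio.gather(prefill_worker(done, kv_cache),
                                     decode_worker(done, kv_cache))
    return result


result = asyncio.run(main())
```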
🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
Service Discovery
- On Kubernetes (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- On bare metal: Uses etcd for service discovery and endpoint registration.
Request Plane
- TCP (default): Direct TCP connections between Frontend and Workers for request/response transport.
- HTTP/NATS: Alternative transports configurable via `DYN_REQUEST_PLANE`.
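A sketch of how a component might read this setting; treating "tcp" as the default value and "tcp"/"http"/"nats" as the valid spellings is an assumption for illustration:

```python
import os

# DYN_REQUEST_PLANE selects the request-plane transport (TCP by default
# per the text above). The default and the exact accepted values here
# are assumptions, not verified against Dynamo's source.
request_plane = os.environ.get("DYN_REQUEST_PLANE", "tcp").lower()
```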
NATS Connections (Optional, for KV routing)
- PrefillQueue: JetStream consumer group for reliable work distribution in disaggregated serving
- KV Events: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
Planning Connections (Gold, dotted)
- Frontend → Planner: Metrics collection for auto-scaling decisions
- Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
Technical Implementation Details
NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
- Block-based transfers (64–128 tokens per block) for efficient batching
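With a fixed tokens-per-block size (64 or 128, per the list above), a prompt maps to a deterministic number of blocks, which is exactly what the Decode Worker pre-allocates in S5a. A minimal sketch of that accounting:

```python
# Block-based KV accounting sketch: ceiling division of token count by
# the block size. 64 is used as the example block size here.
def blocks_needed(num_tokens: int, block_size: int = 64) -> int:
    return -(-num_tokens // block_size)  # ceiling division


n = blocks_needed(1000, block_size=128)
```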
Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker
- Pre-allocated blocks ensure deterministic memory layout and performance
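The pre-allocation point can be illustrated with a fixed block pool: block IDs are handed out and returned, so there is no allocation on the hot path and the memory layout is deterministic. Sizes and IDs below are illustrative, not Dynamo's:

```python
# Minimal sketch of a pre-allocated KV block pool. A fixed set of block
# IDs is created up front; requests borrow and return IDs rather than
# allocating memory per request.
class BlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free):
            raise RuntimeError("KV cache exhausted")
        ids, self.free = self.free[:n], self.free[n:]
        return ids

    def release(self, ids: list[int]) -> None:
        self.free.extend(ids)


pool = BlockPool(8)
blocks = pool.allocate(3)   # e.g. blocks reserved in S5a for one request
pool.release(blocks)        # returned once the request completes
```

Handing the borrowed IDs to the PrefillWorker (via the `RemotePrefillRequest` in S6) is what makes the later NIXL write a direct, address-known transfer.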