Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in examples/backends/vllm. Color-coded flows indicate different types of operations:
🔵 Main Request Flow (Blue)
The primary user journey through the system:
- Discovery (S1): Client discovers the service endpoint
- Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
- Validate (S3): Frontend forwards request to Processor for validation and routing
- Route (S3): Processor routes the validated request to appropriate Decode Worker
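As a concrete illustration of S2, a request to the Frontend is a standard OpenAI-compatible HTTP call on port 8000. The sketch below builds such a request with the Python standard library; the endpoint path and model name are assumptions for illustration, not taken from the example code:

```python
import json
import urllib.request

# Hypothetical OpenAI-compatible chat request to the Dynamo Frontend
# (port 8000 per the flow above; the model name is an assumption).
payload = {
    "model": "meta-llama/Llama-3-8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request)  # would send it to a running Frontend
```

From the client's point of view this is indistinguishable from talking to a single monolithic server; the disaggregation below is invisible behind the Frontend.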
🟠 Decision and Allocation Flow (Orange)
The system's intelligent routing and resource allocation:
- Query (S4): Decode Worker queries for prefix cache hits to optimize processing
- Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill
- Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
- Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
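The disaggregation decision in S5 can be sketched as a simple predicate. The function and threshold values below are illustrative assumptions, not Dynamo's actual heuristics:

```python
def needs_remote_prefill(prefill_length: int, prefill_queue_size: int,
                         max_local_prefill: int = 512,
                         max_queue_size: int = 4) -> bool:
    """Decide whether to offload prefill to a dedicated PrefillWorker.

    Long prompts benefit from remote prefill, but only while the
    PrefillQueue is not already backed up (thresholds are illustrative).
    """
    return (prefill_length > max_local_prefill
            and prefill_queue_size < max_queue_size)

# needs_remote_prefill(2048, 1)  -> True  (long prompt, short queue)
# needs_remote_prefill(100, 0)   -> False (short prompt: prefill locally)
```

Note that the decision precedes allocation: the Decode Worker reserves the destination blocks (S5a) before the request is enqueued, so the PrefillWorker later knows exactly where to write.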
🟢 Prefill Worker Flow (Green)
The dedicated prefill processing pipeline:
- NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
- Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
- Prefill (S9): Worker executes the prefill computation on the input tokens
- NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
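One iteration of this pipeline can be modeled as: pull a request carrying the pre-allocated block IDs, compute the prefill, and write the result into those blocks. This is an illustrative in-process model, with a local `Queue` standing in for NATS and stubs for the prefill and NIXL steps; the real message shape is engine-specific:

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class RemotePrefillRequest:
    # Field names are illustrative, not the actual message schema.
    request_id: str
    token_ids: list[int]
    block_ids: list[int]  # pre-allocated KV blocks on the Decode Worker

prefill_queue: Queue = Queue()  # stands in for the NATS JetStream queue

def prefill_worker_step(queue: Queue) -> str:
    """One PrefillWorker iteration (S7, S9, S10), sketched."""
    req = queue.get()                                  # S7: pull work
    kv = [f"kv-for-token-{t}" for t in req.token_ids]  # S9: prefill (stub)
    # S10: a real worker would NIXL-write `kv` into req.block_ids on the
    # Decode Worker's GPU; here we just report what would be transferred.
    return f"wrote {len(kv)} KV entries into blocks {req.block_ids}"
```

The key property this preserves from the real system is that the destination (block IDs) travels with the work item, so the worker needs no extra round trip to find out where the KV cache belongs.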
🟣 Completion Flow (Purple)
The response generation and delivery:
- Notify (S11): PrefillWorker sends completion notification to Decode Worker
- Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data
- Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
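The ordering constraint in S11/S12 is that decoding must not start until the KV transfer has landed. A minimal sketch of that handshake, using an `asyncio.Event` as a stand-in for the real notification channel:

```python
import asyncio

async def decode_worker(prefill_done: asyncio.Event) -> str:
    # S11 -> S12: wait for the prefill-complete notification, then
    # decode from the now-populated local KV cache.
    await prefill_done.wait()
    return "decoding from local KV cache"

async def prefill_worker(prefill_done: asyncio.Event) -> None:
    await asyncio.sleep(0)   # stand-in for prefill + NIXL transfer
    prefill_done.set()       # S11: completion notification

async def main() -> str:
    done = asyncio.Event()
    decode_task = asyncio.create_task(decode_worker(done))
    await prefill_worker(done)
    return await decode_task

# asyncio.run(main()) -> "decoding from local KV cache"
```

Because the blocks were pre-allocated in S5a, the notification needs to carry no payload: the Decode Worker already knows where the prefilled data lives.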
🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
ETCD Connections (Gray, dotted)
- Frontend, Processor, Planner: Service discovery and registration
- Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup
NATS Connections (Teal, dotted)
- PrefillQueue: JetStream consumer group for reliable work distribution
- Processor: Load balancing across workers
Planning Connections (Gold, dotted)
- Frontend → Planner: Metrics collection for auto-scaling decisions
- Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
Technical Implementation Details
NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
- Block-based transfers (64β128 tokens per block) for efficient batching
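The block granularity implies a simple blocks-per-prompt calculation: the prompt's tokens are packed into fixed-size blocks, with the last block possibly partial. A sketch using the 64-token block size mentioned above:

```python
import math

def kv_blocks_needed(num_prompt_tokens: int, block_size: int = 64) -> int:
    """Number of KV cache blocks to pre-allocate for a prompt.
    The final block may be only partially filled."""
    return math.ceil(num_prompt_tokens / block_size)

# kv_blocks_needed(1000)      -> 16 (15 full blocks + 1 partial)
# kv_blocks_needed(128, 128)  -> 1
```

This is the count the Decode Worker reserves in S5a and the PrefillWorker fills in S10; transferring whole blocks rather than individual tokens is what makes the NIXL writes batchable.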
Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks; all transfers are direct worker-to-worker
- Pre-allocated blocks ensure deterministic memory layout and performance
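Pre-allocation can be modeled as a free list of fixed-size blocks: reserving IDs up front is what gives the deterministic layout that the later NIXL write targets. The class below is an illustrative model, not the engine's actual allocator:

```python
class KVBlockPool:
    """Illustrative fixed-pool allocator for KV cache blocks."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))

    def allocate(self, n: int) -> list[int]:
        """Reserve n block IDs up front (as in S5a); fail fast if the
        pool cannot satisfy the request."""
        if n > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks, self.free = self.free[:n], self.free[n:]
        return blocks

    def release(self, blocks: list[int]) -> None:
        """Return blocks to the pool once the request completes."""
        self.free.extend(blocks)

# pool = KVBlockPool(8); pool.allocate(3) -> [0, 1, 2]
```

Failing at allocation time, before any work is enqueued, is what keeps the later transfer and decode steps free of out-of-memory surprises.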