---
title: Dynamo Architecture Flow
---
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/backends/vllm](https://github.com/ai-dynamo/dynamo/blob/v0.8.1/examples/backends/vllm). Color-coded flows indicate different types of operations.
> **Note**: The "Processor" shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the `make_engine` function.
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
4. **Route (S3)**: Frontend routes the validated request to appropriate Decode Worker
## 🟠 Decision and Allocation Flow (Orange)
The system's intelligent routing and resource allocation:
4. **Query (S4)**: Decode Worker queries for prefix cache hits to optimize processing
5. **Disagg Decision (S5)**: Based on prefill length and queue size, the system decides whether it needs remote prefill
5a. **Allocate (S5a)**: Decode Worker pre-allocates KV cache blocks in its local GPU memory
6. **Queue (S6)**: If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
## 🟢 Prefill Worker Flow (Green)
The dedicated prefill processing pipeline:
7. **NATS Pull (S7)**: PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
8. **Load Metadata (S8)**: PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
9. **Prefill (S9)**: Worker executes the prefill computation on the input tokens
10. **NIXL Transfer (S10)**: Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
## 🟣 Completion Flow (Purple)
The response generation and delivery:
11. **Notify (S11)**: PrefillWorker sends completion notification to Decode Worker
12. **Decode (S12)**: Decode Worker decodes from its local KV cache containing prefilled data
13. **Response (S13)**: The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client
## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
### Service Discovery
- **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd for service discovery and endpoint registration.
### Request Plane
- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
### NATS Connections (Optional, for KV routing)
- **PrefillQueue**: JetStream consumer group for reliable work distribution in disaggregated serving
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for both Decode Worker and PrefillWorker
## Technical Implementation Details
### NIXL (NVIDIA Interchange Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
- Block-based transfers (64–128 tokens per block) for efficient batching
### Disaggregated KV Cache:
- Each Decode Worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—all transfers are direct worker-to-worker
- Pre-allocated blocks ensure deterministic memory layout and performance
```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD
%% Top Layer - Client & Frontend
Client["HTTP Client"]
S1[["1 DISCOVERY"]]
Frontend["Frontend
OpenAI Compatible Server
Port 8000"]
S2[["2 REQUEST"]]
%% Processing Layer
Processor["Processor
Request Handler & Router"]
S3[["3 VALIDATE"]]
%% Infrastructure - Positioned strategically to minimize crossings
subgraph INF["Infrastructure Layer"]
ETCD[("ETCD
Service Discovery &
NIXL Metadata")]
NATS[("NATS
Message Broker")]
Planner["Planner
Resource Management
Auto-scaling"]
end
%% Worker Layer - Main processing
subgraph WL["Worker Layer"]
%% VllmWorker section
VllmWorker["Decode Worker
Handles Decoding & Disagg Decisions"]
S4[["4 QUERY"]]
S5[["5 DISAGG DECISION"]]
S5a[["5a ALLOCATE"]]
S12[["12 DECODE"]]
S6[["6 QUEUE"]]
S13[["13 RESPONSE"]]
%% Storage positioned near workers
LocalKVCache[("Local KV Cache
Pre-allocated Blocks")]
%% Prefill System - Right side to minimize crossings
subgraph PS["Prefill System"]
PrefillQueue["Prefill Queue
NATS JetStream
Consumer Group"]
PrefillWorker["Prefill Worker
Dedicated Prefill Processing
(Multiple Instances)"]
S7[["7 NATS PULL"]]
S8[["8 LOAD METADATA"]]
S9[["9 PREFILL"]]
S10[["10 NIXL TRANSFER"]]
S11[["11 NOTIFY"]]
end
end
%% Main Request Flow (Blue) - Clean vertical flow
Client -.-> S1
S1 -->|HTTP API Call| Frontend
Frontend -.-> S2
S2 -->|Process & Validate| Processor
Processor -.-> S3
S3 -->|Route to Worker| VllmWorker
%% VllmWorker Internal Flow (Orange)
VllmWorker -.-> S4
S4 -->|Query Prefix Cache Hit| S5
S5 -->|Prefill Length & Queue Check| S5a
S5a -->|Continue to Decode| S12
%% Allocation & Queuing (Orange) - Minimize crossings
S5a -->|Allocate KV Cache Blocks| LocalKVCache
VllmWorker --> S6
S6 -->|Put RemotePrefillRequest| PrefillQueue
%% Prefill Worker Flow (Green) - Self-contained within PS
PrefillQueue -.-> S7
S7 -->|Consumer Group Pull| PrefillWorker
PrefillWorker -.-> S8
PrefillWorker -.-> S9
S9 -->|Execute Prefill| S10
S10 -->|Direct GPU Transfer| LocalKVCache
PrefillWorker --> S11
%% Return Flow (Purple) - Clean return path
S11 -->|Completion Notification| S12
S12 -->|Decode from KV Cache| S13
S13 -->|Post-process Response| Processor
Processor -->|HTTP Response| Frontend
Frontend -->|Final Response| Client
%% Infrastructure Connections - Organized to avoid crossings
%% ETCD Connections - Grouped by proximity
Frontend -.->|Service Discovery| ETCD
Processor -.->|Service Discovery| ETCD
VllmWorker -.->|NIXL Metadata| ETCD
PrefillWorker -.->|NIXL Metadata| ETCD
S8 -.->|Load NIXL Metadata| ETCD
Planner -.->|Service Discovery| ETCD
%% NATS Connections - Direct to queue system
PrefillQueue -.->|JetStream| NATS
Processor -.->|Load Balancing| NATS
%% Planning Connections - Strategic positioning
Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| VllmWorker
Planner -.->|Auto-scaling| PrefillWorker
%% Styling - Each component with unique colors
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client
class Frontend frontend
class Processor processor
class VllmWorker worker
class PrefillQueue prefillQueue
class PrefillWorker prefillWorker
class Planner planner
class LocalKVCache storage
class ETCD etcd
class NATS nats
class PS prefillBox
class INF infraLayer
class WL workerLayer
%% Flow Colors - Different line styles to reduce visual clutter
%% Main Request Flow - Blue (solid)
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 1 stroke:#1565C0,stroke-width:4px
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 3 stroke:#1565C0,stroke-width:4px
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 5 stroke:#1565C0,stroke-width:4px
%% Decision & Allocation Flow - Orange (mixed)
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 7 stroke:#E65100,stroke-width:4px
linkStyle 8 stroke:#E65100,stroke-width:4px
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
%% KV Cache & Queue - Orange (solid)
linkStyle 10 stroke:#E65100,stroke-width:4px
linkStyle 11 stroke:#E65100,stroke-width:4px
linkStyle 12 stroke:#E65100,stroke-width:4px
%% Prefill Worker Flow - Green (mixed)
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 14 stroke:#2E7D32,stroke-width:4px
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 17 stroke:#2E7D32,stroke-width:4px
linkStyle 18 stroke:#2E7D32,stroke-width:4px
linkStyle 19 stroke:#2E7D32,stroke-width:4px
%% Completion Flow - Purple (mixed)
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
%% ETCD Connections - Gray (dotted, thinner)
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
%% NATS Connections - Teal (dotted, thinner)
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
%% Planning Connections - Gold (dotted, thinner)
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
```