Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
🔵 Main Request Flow (Blue)
The primary user journey through the system:
- Request (S1): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
- Preprocess (S2): Frontend preprocesses the request (applies chat template, tokenizes) and validates it
- Route to Prefill (S3): PrefillRouter selects a prefill worker using KV-aware routing or load balancing
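For example, step S1 is an ordinary OpenAI-style HTTP call to the Frontend on port 8000. A minimal client sketch (the model name is an assumption for illustration; substitute the values for your deployment):

```python
# Minimal sketch of S1: an OpenAI-compatible chat request sent to the Frontend
# on port 8000. The model id below is a placeholder, not a required value.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
        "messages": [{"role": "user", "content": "Explain disaggregated serving."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```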
🟢 Prefill Flow (Green)
The prefill processing pipeline:
- Prefill (S4): Prefill worker executes the prefill computation on the input tokens and generates KV cache
- Return Metadata (S5): Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
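To make S5 concrete, here is a purely illustrative sketch of what the transfer metadata carried in `disaggregated_params` could look like. The field names are hypothetical; the real payload is backend-specific (see the NIXL notes below):

```python
# Illustrative only: a possible shape for prefill-to-decode transfer metadata.
# Field names are hypothetical, not the actual Dynamo schema.
from dataclasses import dataclass

@dataclass
class DisaggregatedParams:
    request_id: str         # ties the prefill output to the decode request
    prefill_worker: str     # endpoint of the worker holding the KV cache
    transfer_handle: bytes  # opaque, backend-specific transfer descriptor

params = DisaggregatedParams(
    request_id="req-123",
    prefill_worker="prefill-0:9345",
    transfer_handle=b"<backend-specific blob>",
)
```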
🟠 Decode Routing Flow (Orange)
Router orchestration to decode phase:
- Route to Decode (S6): PrefillRouter injects prefill result into decode request and routes to decode worker
- KV Transfer (S7): Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
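A conceptual sketch of the S6/S7 hand-off, assuming hypothetical worker and request objects (this is not the actual Dynamo router code):

```python
# Conceptual sketch: the router attaches the prefill worker's transfer
# metadata to the decode request and dispatches it to a decode worker.
def route_to_decode(decode_workers, decode_request, disaggregated_params):
    # S6: inject the prefill result so the decode worker knows where the KV cache lives
    decode_request["disaggregated_params"] = disaggregated_params
    # Simple load balancing across decode workers (KV-aware policies are also possible)
    worker = min(decode_workers, key=lambda w: w.active_requests)
    # S7 happens inside the decode worker: it pulls the KV cache directly via NIXL
    return worker.enqueue(decode_request)
```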
🟣 Completion Flow (Purple)
The response generation and delivery:
- Decode (S8): Decode worker generates tokens using the transferred KV cache
- Response (S9): Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
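From the client's perspective, S9 is a standard streamed OpenAI-compatible response. A minimal consumption sketch using the OpenAI Python client (the model name is an assumption):

```python
# Sketch of consuming the token stream returned through the Frontend (S9).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```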
Infrastructure Connections (Dotted lines)
Coordination and messaging support:
Service Discovery
- On Kubernetes (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- On bare metal: Uses etcd or filesystem for service discovery and endpoint registration.
Request Plane
- TCP (default): Direct TCP connections between Frontend and Workers for request/response transport.
- HTTP/NATS: Alternative transports configurable via `DYN_REQUEST_PLANE`
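As a hedged illustration, the transport could be selected through the environment before launching a component; only the variable name `DYN_REQUEST_PLANE` comes from the text above, while the value string and launch command are placeholders:

```python
# Placeholder sketch: set the request-plane transport for a child process.
import os
import subprocess

env = dict(os.environ, DYN_REQUEST_PLANE="nats")  # assumed value string; TCP is the default
subprocess.run(["my-dynamo-launch-script"], env=env, check=True)  # placeholder command
```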
NATS Connections (Optional, for KV routing)
- KV Events: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
Planning Connections (Gold, dotted)
- Frontend → Planner: Metrics collection for auto-scaling decisions
- Planner → Workers: Resource scaling commands for workers
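A purely illustrative sketch of that planning loop; the metric names, thresholds, and `scale()` call are hypothetical and stand in for whatever policy the Planner actually implements:

```python
# Illustrative planning loop: read Frontend metrics, issue scaling commands.
import time

def planning_loop(frontend, prefill_pool, decode_pool, interval_s=30):
    while True:
        metrics = frontend.get_metrics()          # e.g. queue depth, TTFT, tokens/s
        if metrics["prefill_queue_depth"] > 10:   # assumed threshold
            prefill_pool.scale(delta=+1)          # Planner -> Workers: add a prefill worker
        if metrics["decode_utilization"] < 0.3:   # assumed threshold
            decode_pool.scale(delta=-1)           # shrink idle decode capacity
        time.sleep(interval_s)
```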
Technical Implementation Details
PrefillRouter Orchestration:
- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
- Injects transfer metadata into decode requests for KV cache coordination
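A hedged sketch of the selection policy described above, combining a cache-overlap score with a load penalty. The weights and worker fields are illustrative, not the actual Dynamo router implementation:

```python
# Score each prefill worker by KV cache overlap with the prompt, penalize load,
# and pick the best candidate.
def select_prefill_worker(workers, prompt_block_hashes, overlap_weight=1.0, load_weight=0.5):
    def score(w):
        # Overlap: fraction of the prompt's KV blocks already cached on this worker
        overlap = len(prompt_block_hashes & w.cached_block_hashes) / max(len(prompt_block_hashes), 1)
        return overlap_weight * overlap - load_weight * w.active_requests
    return max(workers, key=score)
```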
NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Transfer metadata exchanged via `disaggregated_params` in the prefill response
- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
Disaggregated KV Cache:
- Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks; transfers are direct worker-to-worker via NIXL
- Non-blocking transfers allow GPU forward passes to continue during KV transfer
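A conceptual sketch of that overlap, assuming a hypothetical async transfer handle rather than the real NIXL bindings:

```python
# Conceptual only: start the GPU-to-GPU KV pull, keep stepping other requests
# while it completes, then decode once the blocks are local.
async def decode_with_kv_pull(nixl, engine, decode_request, params):
    transfer = nixl.begin_pull(params.transfer_handle)  # hypothetical handle API
    while not transfer.done():
        await engine.step_other_requests()              # forward passes continue during transfer
    return await engine.decode(decode_request)          # KV cache is now resident locally
```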