Dynamo Architecture Flow

This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.

🔵 Main Request Flow (Blue)

The primary user journey through the system:

  1. Request (S1): HTTP client sends an API request to the Frontend (OpenAI-compatible server on port 8000); see the request sketch after this list
  2. Preprocess (S2): Frontend preprocesses the request (applies chat template, tokenizes) and validates it
  3. Route to Prefill (S3): PrefillRouter selects a prefill worker using KV-aware routing or load balancing
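
A minimal sketch of step S1, assuming the Frontend exposes the standard OpenAI `/v1/chat/completions` path on port 8000; the host and model name below are placeholders:

```python
import requests

# Hypothetical request against the OpenAI-compatible Frontend (S1).
# "example-model" and localhost are placeholders, not Dynamo defaults.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Hello, Dynamo!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```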

🟢 Prefill Flow (Green)

The prefill processing pipeline:

  1. Prefill (S4): Prefill worker executes the prefill computation on the input tokens and generates KV cache
  2. Return Metadata (S5): Prefill worker returns disaggregated_params containing backend-specific transfer metadata
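
An illustrative shape for the prefill response, to make S5 concrete. Every field name below is hypothetical; the real disaggregated_params payload is opaque and backend-specific (see the NIXL notes further down):

```python
# Illustrative only: a prefill response carrying transfer metadata.
# Field names are placeholders, not the actual wire format.
prefill_response = {
    "request_id": "req-123",
    "disaggregated_params": {
        "backend": "vllm",                    # assumed discriminator field
        "kv_block_ids": [17, 18, 19],         # e.g. vLLM-style block IDs
        "worker_endpoint": "prefill-0:9401",  # placeholder address
    },
}
```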

🟠 Decode Routing Flow (Orange)

Router orchestration to decode phase:

  1. Route to Decode (S6): PrefillRouter injects prefill result into decode request and routes to decode worker
  2. KV Transfer (S7): Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL

🟣 Completion Flow (Purple)

The response generation and delivery:

  1. Decode (S8): Decode worker generates tokens using the transferred KV cache
  2. Response (S9): Generated tokens stream back through the Frontend for post-processing (detokenization) and delivery to the Client
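
A minimal consumer for S9, assuming the Frontend streams OpenAI-style server-sent events when `stream` is set; the host and model name are placeholders:

```python
import json
import requests

# Read the streamed tokens chunk by chunk (OpenAI-style SSE lines).
with requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Stream, please"}],
        "stream": True,
    },
    stream=True,
    timeout=60,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```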

🔗 Infrastructure Connections (Dotted lines)

Coordination and messaging support:

Service Discovery

  • On Kubernetes (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
  • On bare metal: Uses etcd or filesystem for service discovery and endpoint registration.
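
A rough sketch of the selection logic described above, not Dynamo's actual code. It relies on the standard KUBERNETES_SERVICE_HOST variable that Kubernetes injects into pods; the ETCD_ENDPOINTS name is a hypothetical placeholder:

```python
import os

def pick_discovery_backend() -> str:
    """Illustrative selection only; not Dynamo's implementation."""
    # Inside a pod, Kubernetes sets KUBERNETES_SERVICE_HOST, so native
    # K8s discovery (CRDs + EndpointSlices) can be used without etcd.
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        return "kubernetes"
    # On bare metal, fall back to etcd if configured (hypothetical
    # variable name), otherwise to filesystem-based discovery.
    if os.environ.get("ETCD_ENDPOINTS"):
        return "etcd"
    return "filesystem"
```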

Request Plane

  • TCP (default): Direct TCP connections between Frontend and Workers for request/response transport.
  • HTTP/NATS: Alternative transports configurable via DYN_REQUEST_PLANE.
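
A small sketch of how a launcher script might read the transport setting. DYN_REQUEST_PLANE is the variable named above, but the accepted values shown here are an assumption:

```python
import os

# "tcp", "http", and "nats" mirror the transports listed above; the
# exact accepted strings are an assumption for illustration.
request_plane = os.environ.get("DYN_REQUEST_PLANE", "tcp").lower()
if request_plane not in {"tcp", "http", "nats"}:
    raise ValueError(f"unsupported request plane: {request_plane}")
```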

NATS Connections (Optional, for KV routing)

  • KV Events: Cache state events for KV-aware routing (can be disabled with --no-kv-events)

Planning Connections (Gold, dotted)

  • Frontend → Planner: Metrics collection for auto-scaling decisions
  • Planner → Workers: Resource scaling commands
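
A hypothetical scaling rule, only to illustrate the Planner's role of turning Frontend metrics into worker counts; neither the function nor the thresholds come from Dynamo:

```python
import math

# Placeholder policy: size the decode pool from a request-rate metric.
def desired_decode_workers(requests_per_s: float,
                           per_worker_capacity: float = 50.0,
                           min_workers: int = 1,
                           max_workers: int = 8) -> int:
    needed = math.ceil(requests_per_s / per_worker_capacity)
    return max(min_workers, min(max_workers, needed))
```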

Technical Implementation Details

PrefillRouter Orchestration:

  • The PrefillRouter sits between the Frontend and workers, orchestrating disaggregated serving
  • Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing; see the scoring sketch below
  • Injects transfer metadata into decode requests for KV cache coordination
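
A toy scoring sketch of the KV-aware selection mentioned above: prefer workers with high cache overlap and low load. The weighting and field names are assumptions, not Dynamo's actual routing policy:

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    name: str
    cache_overlap: float   # fraction of prompt blocks already cached (0..1)
    active_requests: int   # current load

def pick_prefill_worker(workers: list[WorkerState],
                        overlap_weight: float = 2.0) -> WorkerState:
    # Reward cache reuse, penalize load; the weight is arbitrary here.
    return max(
        workers,
        key=lambda w: overlap_weight * w.cache_overlap - w.active_requests,
    )
```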

NIXL (NVIDIA Inference Xfer Library):

  • Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
  • Transfer metadata exchanged via disaggregated_params in prefill response
  • Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
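
A sketch of dispatching on the engine backend when interpreting the transfer metadata; the keys mirror the bullet above and are placeholders, not the real wire format:

```python
# Hypothetical helper: summarize how each backend's metadata is used.
def describe_transfer(params: dict) -> str:
    backend = params.get("backend")
    if backend == "sglang":
        return f"bootstrap connection via {params.get('bootstrap_host')}"
    if backend == "vllm":
        return f"pull KV blocks {params.get('kv_block_ids')}"
    if backend == "trtllm":
        return "opaque state handled by the engine"
    raise ValueError(f"unknown backend: {backend}")
```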

Disaggregated KV Cache:

  • Each worker maintains local KV cache in its GPU memory
  • No shared storage bottlenecks; transfers are direct worker-to-worker via NIXL
  • Non-blocking transfers allow GPU forward passes to continue during KV transfer
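
A conceptual illustration of the non-blocking behavior using asyncio; the sleeps stand in for a NIXL transfer and for forward passes, and nothing here uses the real APIs:

```python
import asyncio

async def kv_transfer(request_id: str) -> None:
    await asyncio.sleep(0.05)          # stands in for a NIXL transfer
    print(f"{request_id}: KV cache received")

async def decode_loop() -> None:
    # Start the transfer in the background, then keep doing work
    # (forward passes for other requests) until it completes.
    transfer = asyncio.create_task(kv_transfer("req-123"))
    while not transfer.done():
        await asyncio.sleep(0.01)      # stands in for a forward pass
        print("forward pass on other requests")
    await transfer

asyncio.run(decode_loop())
```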