SGLang Disaggregated Serving
This document explains how SGLang’s disaggregated prefill-decode architecture works, both standalone and within Dynamo.
Overview
Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:
- Independent scaling of prefill and decode resources
- Better resource utilization (prefill is compute-bound, decode is memory-bound)
- Efficient KV cache transfer between workers using RDMA
How Dynamo Integrates with SGLang Disaggregation
SGLang’s standalone approach:
- The load balancer receives a request from the client
- A random `(prefill, decode)` pair is selected from the pool of available workers
- The request is sent to both the `prefill` and `decode` workers via asyncio tasks
- Internally, the disaggregated handoff runs from prefill → decode
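The standalone selection above can be sketched in a few lines. This is an illustrative toy, not SGLang's actual load-balancer code; the worker names and the `send` helper are assumptions standing in for real HTTP calls:

```python
import asyncio
import random

# Hypothetical worker pools; in SGLang these would be registered worker URLs.
PREFILL_WORKERS = ["prefill-0", "prefill-1"]
DECODE_WORKERS = ["decode-0", "decode-1"]

def pick_pair():
    """Select a random (prefill, decode) pair from the available pools."""
    return random.choice(PREFILL_WORKERS), random.choice(DECODE_WORKERS)

async def send(worker, request):
    # Stand-in for an HTTP request to the worker.
    await asyncio.sleep(0)
    return f"{worker} accepted {request['id']}"

async def dispatch(request):
    prefill, decode = pick_pair()
    # The request is sent to both workers concurrently via asyncio tasks;
    # the workers then coordinate the prefill -> decode handoff internally.
    return await asyncio.gather(send(prefill, request), send(decode, request))

print(asyncio.run(dispatch({"id": "req-1"})))
```

Both sends are issued concurrently with `asyncio.gather`, mirroring how the load balancer fires the request at the pair without waiting for prefill to finish first.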
Dynamo’s approach:
Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:
- Route to a decode worker first
- Choose a prefill worker via round-robin or KV-aware selection
- Send the request to both workers
- SGLang’s bootstrap server (part of the `tokenizer_manager`) is used in conjunction with NIXL/Mooncake to handle the KV transfer
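The round-robin option above is the simpler of the two selection policies. A minimal sketch (class and worker names are illustrative, not Dynamo's API):

```python
from itertools import cycle

class PrefillSelector:
    """Round-robin selection over the discovered prefill workers."""

    def __init__(self, workers):
        self._ring = cycle(workers)

    def next(self):
        return next(self._ring)

selector = PrefillSelector(["prefill-0", "prefill-1", "prefill-2"])
picks = [selector.next() for _ in range(4)]
print(picks)  # → ['prefill-0', 'prefill-1', 'prefill-2', 'prefill-0']
```

KV-aware selection would replace `next()` with a scoring step that prefers the prefill worker holding the longest cached prefix for the incoming prompt.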
Disaggregation Flow
The following diagram shows the complete request flow for disaggregated serving:
Key Steps Explained
Setup Phase (One-Time)
- Decode workers register their RDMA connection information with prefill workers
- This includes base GPU memory pointers for direct memory access
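The setup handshake above amounts to each decode worker publishing a small connection record that prefill workers cache for the lifetime of the deployment. A sketch under assumed field names (this is not SGLang's actual schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RdmaEndpoint:
    """Connection info a decode worker registers with prefill workers."""
    worker_id: str
    rdma_addr: str     # address of the RDMA-capable NIC (illustrative format)
    base_gpu_ptr: int  # base GPU memory pointer enabling direct remote writes

@dataclass
class PrefillWorker:
    endpoints: dict = field(default_factory=dict)

    def register_decode(self, ep: RdmaEndpoint):
        # Stored once at startup; reused for every subsequent request,
        # so no per-request connection setup is needed.
        self.endpoints[ep.worker_id] = ep

pw = PrefillWorker()
pw.register_decode(RdmaEndpoint("decode-0", "10.0.0.5:4791", 0x7F0000000000))
print(sorted(pw.endpoints))
```

Because the base GPU pointer is known up front, prefill can later compute destination addresses for KV pages without a round trip to the decode worker.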
Per-Request Flow
- Request initiation: Client sends request to decode worker
- Bootstrap room allocation: Decode forwards to prefill and receives a bootstrap_room ID for coordination
- Memory allocation: Decode allocates GPU memory pages for incoming KV cache
- Prefill execution: Prefill worker processes the prompt and generates KV cache
- KV transfer: Prefill uses RDMA to write KV cache directly to decode’s GPU memory (while decode polls for completion)
- Cleanup: Prefill deallocates transfer metadata after confirming completion
- Decode phase: Decode worker generates tokens using the transferred KV cache
- Streaming: Tokens are streamed back to the client as they’re generated
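The per-request steps above can be condensed into a toy coroutine pair. Everything here is a stand-in (the room counter, the dict acting as decode's pre-allocated KV pages, the fake tokens); the real path uses RDMA writes and SGLang's bootstrap server rather than shared Python objects:

```python
import asyncio
import itertools

_room_ids = itertools.count(1)

async def prefill_side(prompt, kv_slot):
    room = next(_room_ids)           # bootstrap_room ID for coordination
    kv_slot["kv"] = f"kv({prompt})"  # stand-in for the RDMA write into decode's GPU memory
    kv_slot["done"] = True           # completion flag the decode side polls
    return room

async def decode_side(prompt):
    kv_slot = {"done": False}        # stand-in for pre-allocated GPU pages
    transfer = asyncio.create_task(prefill_side(prompt, kv_slot))
    while not kv_slot["done"]:       # decode polls while prefill transfers
        await asyncio.sleep(0)
    room = await transfer
    # Decode phase: generate tokens from the transferred KV cache.
    return room, [f"tok{i}" for i in range(3)]

room, toks = asyncio.run(decode_side("hello"))
print(room, toks)
```

The polling loop is the key structural point: decode never blocks on a message from prefill; it watches its own memory for the completion signal, which is what makes the transfer and decode-side bookkeeping overlap.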
Performance Characteristics
- RDMA transfer: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
- Parallel operations: Decode can poll while prefill transfers data
- One-time setup: RDMA connections established once, reused for all requests