Disaggregated Serving | NVIDIA Dynamo Documentation

Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register prefill workers with WorkerType.Prefill, the frontend automatically detects them and activates an internal prefill router.

For the high-level deployment matrix, see Router Guide. For the router flags used in this setup, see Configuration and Tuning.

If prefill and decode workers span topology domains such as zones or racks, use Topology-Aware KV Transfer to constrain or bias decode routing toward workers in the selected prefill worker’s transfer domain.

Automatic Prefill Router Activation

The prefill router is created automatically once both of the following are true:

A decode model is registered, for example via register_model() with ModelType.Chat | ModelType.Completions.
A prefill worker is detected with the same model name and WorkerType.Prefill.
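
The two activation conditions can be modeled as a simple check keyed on the model name. The sketch below is illustrative only: WorkerKind and ActivationTracker are hypothetical stand-ins written for this example, not Dynamo classes, and the real frontend's discovery mechanism may work differently.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class WorkerKind(Enum):
    """Hypothetical stand-in for the worker types a frontend might observe."""
    DECODE = auto()
    PREFILL = auto()


@dataclass
class ActivationTracker:
    """Records which worker kinds have registered under each model name."""
    seen: dict = field(default_factory=dict)

    def register(self, model_name: str, kind: WorkerKind) -> bool:
        """Record a registration and report whether the prefill router is active."""
        self.seen.setdefault(model_name, set()).add(kind)
        return self.prefill_router_active(model_name)

    def prefill_router_active(self, model_name: str) -> bool:
        """Active only when decode and prefill share the same model name."""
        kinds = self.seen.get(model_name, set())
        return {WorkerKind.DECODE, WorkerKind.PREFILL} <= kinds


tracker = ActivationTracker()
decode_only = tracker.register("meta-llama/Llama-2-7b-hf", WorkerKind.DECODE)
both = tracker.register("meta-llama/Llama-2-7b-hf", WorkerKind.PREFILL)
other = tracker.prefill_router_active("some-other-model")
print(decode_only, both, other)
```

In this model, registering only the decode worker leaves the router inactive, and a prefill worker registered under a different model name would not activate it either.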

Key characteristics of the prefill router:

Always disables active block tracking (track_active_blocks=false) since prefill workers do not perform decode.
Sits in the request pipeline between preprocessing and decode routing.
Falls back to decode-only mode if prefill fails or no prefill workers are available.
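
The fallback behavior can be pictured as a try-then-continue step ahead of decode. This is a minimal sketch under stated assumptions: route_request, the worker callables, and the shape of the returned parameters are all hypothetical and exist only for illustration. It does not reproduce the router's internal implementation.

```python
import asyncio


async def route_request(tokens, prefill_workers, decode):
    """Run prefill if possible, then decode; fall back to decode-only otherwise."""
    disaggregated_params = None
    mode = "decode-only"
    if prefill_workers:
        try:
            # Hypothetical selection policy: just take the first worker.
            disaggregated_params = await prefill_workers[0](tokens)
            mode = "disaggregated"
        except Exception:
            # Prefill failed: continue without transferred KV state.
            disaggregated_params = None
    result = await decode(tokens, disaggregated_params)
    return mode, result


async def healthy_prefill(tokens):
    return {"kv_blocks": len(tokens)}


async def failing_prefill(tokens):
    raise RuntimeError("prefill worker unavailable")


async def decode(tokens, params):
    source = "transferred KV" if params else "local prefill"
    return f"decoded {len(tokens)} tokens using {source}"


ok = asyncio.run(route_request([1, 2, 3], [healthy_prefill], decode))
failed = asyncio.run(route_request([1, 2, 3], [failing_prefill], decode))
empty = asyncio.run(route_request([1, 2, 3], [], decode))
print(ok, failed, empty)
```

Both the failure case and the empty-pool case end in the same decode-only path, so the request still completes.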

Key characteristics of the decode routing stage in disaggregated mode:

Disables overlap scoring (overlap_score_credit=0) because decode routing should not chase prefix reuse.
Disables KV reuse assumption (assume_kv_reuse=false) unless the backend can truly deduplicate transferred blocks.
Disables prefill-token tracking (track_prefill_tokens=false) so decode-side load reflects decode work rather than already-completed prompt work.
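
The per-stage overrides listed above can be summarized as two transformations of a shared base configuration. The sketch below assumes a hypothetical RouterSettings container whose field names mirror the flags in this section. The default values are placeholders chosen for the example, not Dynamo's actual defaults.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RouterSettings:
    """Hypothetical container; defaults here are illustrative placeholders."""
    overlap_score_credit: float = 1.0
    assume_kv_reuse: bool = True
    track_prefill_tokens: bool = True
    track_active_blocks: bool = True


def prefill_stage(base: RouterSettings) -> RouterSettings:
    """Prefill workers do not decode, so active block tracking is turned off."""
    return replace(base, track_active_blocks=False)


def decode_stage(base: RouterSettings,
                 backend_deduplicates_blocks: bool = False) -> RouterSettings:
    """Decode routing ignores prefix overlap and already-completed prompt work."""
    return replace(
        base,
        overlap_score_credit=0.0,
        # Keep the reuse assumption only if the backend truly deduplicates.
        assume_kv_reuse=base.assume_kv_reuse and backend_deduplicates_blocks,
        track_prefill_tokens=False,
    )


base = RouterSettings()
prefill_cfg = prefill_stage(base)
decode_cfg = decode_stage(base)
dedup_cfg = decode_stage(base, backend_deduplicates_blocks=True)
print(prefill_cfg)
print(decode_cfg)
```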

Setup Example

Once both workers are registered under the same model name, the frontend routes each request through prefill and then decode automatically.

# Decode worker registration (in your decode worker)
decode_endpoint = runtime.endpoint("dynamo.decode.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    worker_type=WorkerType.Decode,
    needs=[[WorkerType.Prefill]],
    # ... other parameters
)

await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.endpoint("dynamo.prefill.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Empty,  # prefill workers expose no OpenAI surface
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    worker_type=WorkerType.Prefill,
    needs=[[WorkerType.Decode]],
    # ... other parameters
)

await prefill_endpoint.serve_endpoint(prefill_handler.generate)

The automatic disaggregated routing setup described here is currently supported by the integrated dynamo.frontend path. It is not provided as a single turnkey mode by the standalone Python router (python -m dynamo.router). If you build this topology with standalone routers, you must launch and connect the prefill and decode routing stages yourself and handle request handoff, including the disaggregated_params returned by prefill. For an advanced reference, see the Global Router, which composes local prefill and decode router pools explicitly.

Request Flow

The following diagram shows an overview of the major components in disaggregated serving:

When topology-aware KV transfer is enabled, the prefill router also derives decode RoutingConstraints from the selected prefill worker’s runtime topology metadata before the request enters the decode router.
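
One way to picture the constraint derivation is as a lookup on the selected prefill worker's topology metadata, followed by a filter or reordering of decode candidates. The sketch below is an assumption-laden illustration: Worker, DecodeConstraint, and both functions are hypothetical, and the real RoutingConstraints type and metadata format may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Worker:
    worker_id: int
    topology: dict  # hypothetical metadata, e.g. {"zone": "a", "rack": "r1"}


@dataclass(frozen=True)
class DecodeConstraint:
    """Hypothetical stand-in for a decode routing constraint."""
    domain_key: str
    domain_value: str
    required: bool  # True: constrain; False: only bias the ordering


def derive_constraint(prefill_worker: Worker, domain_key: str,
                      required: bool) -> Optional[DecodeConstraint]:
    """Build a constraint from the prefill worker's transfer domain, if known."""
    value = prefill_worker.topology.get(domain_key)
    if value is None:
        return None
    return DecodeConstraint(domain_key, value, required)


def candidate_decode_workers(workers, constraint):
    """Filter (required) or reorder (bias) decode workers by transfer domain."""
    if constraint is None:
        return list(workers)
    matching = [w for w in workers
                if w.topology.get(constraint.domain_key) == constraint.domain_value]
    if constraint.required:
        return matching
    others = [w for w in workers if w not in matching]
    return matching + others


prefill = Worker(1, {"zone": "a"})
decoders = [Worker(10, {"zone": "b"}), Worker(11, {"zone": "a"})]
strict = candidate_decode_workers(decoders, derive_constraint(prefill, "zone", True))
biased = candidate_decode_workers(decoders, derive_constraint(prefill, "zone", False))
print([w.worker_id for w in strict], [w.worker_id for w in biased])
```

The required flag captures the difference between constraining decode routing to the prefill worker's domain and merely preferring it.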