Running SGLang with Dynamo
Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes. You can find the latest release on the GitHub releases page and check out the corresponding branch.
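For example, one generic way to check out the most recent tag (a standard git idiom, not a Dynamo-specific command):

```bash
# check out the most recently created tag in the repository
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```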
Table of Contents
- Feature Support Matrix
- Dynamo SGLang Integration
- Installation
- Quick Start
- Single Node Examples
- Multi-Node and Advanced Examples
- Deploy on SLURM or Kubernetes
Feature Support Matrix
Core Dynamo Features
Dynamo SGLang Integration
Dynamo SGLang integrates SGLang engines into Dynamo’s distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang’s engine arguments.
Argument Handling
Dynamo SGLang uses SGLang's native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
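For example, a worker launch might look like the following sketch, where the model and parallelism values are purely illustrative:

```bash
# SGLang engine arguments are forwarded as-is to the worker
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --tp 2 --trust-remote-code
```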
Dynamo-Specific Arguments
Tokenizer Behavior
- Default (`--use-sglang-tokenizer` not set): Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
- With `--use-sglang-tokenizer`: SGLang handles tokenization/detokenization, and Dynamo passes raw prompts
[!NOTE] When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
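To hand tokenization over to SGLang, add the flag to the worker launch; a sketch (model name illustrative):

```bash
# SGLang, not Dynamo's frontend, now tokenizes prompts and detokenizes outputs
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --use-sglang-tokenizer
```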
Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
Cancellation Support Matrix
[!WARNING] ⚠️ The SGLang backend currently does not support cancellation during the remote prefill phase in disaggregated mode.
For more details, see the Request Cancellation Architecture documentation.
Installation
Install latest release
We suggest using `uv` to install the latest release of `ai-dynamo[sglang]`. You can install `uv` with `curl -LsSf https://astral.sh/uv/install.sh | sh`.
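A minimal sketch of the install steps (the virtual environment name is arbitrary):

```bash
# create and activate a virtual environment, then install the sglang extra
uv venv venv
source venv/bin/activate
uv pip install "ai-dynamo[sglang]"
```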
Install editable version for development
This requires having Rust installed. We also recommend a proper installation of the CUDA toolkit, as SGLang requires `nvcc` to be available.
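A rough sketch of an editable install from a checkout of the repository; treat this as an outline rather than the canonical procedure, since the exact build steps may differ:

```bash
# from the root of a dynamo checkout, with Rust and the CUDA toolkit available
uv venv venv && source venv/bin/activate
uv pip install -e ".[sglang]"
```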
Using docker containers
We are in the process of shipping pre-built Docker containers that include installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source and then run it.
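A sketch of both steps, assuming the repository ships `container/build.sh` and `container/run.sh` helper scripts with a `--framework` switch:

```bash
# build the SGLang container from source
./container/build.sh --framework sglang

# then run it interactively
./container/run.sh --framework sglang -it
```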
Quick Start
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
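The compose file path below is an assumption (repo root, under `deploy/`); adjust it to your checkout:

```bash
# start etcd and NATS in the background
docker compose -f deploy/docker-compose.yml up -d
```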
[!NOTE]
- etcd is optional but is the default local discovery backend. You can also use `--kv_store file` to use file-system-based discovery.
- NATS is optional - only needed if using KV routing with events (the default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
- On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
[!TIP] Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can easily take each command and run it in a separate terminal.
Additionally - because we use SGLang's argument parser, you can pass in any argument that SGLang supports to the worker!
Aggregated Serving
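A minimal sketch of the pattern; the module entry points follow this doc, while the model, port, and backgrounding are illustrative:

```bash
# OpenAI-compatible frontend, processor, and (optional) router
python -m dynamo.frontend --http-port 8000 &

# one aggregated SGLang worker handling both prefill and decode
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --tp 1
```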
Aggregated Serving with KV Routing
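The same pattern with KV-aware routing enabled on the frontend; the `--router-mode kv` flag here is an assumption, so check your frontend's `--help` output:

```bash
# the router picks workers based on KV-cache overlap with each request
python -m dynamo.frontend --http-port 8000 --router-mode kv &
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --tp 1
```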
Aggregated Serving for Embedding Models
Here’s an example that uses the Qwen/Qwen3-Embedding-4B model.
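A sketch of the launch, assuming SGLang's `--is-embedding` engine argument is forwarded to the worker like any other:

```bash
python -m dynamo.frontend --http-port 8000 &

# serve the model in embedding mode instead of text generation
python -m dynamo.sglang --model-path Qwen/Qwen3-Embedding-4B --is-embedding
```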
Send the following request to verify your deployment:
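A sketch of such a request, using the standard OpenAI embeddings schema:

```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```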
Disaggregated Serving
See SGLang Disaggregation to learn more about how SGLang and Dynamo handle disaggregated serving.
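A single-node sketch with one prefill and one decode worker. The disaggregation flags mirror SGLang's PD-disaggregation arguments; the GPU pinning, model, and transfer backend are illustrative:

```bash
python -m dynamo.frontend --http-port 8000 &

# prefill worker on GPU 0
CUDA_VISIBLE_DEVICES=0 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B \
  --disaggregation-mode prefill --disaggregation-transfer-backend nixl &

# decode worker on GPU 1
CUDA_VISIBLE_DEVICES=1 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B \
  --disaggregation-mode decode --disaggregation-transfer-backend nixl
```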
Disaggregated Serving with KV Aware Prefill Routing
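This combines the disaggregated layout above with the KV-aware router; a sketch, again assuming the `--router-mode kv` frontend flag:

```bash
# KV-aware routing decides which prefill worker sees each request
python -m dynamo.frontend --http-port 8000 --router-mode kv &

# prefill and decode workers launch exactly as in the previous example
```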
Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention
You can use this configuration to test disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
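A sketch of the worker flags involved; the checkpoint and parallelism sizes are illustrative, and `--enable-dp-attention`, `--dp-size`, and `--ep-size` are SGLang engine arguments forwarded to the workers:

```bash
python -m dynamo.frontend --http-port 8000 &

# prefill worker with DP attention and expert parallelism
python -m dynamo.sglang --model-path deepseek-ai/DeepSeek-V2-Lite --trust-remote-code \
  --tp 4 --enable-dp-attention --dp-size 4 --ep-size 4 \
  --disaggregation-mode prefill --disaggregation-transfer-backend nixl &

# decode worker with the same parallel layout
python -m dynamo.sglang --model-path deepseek-ai/DeepSeek-V2-Lite --trust-remote-code \
  --tp 4 --enable-dp-attention --dp-size 4 --ep-size 4 \
  --disaggregation-mode decode --disaggregation-transfer-backend nixl
```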
Testing the Deployment
Send a test request to verify your deployment:
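A sketch using the OpenAI chat-completions schema (the model name must match what the worker is serving):

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Why is disaggregated serving useful?"}],
    "max_tokens": 64,
    "stream": false
  }'
```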
Deployment
We currently provide deployment examples for Kubernetes and SLURM.