Running SGLang with Dynamo

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:

You can find the latest release on the GitHub releases page and check out the corresponding tag with:

$git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Dynamo SGLang Integration

Dynamo SGLang integrates SGLang engines into Dynamo’s distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang’s engine arguments.

Argument Handling

Dynamo SGLang uses SGLang’s native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like --model-path, --tp, --trust-remote-code) directly to dynamo.sglang.
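For example (an illustrative invocation; the model and flag values here are arbitrary, not required):

$python -m dynamo.sglang \
> --model-path Qwen/Qwen3-0.6B \
> --tp 2 \
> --trust-remote-code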

Dynamo-Specific Arguments

| Argument | Description | Default | SGLang Equivalent |
|---|---|---|---|
| --endpoint | Dynamo endpoint in dyn://namespace.component.endpoint format | Auto-generated based on mode | N/A |
| --migration-limit | Max times a request can migrate between workers for fault tolerance. See Request Migration Architecture. | 0 (disabled) | N/A |
| --dyn-tool-call-parser | Tool call parser for structured outputs (takes precedence over --tool-call-parser) | None | --tool-call-parser |
| --dyn-reasoning-parser | Reasoning parser for CoT models (takes precedence over --reasoning-parser) | None | --reasoning-parser |
| --use-sglang-tokenizer | Use SGLang's tokenizer instead of Dynamo's | False | N/A |
| --custom-jinja-template | Use a custom chat template for the model (takes precedence over the default chat template in the model repo) | None | --chat-template |
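For example, to enable request migration and name the worker's endpoint explicitly (a sketch; the endpoint name is just an illustration of the dyn:// format, and the migration limit is arbitrary):

$python -m dynamo.sglang \
> --model-path Qwen/Qwen3-0.6B \
> --endpoint dyn://my-namespace.sglang.generate \
> --migration-limit 3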

Tokenizer Behavior

  • Default (--use-sglang-tokenizer not set): Dynamo handles tokenization/detokenization via our blazing fast frontend and passes input_ids to SGLang
  • With --use-sglang-tokenizer: SGLang handles tokenization/detokenization, Dynamo passes raw prompts

[!NOTE] When using --use-sglang-tokenizer, only v1/chat/completions is available through Dynamo’s frontend.
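If you want SGLang to own tokenization, add the flag to any launch command (a minimal sketch; remember this limits Dynamo's frontend to v1/chat/completions):

$python -m dynamo.sglang \
> --model-path Qwen/Qwen3-0.6B \
> --use-sglang-tokenizer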

Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
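One easy way to observe this is to abort a streaming request from the client side; for example, curl's standard --max-time option (shown here purely for illustration) disconnects after two seconds, which triggers cancellation on the worker:

$curl --max-time 2 localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Write a very long story"}], "stream": true}'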

Cancellation Support Matrix

| | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ⚠️ | ✅ |

[!WARNING] ⚠️ SGLang backend currently does not support cancellation during remote prefill phase in disaggregated mode.

For more details, see the Request Cancellation Architecture documentation.

Installation

Install the latest release

We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install uv with:

$curl -LsSf https://astral.sh/uv/install.sh | sh

$# create a virtual env
$uv venv --python 3.12 --seed
$# install the latest release (which comes bundled with a stable sglang version)
$uv pip install "ai-dynamo[sglang]"

Install editable version for development

This requires having Rust installed. We also recommend a proper installation of the CUDA Toolkit, as SGLang requires nvcc to be available.

$# create a virtual env
$uv venv --python 3.12 --seed
$# build dynamo runtime bindings
$uv pip install maturin
$cd $DYNAMO_HOME/lib/bindings/python
$maturin develop --uv
$cd $DYNAMO_HOME
$# installs the supported sglang version along with dynamo
$# include the --prerelease=allow flag to install flashinfer rc versions
$uv pip install -e .
$# install any sglang version >= 0.5.3.post2
$uv pip install "sglang[all]==0.5.3.post2"

Using docker containers

We are in the process of shipping pre-built Docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D (prefill/decode disaggregation). For now, you can quickly build the container from source with the following command:

$cd $DYNAMO_HOME
$./container/build.sh \
> --framework SGLANG \
> --tag dynamo-sglang:latest

And then run it using

$docker run \
> --gpus all \
> -it \
> --rm \
> --network host \
> --shm-size=10G \
> --ulimit memlock=-1 \
> --ulimit stack=67108864 \
> --ulimit nofile=65536:65536 \
> --cap-add CAP_SYS_PTRACE \
> --ipc host \
> dynamo-sglang:latest

Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

Start Infrastructure Services (Local Development Only)

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

$docker compose -f deploy/docker-compose.yml up -d

[!NOTE]

  • etcd is optional but is the default local discovery backend. You can also use --kv_store file for file-system-based discovery.
  • NATS is optional; it is only needed when using KV routing with events (the default). You can disable it with the --no-kv-events flag to use prediction-based routing.
  • On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD)

[!TIP] Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can easily take each command and run it in a separate terminal, as sketched below.

Additionally, because we use SGLang's argument parser, you can pass any argument that SGLang supports to the worker!
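To give a sense of what these scripts do, a minimal aggregated setup looks roughly like this (a sketch assuming default ports; see the scripts themselves for the exact flags they pass):

$# terminal 1: OpenAI-compatible frontend, processor, and optional router
$python -m dynamo.frontend --http-port 8000
$# terminal 2: SGLang worker/engine
$python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B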

Aggregated Serving

$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/agg.sh

Aggregated Serving with KV Routing

$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/agg_router.sh

Aggregated Serving for Embedding Models

Here’s an example that uses the Qwen/Qwen3-Embedding-4B model.

$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/agg_embed.sh
$curl localhost:8000/v1/embeddings \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-Embedding-4B",
> "input": "Hello, world!"
> }'

Disaggregated Serving

See SGLang Disaggregation to learn more about how SGLang and Dynamo handle disaggregated serving.

$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/disagg.sh

Disaggregated Serving with KV Aware Prefill Routing

$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/disagg_router.sh

Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

You can use this configuration to test disaggregated serving with DP attention and expert parallelism on a single node before scaling up to the full DeepSeek-R1 model across multiple nodes.

$# note this will require 4 GPUs
$cd $DYNAMO_HOME/examples/backends/sglang
$./launch/disagg_dp_attn.sh

Testing the Deployment

Send a test request to verify your deployment:

$curl localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{
> "model": "Qwen/Qwen3-0.6B",
> "messages": [
> {
> "role": "user",
> "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
> }
> ],
> "stream": true,
> "max_tokens": 30
> }'

Deployment

We currently provide deployment examples for Kubernetes and SLURM.

Kubernetes

SLURM