Enable SGLang Hierarchical Cache (HiCache)


This guide shows how to enable SGLang’s Hierarchical Cache (HiCache) inside Dynamo.

1) Start the SGLang worker with HiCache enabled

```bash
python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --log-level debug \
  --skip-tokenizer-init
```
- `--enable-hierarchical-cache`: Enables hierarchical KV cache offloading.
- `--hicache-ratio`: Ratio of the host (CPU) KV cache pool size to the device (GPU) pool size. Lower this value if your machine has less CPU memory.
- `--hicache-write-policy`: Write policy (e.g., `write_through` for synchronous host writes).
- `--hicache-storage-backend`: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see PR #8488.
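
To build intuition for `--hicache-ratio`, the host pool is sized as a multiple of the device pool. The sketch below illustrates the arithmetic; the pool sizes are made-up example numbers, not values measured from SGLang:

```python
# Illustrative sizing for --hicache-ratio: the host (CPU) KV cache pool is
# ratio * the device (GPU) KV cache pool. Pool sizes here are assumptions
# for illustration only.
def host_pool_gib(device_pool_gib: float, hicache_ratio: float) -> float:
    """Host KV cache pool size implied by --hicache-ratio."""
    return device_pool_gib * hicache_ratio

# e.g., a 20 GiB device KV pool with --hicache-ratio 2 needs ~40 GiB of
# host memory for the HiCache pool; lower the ratio if CPU RAM is tight.
print(host_pool_gib(20, 2))
```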

Then, start the frontend:

```bash
python -m dynamo.frontend --http-port 8000
```
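
Before sending traffic, you may want to wait until the frontend is accepting requests. A minimal polling sketch, assuming the frontend exposes an OpenAI-compatible `/v1/models` endpoint on port 8000 (adjust the URL if your deployment differs):

```python
# Readiness check (sketch): poll the frontend until it answers HTTP 200.
# The /v1/models endpoint and port are assumptions based on the
# OpenAI-compatible API; adjust for your deployment.
import time
import urllib.error
import urllib.request

def wait_for_frontend(url: str = "http://localhost:8000/v1/models",
                      timeout_s: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # frontend not up yet; retry
    return False
```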

2) Send a single request

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
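
The same request can be issued from Python with only the standard library. This is a sketch mirroring the curl call above; the payload follows the OpenAI-compatible chat completions schema:

```python
# Python equivalent of the curl request (sketch, stdlib only).
import json
import urllib.request

def build_chat_request(model: str, content: str,
                       max_tokens: int = 30, stream: bool = False) -> dict:
    """Build an OpenAI-compatible chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
        "max_tokens": max_tokens,
    }

def send_chat_request(payload: dict,
                      url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to the frontend and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request(
    "Qwen/Qwen3-0.6B",
    "Explain why Roger Federer is considered one of the greatest "
    "tennis players of all time",
)
# With the frontend running: print(send_chat_request(payload))
```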

3) (Optional) Benchmarking

Run the perf script:

```bash
bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
  --model Qwen/Qwen3-0.6B \
  --tensor-parallelism 1 \
  --data-parallelism 1 \
  --concurrency "2,4,8" \
  --input-sequence-length 2048 \
  --output-sequence-length 256
```
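
As rough sanity math for the sweep above: each request touches ISL + OSL tokens, so peak in-flight tokens scale with concurrency. A small illustrative sketch (actual KV memory per token depends on the model's layers, KV heads, head dimension, and dtype, none of which are computed here):

```python
# Illustrative token-budget math for the benchmark sweep (assumptions only;
# does not model real KV memory usage).
def tokens_per_request(isl: int, osl: int) -> int:
    """Input + output tokens handled per request."""
    return isl + osl

def sweep(concurrency_csv: str) -> list[int]:
    """Expand the --concurrency "2,4,8" argument into a list."""
    return [int(c) for c in concurrency_csv.split(",")]

for c in sweep("2,4,8"):
    # peak in-flight tokens at this concurrency level
    print(c, c * tokens_per_request(2048, 256))
```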