KVBM Guide | NVIDIA Dynamo Documentation

The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.

KVBM is modular and can be used standalone via pip install kvbm or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.

Quick start with the pre-built NGC container

The fastest path is the published Dynamo container, which includes KVBM:

$ docker run --gpus all --rm -it \
>   nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1 \
>   /bin/bash

For installation from source or custom builds, see Local Installation and Release Artifacts.

Run KVBM Standalone

KVBM can be used independently without using the rest of the Dynamo stack:

$ pip install kvbm

See the support matrix for version compatibility.

Build from Source

To build KVBM from source, see the detailed instructions in the KVBM bindings README.

Run KVBM in Dynamo with vLLM

Docker Setup

$ # Start up etcd for KVBM leader/worker registration and discovery
$ docker compose -f dev/docker-compose.yml up -d

Pick one of the following to get a Dynamo vLLM container with KVBM built in. The subsequent serving commands are the same either way.

Option A: Pre-built NGC container (recommended for quick start)

$ docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1

See the Local Installation Guide for full setup instructions and Release Artifacts for available versions.

Option B: Build from source

$ # Build a dynamo vLLM container (KVBM is built in by default)
$ # x86_64
$ python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/amd64
$ docker buildx build --platform linux/amd64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
$ # arm64 (Grace, Jetson, arm64 EC2)
$ python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/arm64
$ docker buildx build --platform linux/arm64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
$ 
$ # Launch the container
$ container/run.sh --image dynamo:latest-vllm-runtime -it --mount-workspace --use-nixl-gds

Aggregated Serving

$ cd $DYNAMO_HOME/examples/backends/vllm
$ ./launch/agg_kvbm.sh

Verify Deployment

$ curl localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-0.6B",
>     "messages": [{"role": "user", "content": "Hello, how are you?"}],
>     "stream": false,
>     "max_tokens": 10
>   }'

Alternative: Using Direct vllm serve

You can also use vllm serve directly with KVBM:

$ vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B

Run KVBM in Dynamo with TensorRT-LLM

Prerequisites:

Ensure etcd and nats are running before starting
KVBM only supports TensorRT-LLM’s PyTorch backend
Disable partial reuse (enable_partial_reuse: false) to increase offloading cache hits
KVBM requires TensorRT-LLM v1.2.0rc2 or newer

Docker Setup

$ # Start up etcd for KVBM leader/worker registration and discovery
$ docker compose -f dev/docker-compose.yml up -d

Pick one of the following to get a Dynamo TensorRT-LLM container with KVBM built in. The subsequent serving commands are the same either way.

Option A: Pre-built NGC container (recommended for quick start)

$ docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.1

See the Local Installation Guide for full setup instructions and Release Artifacts for available versions.

Option B: Build from source

$ # Build a dynamo TRTLLM container (KVBM is built in by default)
$ # x86_64
$ python container/render.py --framework trtllm --target runtime --output-short-filename --cuda-version=13.1 --platform linux/amd64
$ docker buildx build --platform linux/amd64 -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile .
$ # arm64 with NVIDIA GPUs (GH200, GB200, P6e-GB200 UltraServer — *not* generic Graviton instances, which have no GPU)
$ python container/render.py --framework trtllm --target runtime --output-short-filename --cuda-version=13.1 --platform linux/arm64
$ docker buildx build --platform linux/arm64 -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile .
$ 
$ # Launch the container
$ container/run.sh --image dynamo:latest-trtllm-runtime -it --mount-workspace --use-nixl-gds

Aggregated Serving

$ # Write the LLM API config
$ cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
$ backend: pytorch
$ cuda_graph_config: null
$ kv_cache_config:
$   enable_partial_reuse: false
$   free_gpu_memory_fraction: 0.80
$ kv_connector_config:
$   connector_module: kvbm.trtllm_integration.connector
$   connector_scheduler_class: DynamoKVBMConnectorLeader
$   connector_worker_class: DynamoKVBMConnectorWorker
$ EOF
$ 
$ # Start dynamo frontend
$ python3 -m dynamo.frontend --http-port 8000 &
$ 
$ # Serve the model with KVBM
$ python3 -m dynamo.trtllm \
>   --model-path Qwen/Qwen3-0.6B \
>   --served-model-name Qwen/Qwen3-0.6B \
>   --extra-engine-args /tmp/kvbm_llm_api_config.yaml &

Verify Deployment

$ curl localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-0.6B",
>     "messages": [{"role": "user", "content": "Hello, how are you?"}],
>     "stream": false,
>     "max_tokens": 30
>   }'

Alternative: Using trtllm-serve

$ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml

Run Dynamo with SGLang HiCache

SGLang’s Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo’s memory infrastructure.

Quick Start

$ # Start SGLang worker with HiCache enabled
$ python -m dynamo.sglang \
>   --model-path Qwen/Qwen3-0.6B \
>   --host 0.0.0.0 --port 8000 \
>   --enable-hierarchical-cache \
>   --hicache-ratio 2 \
>   --hicache-write-policy write_through \
>   --hicache-storage-backend nixl
$ 
$ # In a separate terminal, start the frontend
$ python -m dynamo.frontend --http-port 8000
$ 
$ # Send a test request
$ curl localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-0.6B",
>     "messages": [{"role": "user", "content": "Hello!"}],
>     "stream": false,
>     "max_tokens": 30
>   }'

Learn more: See the SGLang HiCache Integration Guide for detailed configuration, deployment examples, and troubleshooting.

Disaggregated Serving with KVBM

KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.

Disaggregated Serving with vLLM

$ # 1P1D - one prefill worker and one decode worker
$ # NOTE: requires at least 2 GPUs
$ cd $DYNAMO_HOME/examples/backends/vllm
$ ./launch/disagg_kvbm.sh
$ 
$ # 2P2D - two prefill workers and two decode workers
$ # NOTE: requires at least 4 GPUs
$ cd $DYNAMO_HOME/examples/backends/vllm
$ ./launch/disagg_kvbm_2p2d.sh

Disaggregated Serving with TRT-LLM

$ # Launch prefill worker with KVBM
$ python3 -m dynamo.trtllm \
>   --model-path Qwen/Qwen3-0.6B \
>   --served-model-name Qwen/Qwen3-0.6B \
>   --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
>   --disaggregation-mode prefill &

Configuration

Cache Tier Configuration

Configure KVBM cache tiers using environment variables:

$ # Option 1: CPU cache only (GPU -> CPU offloading)
$ export DYN_KVBM_CPU_CACHE_GB=4  # 4GB of pinned CPU memory
$ 
$ # Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
$ export DYN_KVBM_CPU_CACHE_GB=4
$ export DYN_KVBM_DISK_CACHE_GB=8  # 8GB of disk
$ 
$ # [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
$ # NOTE: Experimental, may not provide optimal performance
$ # NOTE: Disk offload filtering not supported with this option
$ export DYN_KVBM_DISK_CACHE_GB=8

You can also specify exact block counts instead of GB:

DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS
DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS

[!NOTE] KVBM is a write-through cache and it is possible to misconfigure. Each of the capacities should increase as you enable more tiers. As an example, if you configure your GPU device to have 100GB of memory dedicated for KV cache storage, then configure DYN_KVBM_CPU_CACHE_GB >= 100. The same goes for configuring the disk cache; DYN_KVBM_DISK_CACHE_GB >= DYN_KVBM_CPU_CACHE_GB. If the cpu cache is configured to be less than the device cache, then there will be no benefit from KVBM. In many cases you will see performance degradation as KVBM will churn by offloading blocks from the GPU to CPU after every forward pass. To know what your minimum value for DYN_KVBM_CPU_CACHE_GB should be for your setup, consult your llm engine’s kv cache configuration.

SSD Lifespan Protection

When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.

To disable disk offload filtering:

$ export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true

NCCL Replicated Mode for MLA Models

For MLA (Multi-Layer Attention) models such as DeepSeek, KVBM can use NCCL replicated mode so that only rank 0 loads KV blocks from G2/G3 storage and then broadcasts them to all GPUs via NCCL. This avoids redundant loads and can improve performance when multiple GPUs share the same replicated KV cache.

Enable NCCL MLA mode:

$ export DYN_KVBM_NCCL_MLA_MODE=true

Requirements:

MPI must be initialized (e.g., when launching with mpirun or equivalent) so that rank and world size are available for NCCL.
For optimal broadcast-based replication, build KVBM with the NCCL feature: cargo build -p kvbm --features nccl. Without it, the connector falls back to worker-level replication (each GPU loads independently).

When disabled (default), each GPU loads KV blocks independently. Set DYN_KVBM_NCCL_MLA_MODE=true when running MLA models with KVBM to use the NCCL broadcast optimization.

Enable and View KVBM Metrics

Setup Monitoring Stack

$ # Start basic services (etcd & natsd), along with Prometheus and Grafana
$ docker compose -f dev/docker-observability.yml up -d

Enable Metrics for vLLM

$ DYN_KVBM_METRICS=true \
> DYN_KVBM_CPU_CACHE_GB=20 \
> python -m dynamo.vllm \
>     --model Qwen/Qwen3-0.6B \
>     --enforce-eager \
>     --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_connector_module_path":"kvbm.vllm_integration.connector","kv_role":"kv_both"}'

Enable Metrics for TensorRT-LLM

$ DYN_KVBM_METRICS=true \
> DYN_KVBM_CPU_CACHE_GB=20 \
> python3 -m dynamo.trtllm \
>   --model-path Qwen/Qwen3-0.6B \
>   --served-model-name Qwen/Qwen3-0.6B \
>   --extra-engine-args /tmp/kvbm_llm_api_config.yaml &

Firewall Configuration (Optional)

$ # If firewall blocks KVBM metrics ports
$ sudo ufw allow 6880/tcp

View Metrics

Access Grafana at http://localhost:3000 (default login: dynamo/dynamo) and look for the KVBM Dashboard.

Available Metrics

Metric	Description
`kvbm_matched_tokens`	Number of matched tokens
`kvbm_offload_blocks_d2h`	Offload blocks from device to host
`kvbm_offload_blocks_h2d`	Offload blocks from host to disk
`kvbm_offload_blocks_d2d`	Offload blocks from device to disk (bypassing host)
`kvbm_onboard_blocks_d2d`	Onboard blocks from disk to device
`kvbm_onboard_blocks_h2d`	Onboard blocks from host to device
`kvbm_host_cache_hit_rate`	Host cache hit rate (0.0-1.0)
`kvbm_disk_cache_hit_rate`	Disk cache hit rate (0.0-1.0)

Benchmarking KVBM

Use LMBenchmark to evaluate KVBM performance.

Setup

$ git clone https://github.com/LMCache/LMBenchmark.git
$ cd LMBenchmark/synthetic-multi-round-qa

Run Benchmark

$ # Synthetic multi-turn chat dataset
$ # Arguments: model, endpoint, output prefix, qps
$ ./long_input_short_output_run.sh \
>     "Qwen/Qwen3-0.6B" \
>     "http://localhost:8000" \
>     "benchmark_kvbm" \
>     1

Average TTFT and other performance numbers will be in the output.

TIP: If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.

Baseline Comparison

vLLM Baseline (without KVBM)

$ vllm serve Qwen/Qwen3-0.6B

TensorRT-LLM Baseline (without KVBM)

$ # Create config without kv_connector_config
$ cat > "/tmp/llm_api_config.yaml" <<EOF
$ backend: pytorch
$ cuda_graph_config: null
$ kv_cache_config:
$   enable_partial_reuse: false
$   free_gpu_memory_fraction: 0.80
$ EOF
$ 
$ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml

Troubleshooting

No TTFT Performance Gain

Symptom: Enabling KVBM does not show TTFT improvement or causes performance degradation.

Cause: Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.

Solution: Enable KVBM metrics and check the Grafana dashboard for Onboard Blocks - Host to Device and Onboard Blocks - Disk to Device. Large numbers of onboarded KV blocks indicate good cache reuse:

Grafana Example

KVBM Worker Initialization Timeout

Symptom: KVBM fails to start when allocating large memory or disk storage.

Solution: Increase the leader-worker initialization timeout (default: 1800 seconds):

$ export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600  # 1 hour

Disk Offload Fails to Start

Symptom: KVBM fails to start when disk offloading is enabled.

Cause: fallocate() is not supported on the filesystem (e.g., Lustre, certain network filesystems), or the storage backend requires a different method for setting O_DIRECT.

Solution:

If fallocate() is not supported, enable the zerofill fallback:

$ export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true

If your filesystem ignores fcntl(F_SETFL, O_DIRECT) (e.g., IBM Storage Scale), set the disk allocator type to pass O_DIRECT at file open time instead:

$ export DYN_KVBM_DISK_ALLOCATOR_TYPE=open-direct

Supported values for DYN_KVBM_DISK_ALLOCATOR_TYPE:

default: Apply O_DIRECT via fcntl after file creation. Works on most POSIX filesystems (ext4, XFS, Lustre, etc.).
open-direct: Pass O_DIRECT to mkostemp at file open time. Required on filesystems where fcntl(F_SETFL, O_DIRECT) is ignored (e.g., IBM Storage Scale).

If you encounter “write all error” or EINVAL (errno 22), or need to debug without O_DIRECT:

$ export DYN_KVBM_DISK_DISABLE_O_DIRECT=true

Developing Locally

Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):

$ cd /workspace/lib/bindings/kvbm
$ uv pip install maturin[patchelf]
$ maturin build --release --out /workspace/dist
$ uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl

To use Nsight Systems for perf analysis, please follow below steps (using vLLM as example). KVBM has NVTX annotation on top level KV Connector APIs (search for @nvtx_annotate). If more is needed, please add then rebuild.

$ # build and run local-dev container, which contains nsys
$ python container/render.py --framework=vllm --target=local-dev --output-short-filename
$ docker build --build-arg USER_UID=$(id -u) --build-arg USER_GID=$(id -g) -f container/rendered.Dockerfile -t dynamo:latest-vllm-local-dev .
$ 
$ container/run.sh --image dynamo:latest-vllm-local-dev -it --mount-workspace --use-nixl-gds
$ 
$ # export nsys to PATH
$ # NOTE: change the version accordingly
$ export PATH=/opt/nvidia/nsight-systems/2025.5.1/bin:$PATH
$ 
$ # example usage of nsys: delay 30 seconds and then capture 60 seconds
$ python -m dynamo.frontend &
$ 
$ DYN_KVBM_CPU_CACHE_GB=10 \
> nsys profile -o /tmp/kvbm-nsys --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 \
> python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_connector_module_path":"kvbm.vllm_integration.connector","kv_role":"kv_both"}'

Quick start with the pre-built NGC container

Run KVBM Standalone

Build from Source

Run KVBM in Dynamo with vLLM

Docker Setup

Aggregated Serving

Verify Deployment

Alternative: Using Direct vllm serve

Run KVBM in Dynamo with TensorRT-LLM

Docker Setup

Aggregated Serving

Verify Deployment

Alternative: Using trtllm-serve

Run Dynamo with SGLang HiCache

Quick Start

Disaggregated Serving with KVBM

Disaggregated Serving with vLLM

Disaggregated Serving with TRT-LLM

Configuration

Cache Tier Configuration

SSD Lifespan Protection

NCCL Replicated Mode for MLA Models

Enable and View KVBM Metrics

Setup Monitoring Stack

Enable Metrics for vLLM

Enable Metrics for TensorRT-LLM

Firewall Configuration (Optional)

View Metrics

Available Metrics

Benchmarking KVBM

Setup

Run Benchmark

Baseline Comparison

vLLM Baseline (without KVBM)

TensorRT-LLM Baseline (without KVBM)

Troubleshooting

No TTFT Performance Gain

KVBM Worker Initialization Timeout

Disk Offload Fails to Start

Developing Locally

See Also