For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Installation
    • Support Matrix
    • Examples
  • Kubernetes Deployment
  • User Guides
    • Tool Calling
    • Multimodality Support
    • Finding Best Initial Configs
    • Benchmarking
    • Tuning Disaggregated Performance
    • Writing Python Workers in Dynamo
    • Glossary
  • Components
    • Router
      • Overview
      • Motivation
      • Architecture
      • Components
      • Design Deep Dive
      • Integrations
      • KVBM in vLLM
      • KVBM in TRTLLM
      • LMCache Integration
      • Further Reading
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • KVBM Components
  • Core
  • Layouts and Blocks
  • Transfer Manager
  • Storage & Pools
  • KVBM DataFlows
ComponentsKVBM

Understanding KVBM components

||View as Markdown|
Previous

KVBM Architecture

Next

KVBM components

KVBM design takes inspiration from the KV block managers used in vLLM and SGLang, with an added influence from historical memory tiering strategies common in general GPU programming. For more details, See KVBM Reading. The figure below illustrates the internal components of KVBM.

Internal Components of Dynamo KVBM. Internal Components of Dynamo KVBM

KVBM Components

Core

  • KvBlockManager: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
  • Scheduler: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
  • Config (config.rs): Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
  • KvBlockManagerState: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
  • Events/Metrics: Observability components emitting counters/gauges and event hooks for integration/testing.

Layouts and Blocks

  • LayoutConfig & LayoutType: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
  • Blocks & Metadata: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by sequence_hash.

Transfer Manager

  • TransferManager: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).

Storage & Pools

  • Device Pool(G1): GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
  • Host Pool(G2): CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
  • Disk Pool(G3): Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.

KVBM DataFlows

KVBM Data Flows. KVBM Data Flows from device to other memory hierarchies

Device → Host (Offload)

  • Triggered explicitly requested to offload by the connector scheduler.
  • Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy.
  • Host pool registers the new immutable block (dedup by sequence hash).

Host → Disk (Offload)

  • Local Disk: NIXL Write via POSIX; GDS when available.
  • Remote Disk (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching/concurrency identical.
  • Triggered on registered host blocks or explicit offload requests.
  • Worker allocates a Disk block and performs NIXL Write (Host→Disk).
  • Disk pool registers the new immutable block (dedup by sequence hash).

Host → Device (Onboard)

  • Called to bring a host block into GPU memory.
  • Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy.
  • Device pool registers the new immutable block.

Disk → Device (Onboard)

  • Called to bring a disk block directly into GPU memory.
  • Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS.
  • Device pool registers the new immutable block.