LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling prefill-once, reuse-everywhere semantics. As described in the official documentation, LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
This document describes how LMCache is integrated into Dynamo’s vLLM backend to provide enhanced performance and memory efficiency.
Important Note: LMCache integration currently only supports x86 architecture. ARM64 is not supported at this time.
LMCache is enabled by setting the ENABLE_LMCACHE environment variable:
Additional LMCache configuration can be customized via environment variables:
LMCACHE_CHUNK_SIZE=256 - Token chunk size for cache granularity (default: 256)
LMCACHE_LOCAL_CPU=True - Enable CPU memory backend for offloading
LMCACHE_MAX_LOCAL_CPU_SIZE=20 - CPU memory limit in GB (set a fixed value based on available RAM)
For advanced configurations, LMCache supports multiple storage backends:
Use the provided launch script for quick setup:
This will:
In aggregated mode, the system uses:
LMCacheConnectorV1
kv_both (handles both reading and writing)
Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.
The same ENABLE_LMCACHE=1 environment variable enables LMCache, but the system automatically configures different connector setups for prefill and decode workers.
Use the provided disaggregated launch script (the script requires at least 2 GPUs):
This will:
NixlConnector only for KV transfer between prefill and decode workers
MultiConnector with both LMCache and NIXL connectors. This enables the prefill worker to use LMCache for KV offloading and NIXL for KV transfer between prefill and decode workers.
--is-prefill-worker
The system automatically configures KV transfer based on the deployment mode and worker type:
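The selection logic described here can be sketched as a small function. This is an illustrative sketch, not the code Dynamo ships. The function name `select_kv_transfer_config` and its parameters are hypothetical. The dictionary keys follow the field names of vLLM's KV transfer configuration. The `kv_both` role for the NIXL and MultiConnector entries and the order of the nested connectors are assumptions.

```python
from typing import Any, Dict, Optional


def select_kv_transfer_config(
    enable_lmcache: bool,
    disaggregated: bool,
    is_prefill_worker: bool = False,
) -> Optional[Dict[str, Any]]:
    """Return a connector config for one worker, or None if no connector is needed."""
    lmcache = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    # Assumption: the NIXL connector also runs with the kv_both role.
    nixl = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}

    if not disaggregated:
        # Aggregated mode: LMCache alone handles reading and writing.
        return dict(lmcache) if enable_lmcache else None

    if is_prefill_worker and enable_lmcache:
        # Prefill worker: LMCache for offloading plus NIXL for transfer to decode.
        return {
            "kv_connector": "MultiConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {"connectors": [lmcache, nixl]},
        }

    # Decode worker (or a prefill worker without LMCache): NIXL only.
    return dict(nixl)
```

Calling `select_kv_transfer_config(True, True, is_prefill_worker=True)` yields the MultiConnector setup, while the same call with `is_prefill_worker=False` yields the NIXL-only setup.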
The system automatically configures LMCache environment variables when enabled:
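As a rough illustration of what such automatic configuration could look like, the sketch below fills in the three variables listed earlier with their documented defaults. The helper name `apply_lmcache_defaults` is hypothetical. Two behaviours are assumptions rather than confirmed details: values already set by the user are preserved, and only `ENABLE_LMCACHE=1` counts as enabled.

```python
from typing import Dict, MutableMapping

# Default values as listed in the configuration section of this document.
LMCACHE_DEFAULTS: Dict[str, str] = {
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_LOCAL_CPU": "True",
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "20",
}


def apply_lmcache_defaults(env: MutableMapping[str, str]) -> Dict[str, str]:
    """Fill in missing LMCache variables when enabled; return the entries added."""
    # Assumption: only the exact value "1" enables LMCache.
    if env.get("ENABLE_LMCACHE") != "1":
        return {}
    added: Dict[str, str] = {}
    for name, value in LMCACHE_DEFAULTS.items():
        # Assumption: user-provided values take precedence over defaults.
        if name not in env:
            env[name] = value
            added[name] = value
    return added
```

Passing `os.environ` as `env` applies the defaults to the current process. Passing a plain dictionary lets you preview the result without side effects.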
Argument Parsing (args.py):
ENABLE_LMCACHE environment variable
Engine Setup (main.py):
Chunk Size Tuning: Adjust LMCACHE_CHUNK_SIZE based on your use case:
Memory Allocation: Set LMCACHE_MAX_LOCAL_CPU_SIZE conservatively:
Workload Optimization: LMCache performs best with:
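To make the chunk-size and memory settings above more concrete, a back-of-the-envelope estimate shows how many chunks fit in a given CPU budget. This is a sizing sketch only. It uses the standard KV cache size formula (keys and values, per layer, per KV head) and ignores any metadata or allocator overhead LMCache may add. The function names and the model shape in the usage note are hypothetical, and "GB" is assumed to mean GiB.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache produced per token (dtype_bytes=2 for fp16/bf16)."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def chunks_that_fit(cpu_size_gb: float, chunk_size: int, bytes_per_token: int) -> int:
    """Number of whole chunks that fit in the CPU budget, ignoring overhead."""
    # Assumption: the limit is interpreted as GiB (2**30 bytes).
    budget_bytes = int(cpu_size_gb * 2**30)
    return budget_bytes // (chunk_size * bytes_per_token)
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 in fp16, `kv_bytes_per_token(32, 8, 128)` gives 131072 bytes (128 KiB) per token. With `LMCACHE_CHUNK_SIZE=256` and `LMCACHE_MAX_LOCAL_CPU_SIZE=20`, `chunks_that_fit(20, 256, 131072)` gives 640 chunks, or about 164K tokens of cached context.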
When LMCache is enabled with --connector lmcache and DYN_SYSTEM_PORT is set, LMCache metrics are automatically exposed via Dynamo’s /metrics endpoint alongside vLLM and Dynamo metrics.
Requirements to access LMCache metrics:
--connector lmcache - Enables LMCache
DYN_SYSTEM_PORT=8081 - Enables metrics HTTP endpoint
PROMETHEUS_MULTIPROC_DIR (optional) - If not set, Dynamo manages it internally. Only set explicitly if you need control over the metrics directory.
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the LMCache Metrics section in the vLLM Prometheus Metrics Guide.
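As a quick check that the endpoint is serving LMCache metrics, a short script can fetch `/metrics` and keep only the LMCache lines. This is a minimal sketch using the Python standard library. The function names are hypothetical. It assumes the worker is reachable on `localhost` at the `DYN_SYSTEM_PORT` value shown above and that LMCache metric names begin with `lmcache`.

```python
import urllib.request
from typing import List


def filter_metric_lines(metrics_text: str, prefix: str = "lmcache") -> List[str]:
    """Keep Prometheus sample lines whose metric name starts with `prefix`."""
    return [
        line
        for line in metrics_text.splitlines()
        # Comment lines ("# HELP", "# TYPE") start with "#" and are skipped.
        if line.startswith(prefix)
    ]


def fetch_lmcache_metrics(port: int = 8081, host: str = "localhost") -> List[str]:
    """Fetch /metrics from the system port and return only the LMCache lines."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    return filter_metric_lines(text)
```

With a worker running, `print("\n".join(fetch_lmcache_metrics()))` lists the current LMCache samples. An empty result suggests that LMCache is not enabled or that the prefix assumption does not hold for your version.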