KV Cache Transfer in Disaggregated Serving
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
Default Method: NIXL
By default, TensorRT-LLM uses NIXL (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. NIXL is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
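Because NIXL is the default, no configuration is strictly required, but the choice can be made explicit. As a sketch, assuming the `cache_transceiver_config` section described later in this page and an otherwise valid engine configuration around it:

```yaml
# Illustrative engine configuration fragment (not a complete config).
# NIXL is the default backend, so this line only states the choice explicitly.
cache_transceiver_config:
  backend: NIXL
```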
Specify Backends for NIXL
TODO: Add instructions for how to specify different backends for NIXL.
Alternative Method: UCX
TensorRT-LLM can also leverage UCX (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
- Recommended: Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
- Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
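The recommended option above can be sketched as the following configuration fragment (illustrative; the surrounding structure of the engine configuration is assumed):

```yaml
# Illustrative engine configuration fragment (not a complete config).
# Selects UCX directly as the KV cache transfer backend.
cache_transceiver_config:
  backend: UCX
```

For the environment-variable route, export `TRTLLM_USE_UCX_KV_CACHE=1` before launching the workers and set `backend: DEFAULT` in the same section instead.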
This flexibility allows users to choose the transfer backend best suited to their deployment and compatibility requirements.