Encode-Prefill-Decode (EPD) Flow with NIXL
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized Encode-Prefill-Decode (EPD) flow using NIXL (RDMA) for zero-copy tensor transfer.
Enabling the Feature
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it build the dynamo container with the --tensorrtllm-commit flag, followed by the commit hash:
Key Features
- High Performance: Zero-copy RDMA transfer for embeddings
- Dynamic Shape Allocation: Automatically handles variable embedding shapes per image
- Multi-Format Support: Works with tensor files (
.pt) and dictionary-based embeddings - Hybrid Transfer: Large tensors via NIXL, small metadata via JSON
How to use
Pre-requsites
This script is specifically designed to work on 8 node H200 and Llama-4-Maverick-17B-128E-Instruct model with assumption that you already have a model specific embedding file ready.
Configuration
The EPD flow uses a dedicated Encode Worker that runs separately from the Prefill and Decode workers. The ENCODE_ENDPOINT environment variable specifies how the Prefill worker communicates with the Encode worker:
This endpoint follows Dynamo’s standard format: dyn://namespace.component.endpoint where the Encode worker registers itself as dynamo.tensorrt_llm_encode.generate.
For local embedding file access, use the --allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH" parameter to specify the secure directory path where embedding files can be loaded from (default: /tmp). This prevents path traversal attacks while allowing flexible file access within the designated directory.
For tensor file size protection, use the --max-file-size-mb "$MAX_FILE_SIZE_MB" parameter to limit the maximum size of downloadable embedding files/Image URLs (default: 50MB). This prevents Denial of Service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.
Architecture Overview
The EPD flow implements a 3-worker architecture for high-performance multimodal inference:
- Encode Worker: Loads and processes multimodal embeddings
- Prefill Worker: Handles initial context processing and KV-cache generation
- Decode Worker: Performs streaming token generation
Request Flow Diagram
How the System Works
- Request Processing: Multimodal requests containing embedding file paths or URLs are routed by the frontend to prefill workers
- Multimodal Loading: EncodeWorker loads large embedding files and extracts auxiliary metadata
- NIXL Transfer: Main tensors transferred via zero-copy RDMA, small metadata via JSON for efficiency
- Dynamic Allocation: Consumer workers allocate tensors with exact shapes received from EncodeWorker
- Reconstruction: Original embedding format (dictionary or tensor) is reconstructed for model processing
Example Request
The request format is identical to regular multimodal requests: