EFA (RDMA over AWS Fabric) on EKS
This guide covers setting up RDMA over AWS Elastic Fabric Adapter (EFA) on EKS for high-performance disaggregated inference with Dynamo. EFA is the only RDMA fabric available on AWS — InfiniBand and RoCE are not offered. With EFA, Dynamo’s prefill and decode workers transfer KV cache directly between GPUs across nodes via GPU-Direct RDMA, bypassing CPU and TCP/IP stacks.
Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~1s with EFA on Llama-3.1-8B at ISL 8000). See the Disaggregated Communication Guide for the transport-layer fundamentals.
Prerequisites
Recommended GPU EC2 instance types with EFA:
This table is not an exhaustive list of all AWS instance types that support EFA. It lists the GPU families most relevant to Dynamo disaggregated inference.
Cluster setup:
- GPU-Direct RDMA enabled on the host: either kernel ≥ 5.12 (DMA-BUF path; default on current AWS EKS AMIs, typically 6.14+) or an older kernel with the nvidia-peermem / AWS efa_nv_peermem module loaded (legacy peer-memory path; see Step 2 for how to install it).
- EFA-enabled security group: VPC security groups must allow all traffic between EFA-attached ENIs. The standard recommendation is a self-referencing security group rule that allows all protocols within the group. See AWS EFA security group setup.
- EKS node groups created with EFA support: when using eksctl, set efaEnabled: true on the GPU node group. This attaches the appropriate number of EFA ENIs per instance type.
Overview
EFA setup involves three pieces:
- AWS EFA Kubernetes device plugin: exposes EFA NICs as the vpc.amazonaws.com/efa extended resource (host-level setup, Step 1). On modern kernels (≥ 5.12) the DMA-BUF path is used and efa_nv_peermem is not required; older kernels need it loaded (Step 2).
- Container image with libfabric + aws-ofi-nccl + Dynamo (Step 3).
- Workload spec that selects the LIBFABRIC NIXL backend, requests EFA resources, and runs privileged (Step 4, Step 5).
Step 1: Install the AWS EFA Kubernetes Device Plugin
The AWS EFA Kubernetes Device Plugin exposes each node’s EFA endpoints as the vpc.amazonaws.com/efa extended resource so pods can request them. AWS publishes two install paths — pick one:
Helm (recommended, from the official aws/eks-charts repo):
Or raw manifest (from aws-samples/aws-efa-eks):
Wait for the device plugin pods to start on every EFA-capable node:
Verify EFA resources are advertised by each GPU node:
Each EFA-capable node should report a non-zero vpc.amazonaws.com/efa count (e.g., 32 on p5.48xlarge, reflecting that instance’s EFA endpoint count). The exact count depends on instance type and how the node group’s ENIs were configured at launch.
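One way to script this check is to parse the JSON printed by kubectl get nodes -o json and list each node's allocatable vpc.amazonaws.com/efa count. The sketch below assumes that JSON has already been captured as a string. The helper names efa_counts and nodes_missing_efa are illustrative, not part of the device plugin or Dynamo.

```python
import json

EFA_RESOURCE = "vpc.amazonaws.com/efa"


def efa_counts(nodes_json: str) -> dict:
    """Map node name -> allocatable EFA count, given `kubectl get nodes -o json` output."""
    doc = json.loads(nodes_json)
    counts = {}
    for node in doc.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as integer strings; a missing key means 0.
        counts[name] = int(allocatable.get(EFA_RESOURCE, "0"))
    return counts


def nodes_missing_efa(counts: dict) -> list:
    """Names of nodes that advertise no EFA devices, sorted for stable output."""
    return sorted(name for name, count in counts.items() if count == 0)
```

Any GPU node returned by nodes_missing_efa is a candidate for a missing device plugin pod or a node group created without EFA support.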
Step 2: Verify Host Kernel Modules
Modern AWS GPU AMIs (Amazon Linux 2023, Ubuntu 22.04+, kernel ≥ 5.12) use DMA-BUF for GPU-Direct RDMA and do not require nvidia-peermem or efa_nv_peermem. The default AMIs for p5/p5e/p5en/p6-b200/GB200 ship with kernels in the 6.x line where DMA-BUF is the active path.
To confirm:
If you are on an older kernel (< 5.12) and the host doesn’t already have efa_nv_peermem loaded, the simplest path is to switch to an AMI that includes EFA host-level components — the EKS-optimized AL2023 NVIDIA AMI and all Bottlerocket AMIs include them. Otherwise, run aws-efa-installer on the host (via a privileged DaemonSet or baked into a custom AMI). See AWS — Manage EFA devices on Amazon EKS for the full picture.
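The decision described in this step can be sketched in Python as follows. It applies only the 5.12 threshold and the module names given above (spelled with underscores, as they appear in /proc/modules). The function names and return labels are illustrative assumptions, and the check does not prove that GPU-Direct RDMA actually works end to end.

```python
import re

DMABUF_MIN_KERNEL = (5, 12)
# Kernel module names use underscores in /proc/modules.
LEGACY_MODULES = {"efa_nv_peermem", "nvidia_peermem"}


def kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def loaded_modules(proc_modules_text: str) -> set:
    """Module names from the contents of /proc/modules (first field of each line)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def gpudirect_path(release: str, proc_modules_text: str) -> str:
    """Classify the host as 'dma-buf', 'legacy-peermem', or 'unavailable'."""
    if kernel_version(release) >= DMABUF_MIN_KERNEL:
        return "dma-buf"
    if loaded_modules(proc_modules_text) & LEGACY_MODULES:
        return "legacy-peermem"
    return "unavailable"
```

On a node, the release string could come from platform.release() and the module list from reading /proc/modules. A result of "unavailable" corresponds to the older-kernel case above where the AMI or host setup needs to change.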
Step 3: Build a Dynamo EFA Image
Dynamo’s image build is two steps: container/render.py writes a Dockerfile for the chosen framework + target, then docker build consumes it. Passing --make-efa to render.py appends the AWS EFA installer stage from container/templates/aws.Dockerfile, which defines a stage named aws on top of runtime. You must pass --target aws to docker build — without it, docker build stops at the runtime stage and you get an image without EFA. See container/README.md for the full build workflow.
--output-short-filename writes to container/rendered.Dockerfile; omit it to get the long auto-generated filename (e.g., vllm-runtime-cuda12.9-amd64-rendered.Dockerfile) — useful when keeping several rendered Dockerfiles side by side.
See Known Issues below for one case where the default-built image does not produce a working EFA deployment out of the box (GB200 / arm64 64K-page kernels). The symptom looks like a working setup but fails at startup during NIXL memory registration.
Step 4: Configure NIXL Backend
NIXL is the high-level KV transfer API and supports multiple backends. For EFA, the LIBFABRIC backend must be selected. UCX is NIXL’s default backend, and while it has CUDA-IPC / RDMA transports available in the image, in standard pod-to-pod EFA configurations it lands on a slow transport (effectively TCP-speed at ~1–3 GB/s) instead of EFA’s line rate. Empirically, LIBFABRIC is the only backend that reaches full EFA bandwidth on AWS.
Each framework selects the backend differently:
This is a silent-failure path — getting it wrong manifests as ~100 s TTFT instead of a clear error. Always verify at startup that LIBFABRIC is active.
Required EFA environment variables
In addition to backend selection, set these on every worker pod:
Recommended EFA performance tuning
When using FI_EFA_USE_HUGE_PAGE=1, also add hugepages-2Mi: 5120Mi to the pod resource limits.
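To make the hugepages figure concrete, the arithmetic below converts the hugepages-2Mi: 5120Mi limit into a page count. The quantity parser is a simplified sketch that handles only the Ki, Mi, and Gi suffixes; the function names are illustrative.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(quantity: str) -> int:
    """Bytes for a binary-suffix quantity such as '5120Mi' (Ki/Mi/Gi only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def hugepage_count(limit: str, page_size: str = "2Mi") -> int:
    """Number of huge pages of page_size needed to back the given limit."""
    total, page = parse_quantity(limit), parse_quantity(page_size)
    if total % page:
        raise ValueError("limit must be a whole multiple of the page size")
    return total // page


print(hugepage_count("5120Mi"))  # 2560
```

A 5120Mi limit therefore corresponds to 2560 pages of 2 MiB each. Kubernetes can only schedule the pod onto a node that has at least that many huge pages pre-allocated and free.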
Step 5: Pod Resource Requests
Dynamo pods that use EFA must request the resource and run privileged:
privileged: true is required for NIXL to register CUDA VRAM with the EFA NIC via fi_mr_reg. IPC_LOCK alone is insufficient.
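The two requirements in this step can be checked mechanically against a container spec before deployment. The sketch below treats the spec as a plain Python dict, for example one loaded from a manifest. The function name efa_spec_problems and the example resource counts are illustrative assumptions, not values prescribed by Dynamo.

```python
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def efa_spec_problems(container: dict) -> list:
    """Return human-readable problems with a container spec's EFA settings."""
    problems = []
    limits = container.get("resources", {}).get("limits", {})
    if int(limits.get(EFA_RESOURCE, 0)) < 1:
        problems.append(f"resources.limits must request at least one {EFA_RESOURCE}")
    security = container.get("securityContext", {})
    # IPC_LOCK alone is not enough for VRAM registration, so require privileged.
    if security.get("privileged") is not True:
        problems.append("securityContext.privileged must be true")
    return problems


# Example container spec; the counts are illustrative and depend on instance type.
example = {
    "name": "worker",
    "resources": {"limits": {"nvidia.com/gpu": 8, EFA_RESOURCE: 32}},
    "securityContext": {"privileged": True},
}
print(efa_spec_problems(example))  # []
```

A spec that only adds the IPC_LOCK capability would be reported with both problems, which matches the requirement stated above.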
Known Issues
One issue currently affects default-built Dynamo EFA images.
Issue 1: libfabric on GB200 fails fi_mr_reg on CUDA VRAM
Known affected platforms: GB200.
Symptom: Worker pod fails at startup with fi_mr_reg returning EFAULT during NIXL initialization. NIXL VRAM registration fails; depending on the framework, the worker either crashes or silently falls back to TCP.
Root cause: The libfabric bundled with the EFA installer (libfabric versions below 2.5.x, in installer releases up to and including the current latest, 1.48.0) lacks a CUDA branch in the dmabuf-eligibility check in prov/efa/src/efa_mr.c. On x86_64 hosts the legacy ibv_reg_mr path handles CUDA pointers natively, so the bug doesn’t surface. On arm64 64K-page kernels (GB200), the legacy path returns EFAULT for CUDA VRAM. Tracked in ofiwg/libfabric#12019.
Upstream status: The bug is resolved in ofiwg/libfabric main and v2.5.x via a more comprehensive rewrite of efa_mr_reg_ibv_mr(). AWS’s aws/libfabric fork has not picked up the upstream rewrite; the latest EFA installer (1.48.0) still ships v2.4.0amzn3.0 with the older code path.
Workarounds:
- Apply the one-line patch to the bundled libfabric. During image build, replace the aws.Dockerfile install step with a custom build:
- Replace bundled libfabric with ofiwg/libfabric@v2.5.1 (or newer). The upstream rewrite is already present; no patch needed. Rebuild aws-ofi-nccl against it.
Verification
After deployment, confirm EFA is actually being used (not silent TCP fallback):
1. NIXL chose the LIBFABRIC backend (not UCX):
2. The LIBFABRIC plugin is loaded and executing (not just opened):
3. Registered RDMA memory is GPU VRAM, not CPU pinned memory (no CPU bounce):
4. NIXL transfers are happening, none failing (via Prometheus metrics endpoint):
NIXL telemetry is off by default. To enable it, set on each worker:
Then query:
The same metrics with the vllm: prefix are also published to vLLM’s own metrics endpoint (typically DYN_SYSTEM_PORT, e.g. 8081) when vLLM is the frontend.
5. Decode side confirms KV receipt:
Do not use rdma_write_bytes or other /sys/class/infiniband/*/counters/* checks for EFA verification. EFA SRD uses SEND operations at the hardware level, not RDMA READ/WRITE — rdma_write_bytes is always 0 on correctly configured EFA by design. Use the Prometheus + /proc/<pid>/maps methodology above instead.
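The Prometheus check in item 4 could be scripted roughly as follows. The sketch parses the Prometheus text exposition format and confirms that a transfer counter is non-zero while a failure counter stays at zero. The metric names in the sample are placeholders, not real NIXL metric names; substitute the names actually exposed by your worker's metrics endpoint.

```python
def parse_samples(text: str) -> list:
    """Parse Prometheus text format into (metric_name, value) pairs, ignoring labels."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            name, *rest = line.split()
        samples.append((name, float(rest[0])))  # rest[1], if present, is a timestamp
    return samples


def sum_metric(text: str, metric: str) -> float:
    """Sum every sample of an exactly matching metric name across label sets."""
    return sum(value for name, value in parse_samples(text) if name == metric)


def transfers_healthy(text: str, transfers_metric: str, failures_metric: str) -> bool:
    """True when some transfers were recorded and none failed."""
    return sum_metric(text, transfers_metric) > 0 and sum_metric(text, failures_metric) == 0


# Placeholder metric names for illustration only.
sample = """# TYPE example_transfers_total counter
example_transfers_total{worker="prefill-0"} 5
example_transfers_total{worker="prefill-1"} 7
example_transfer_failures_total 0
"""
print(transfers_healthy(sample, "example_transfers_total", "example_transfer_failures_total"))  # True
```

The input text could be fetched with urllib.request.urlopen against the worker's metrics endpoint. A result of False with a zero transfer count suggests telemetry is disabled or no KV transfers have occurred yet.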
Common Failure Modes
References
- Disaggregated Communication Guide — transport-layer fundamentals
- RDMA / InfiniBand on AKS — Azure equivalent
- container/templates/aws.Dockerfile — EFA installer template
- AWS — Manage EFA devices on Amazon EKS — official EKS-side guide (DRA driver + device plugin)
- AWS EFA documentation — EC2-side EFA overview
- aws/eks-charts — aws-efa-k8s-device-plugin — Helm chart source
- ofiwg/libfabric#12019 — CUDA dmabuf registration on EFA