Gateway API Inference Extension (GAIE)
Gateway API Inference Extension Setup with Dynamo
Integrate Dynamo with the Gateway API Inference Extension, also known as Inference Gateway, for intelligent KV-aware request routing at the gateway layer.
Features
- EPP’s default KV routing approach is not token-aware because the prompt is not tokenized. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration is embedded in the recipe-based GAIE deploy YAMLs under recipes/llama-3-70b/vllm/agg/gaie/ and recipes/llama-3-70b/vllm/disagg-single-node/gaie/, following the GAIE/EPP configuration layout used by this repository.
- Dynamo integration with the Inference Gateway supports Aggregated and Disaggregated Serving. A request only exercises disaggregated routing when the EPP config defines a prefill profile and prefill workers are available. The recipe examples provide separate aggregated and disaggregated configs under recipes/llama-3-70b/vllm/agg/gaie/ and recipes/llama-3-70b/vllm/disagg-single-node/gaie/. Unless DYN_ENFORCE_DISAGG=true, deployments without a prefill profile or prefill workers fall back to aggregated serving.
- GAIE integration supports Data Parallelism.
- If you want to use LoRA, deploy Dynamo without the Inference Gateway.
- These setups use agentgateway as the Inference Gateway implementation. For the Istio Inference Gateway, check out recipes/qwen3-0.6b/vllm/agg/gaie.
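The aggregated/disaggregated fallback rule in the list above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the EPP's code: the function name, its parameters, and the exception type are hypothetical.

```python
def choose_serving_mode(has_prefill_profile: bool,
                        prefill_workers_available: bool,
                        enforce_disagg: bool = False) -> str:
    """Return "disaggregated" or "aggregated" for one request.

    Hypothetical helper that mirrors the documented rule; `enforce_disagg`
    stands in for DYN_ENFORCE_DISAGG.
    """
    # Disaggregated routing needs both a prefill profile and live prefill workers.
    if has_prefill_profile and prefill_workers_available:
        return "disaggregated"
    # With enforcement on, the request fails instead of silently falling back.
    if enforce_disagg:
        raise RuntimeError("disaggregated serving is enforced but prefill is unavailable")
    # Default behavior: fall back to aggregated serving.
    return "aggregated"
```

With the default `enforce_disagg=False`, a deployment that has no prefill workers yet is served in aggregated mode, and later requests switch to disaggregated mode once both conditions hold.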
Prerequisites
- Kubernetes cluster with kubectl configured
- NVIDIA GPU drivers installed on worker nodes
Installation Steps
1. Install Dynamo Platform
See Quickstart Guide to install Dynamo Kubernetes Platform.
If you are installing from the source tree rather than a release chart, follow Advanced: Build from Source and run helm dep build ./platform/ before helm install so the vendored subcharts match the local chart contents.
2. Deploy Inference Gateway
First, deploy an inference gateway service. In this example, we’ll install agentgateway with the inference extension enabled.
This script installs the Gateway API CRDs, the GAIE CRDs, agentgateway into agentgateway-system, and a Gateway named inference-gateway into ${NAMESPACE}.
Verify the Gateway is running
2b. Istio Gateway (Alternative)
If you are using Istio as your gateway implementation,
the EPP uses secure serving (TLS) by default. The gateway proxy needs an
Istio DestinationRule to talk to the EPP service; without it the Istio
ext_proc filter fails with connection termination errors.
The Dynamo operator can create this DestinationRule for you. Install or
upgrade the platform Helm chart with dynamo.serviceMesh.enabled=true
(see Service Mesh Integration (Istio)
below). When that is set, you can skip the rest of this section.
If you are not using the operator’s Helm chart, or have left
dynamo.serviceMesh.enabled=false, apply a DestinationRule manually for
each EPP service:
Replace <dgd-name> with your DynamoGraphDeployment name and <namespace> with the namespace where the EPP is deployed. See recipes/qwen3-0.6b/vllm/agg/gaie/dr.yaml for an example.
3. Setup secrets
Do not forget the Docker registry secret, if needed.
Do not forget to include the HuggingFace token.
4. Build EPP image (Optional)
You can either use the provided Dynamo FrontEnd image as the EPP image or build your own custom Dynamo EPP image by following the steps below.
All-in-one Targets
4b. Build Rust EPP image (Optional — experimental)
A pure-Rust EPP implementation is available as an alternative to the Go-based EPP. It replaces the Go EPP + CGO bridge with a single native Rust binary that implements the Envoy ext_proc gRPC service and uses Dynamo’s KV-aware router directly — no FFI boundary, no Go runtime.
To build the binary locally without Docker:
Rust EPP Makefile Targets
Rust EPP Configuration
The Rust EPP uses the same environment variables as the Go EPP for namespace resolution and router configuration:
The gRPC port is hardcoded to 9002 (matching the operator’s EPPGRPCPort constant).
Namespace resolution follows the same logic as the Go EPP plugin:
DYN_NAMESPACE_PREFIX > DYN_NAMESPACE > "vllm-agg" (default).
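As an illustration of that precedence, here is a minimal Python sketch. It is not the EPP's implementation: the function name is hypothetical, both variables are treated as plain values, and treating an empty value as unset is an assumption.

```python
import os

DEFAULT_NAMESPACE = "vllm-agg"  # documented default


def resolve_namespace(env=None) -> str:
    """Pick the namespace using the documented precedence.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    # Highest precedence first.
    for name in ("DYN_NAMESPACE_PREFIX", "DYN_NAMESPACE"):
        value = env.get(name)
        if value:  # assumption: an empty string counts as unset
            return value
    return DEFAULT_NAMESPACE
```

For example, `resolve_namespace({"DYN_NAMESPACE": "team-a"})` yields `team-a`, while an environment with neither variable set yields the default.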
The Rust EPP also respects the standard Dynamo router environment variables
(DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT, DYN_ROUTER_PREFILL_LOAD_SCALE,
DYN_ROUTER_TEMPERATURE, DYN_USE_KV_EVENTS, etc.) documented in the
Configuration section below. The deprecated overlap-weight aliases remain
supported with the same precedence as the Go EPP.
The Rust EPP is experimental. It uses Dynamo’s native discovery system
(DistributedRuntime) instead of the GAIE Kubernetes controllers, so it
does not require InferencePool or InferenceModel CRDs for endpoint
discovery. It discovers workers through Dynamo’s own registration mechanism.
The Rust EPP currently supports only pod-level Kubernetes discovery. Deploy one Rust EPP replica per pool because request selection and booking are not yet atomic across concurrent EPP replicas. After a worker-generation rolling update, restart the Rust EPP so it binds to the new generation namespace. Exact streamed output-block updates are also not yet wired into the Rust EPP.
InferencePool and the data plane (Istio, kGateway, Agentgateway)
Although the Rust EPP does not consult InferencePool for worker discovery,
the CRD is still required by the gateway data plane. Gateway
implementations (Istio, kGateway, Agentgateway) read InferencePool to:
- Attach the ext_proc filter pointing at the EPP service.
- Enable the override_host LB policy so the EPP’s x-gateway-destination-endpoint header / dynamic-metadata is honored.
- Scope which pods are eligible to receive traffic: the pool’s selector becomes the envoy.lb.subset_hint metadata that the EPP intersects with its own discovered workers before picking one.
The Dynamo operator auto-generates the InferencePool for every
DynamoGraphDeployment (deploy/operator/internal/dynamo/epp/inference_pool.go).
Its Selector matches the operator’s worker-pod labels and its
EndpointPickerRef points at the EPP service on 9002, so Dynamo’s
discovery and the pool’s pod set stay in sync automatically — users do not
hand-craft the pool.
Using Istio instead of kGateway:
- The only Istio-specific step is creating an Istio Gateway/HTTPRoute that references the operator-generated InferencePool as its backendRef. The DGD, the generated pool, and the Rust EPP image are all unchanged.
- The operator targets the stable inference.networking.k8s.io/v1 API group, supported in Istio ≥ 1.27. Older Istio versions used the experimental inference.networking.x-k8s.io group and are not compatible.
- mTLS to the EPP. Istio expects mTLS between the gateway and the EPP service. The Rust EPP serves self-signed TLS on 9002 by default (DYN_SECURE_SERVING=true). See Service Mesh Integration (Istio) below for the DestinationRule the Dynamo Helm chart can generate so Istio terminates the EPP’s TLS correctly.
Model card discovery, worker liveness, KV-aware routing, and bookkeeping
remain entirely in Dynamo’s control. The InferencePool provides the
data-plane envelope (which pods, which port, which EPP); Dynamo’s
discovery and the Rust EPP provide the routing intelligence inside that
envelope. Customizing the pool selector by hand is supported but requires
keeping it consistent with the operator’s worker-pod labels — otherwise
pods discovered by Dynamo will fail subset filtering and the EPP will
return RoutingFailed.
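The subset filtering described above amounts to a set intersection between the workers Dynamo discovered and the endpoints the pool selector allows. The sketch below illustrates that idea only: the function name, the exception class, and the use of "ip:port" strings as worker identifiers are assumptions, not the Rust EPP's actual types.

```python
class RoutingFailed(Exception):
    """Raised when no discovered worker survives subset filtering (illustrative)."""


def eligible_workers(discovered, subset_hint):
    """Keep only discovered workers that also appear in the pool's subset hint.

    Both arguments are iterables of hypothetical "ip:port" strings.
    The order of `discovered` is preserved.
    """
    allowed = set(subset_hint)
    eligible = [worker for worker in discovered if worker in allowed]
    if not eligible:
        # Mirrors the documented failure mode of a mismatched pool selector.
        raise RoutingFailed("no discovered worker matches the InferencePool subset")
    return eligible
```

A hand-edited selector that no longer matches the worker-pod labels corresponds to a `subset_hint` that shares no entries with `discovered`, which is the case that raises here.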
5. Deploy
The example below deploys Qwen with vLLM.
You must deploy both the Dynamo Graph and the HTTPRoute.
The example http-route.yaml resolves the Gateway in the same namespace as
the HTTPRoute, so the simplest path is to apply the route in the same
namespace where you installed the Gateway (i.e. ${NAMESPACE}). If your
Gateway lives in a different namespace, add parentRefs[].namespace to point
at it explicitly:
Examples for other models can be found in the recipes folder.
Examples for llama-3-70b with vLLM are under recipes/llama-3-70b/vllm/agg/gaie/ for aggregated serving and recipes/llama-3-70b/vllm/disagg-single-node/gaie/ for disaggregated serving.
Note that for aggregated serving you need to disable DYN_ENFORCE_DISAGG in the EPP config.
Use the folder that matches your deployment in the commands below.
- When using GAIE, the FrontEnd does not choose the workers. The routing is determined in the EPP.
- The FrontEnd must run with --router-mode direct so that it respects the EPP’s routing decisions passed via request headers.
- In v1beta1 DGD manifests, set the frontendSidecar field on a worker component to the name of a container in that component’s pod template. The operator merges the required Dynamo env vars, probes, and ports into that sidecar container.
- The pre-selected workers (decode and prefill in case of disaggregated serving) are passed in request headers and injected into the request routing hints.
- The --router-mode direct flag ensures the routing respects this selection.
Startup Probe Timeout: The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the failureThreshold in the EPP’s startupProbe. For example,
to allow 60 minutes for startup:
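The probe arithmetic can be checked with a few lines of Python. The helper names are hypothetical; the only inputs taken from the text are the 10-second period and the 180-failure default.

```python
import math


def startup_timeout_minutes(period_seconds: int, failure_threshold: int) -> float:
    """Total time a startup probe tolerates before the container is restarted."""
    return period_seconds * failure_threshold / 60


def failure_threshold_for(minutes: float, period_seconds: int = 10) -> int:
    """Smallest failureThreshold that allows at least `minutes` of startup time."""
    return math.ceil(minutes * 60 / period_seconds)


print(startup_timeout_minutes(10, 180))  # default EPP probe: 30.0 minutes
print(failure_threshold_for(60))         # 60 minutes at a 10s period: 360
```

So, keeping the 10-second period, a 60-minute startup window corresponds to a failureThreshold of 360.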
Gateway Namespace
The example http-route.yaml resolves the Gateway in the same namespace as
the route. If you install the Gateway in one namespace and apply the route in
another, add parentRefs[].namespace: <gateway-namespace> to http-route.yaml.
Common Vars for Routing Configuration:
Enabling KV-Aware Routing (most precise)
KV-aware routing uses live KV cache block events from workers so the EPP can route requests to the worker with the best prefix cache overlap. To enable it (default):
- Workers: enable prefix caching and KV event publishing. Each worker must publish KV cache events to the event plane (NATS/ZMQ) so the EPP’s router can track per-worker cache state.
  - vLLM: Pass --enable-prefix-caching and --kv-events-config '{"enable_kv_cache_events":true}'.
  - SGLang: Pass --kv-events-config with the appropriate endpoint.
  - TRT-LLM: Pass --publish-events-and-metrics.
- EPP: leave DYN_USE_KV_EVENTS at its default (true). The EPP subscribes to worker KV events via the event plane (NATS/ZMQ) and uses them for prefix-overlap scoring.
- Block size: must be consistent. The --block-size on all workers must match DYN_KV_CACHE_BLOCK_SIZE on the EPP (default: 128). Mismatched block sizes cause incorrect block hash computation.
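To see why the block size must match, consider a toy model of block hashing. This sketch is not Dynamo's hashing scheme: the chained SHA-256 construction and both function names are assumptions chosen for illustration. It only shows that the same tokens hashed with different block sizes share no block hashes, so prefix overlap is never detected.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Chained hash of every full block of tokens (a partial tail block is ignored)."""
    hashes = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def overlap_blocks(request_hashes, cached_hashes):
    """Count leading request blocks that are already in a worker's cache."""
    cached = set(cached_hashes)
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


tokens = list(range(512))
worker_cache = block_hashes(tokens[:256], block_size=128)  # worker cached 2 blocks

# Matching block size: the first two blocks of the request are found.
print(overlap_blocks(block_hashes(tokens, 128), worker_cache))
# Mismatched block size: the same prefix produces unrelated hashes.
print(overlap_blocks(block_hashes(tokens, 64), worker_cache))
```

In this toy model the first call reports an overlap of two blocks and the second reports none, even though the request shares a 256-token prefix with the cached content in both cases.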
Disabling KV-Aware Routing
To disable the EPP from listening for KV events (e.g., when prefix caching is off on workers, or for simpler load-balanced routing):
- EPP: Set DYN_USE_KV_EVENTS=false. The router falls back to approximate mode (routing decisions are tracked locally with TTL decay instead of live KV events from workers).
- Workers: Pass --no-enable-prefix-caching to disable prefix caching entirely. Without prefix caching, no KV events are generated regardless of other flags.
- Optionally set DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT=0 on the EPP to skip prefix-overlap scoring altogether, making the router select workers based on load only.
- Set DYN_BUSY_THRESHOLD to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative, meaning this check is not enabled.
- Set DYN_ENFORCE_DISAGG=true (default: false) to control per-request behavior when prefill workers are unavailable:
  - true (recommended for disaggregated serving): Requests fail with an error if prefill workers are not available. Use this when disaggregated serving is required and aggregated fallback is not acceptable.
  - false (default): Requests gracefully fall back to aggregated mode (skip prefill, route directly to decode) when prefill workers are not available. When prefill workers appear later, subsequent requests automatically use disaggregated routing.
- Set DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT to control the device-local prefix-overlap credit multiplier, from 0.0 to 1.0. Higher values bias toward reusing workers with similar cached prefixes. (default: 1)
- Set DYN_ROUTER_PREFILL_LOAD_SCALE to scale adjusted prompt-side prefill load before decode blocks are added. (default: 1)
- Set DYN_ROUTER_TEMPERATURE (default: 0.0) to soften or sharpen normalized worker sampling. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- DYN_ROUTER_REPLICA_SYNC: Enable replica synchronization (default: false)
- DYN_ROUTER_TRACK_ACTIVE_BLOCKS: Track active blocks (default: true)
- DYN_ROUTER_TRACK_OUTPUT_BLOCKS: Track output blocks during generation (default: false)
- DYN_ROUTER_PREDICTED_TTL_SECS: Enable predict-on-route entries with this TTL in seconds
- See the KV cache routing design for details.
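The effect of DYN_ROUTER_TEMPERATURE can be illustrated with a plain softmax over worker scores. This is a sketch of temperature-controlled sampling in general, not the router's actual formula: the function name is hypothetical, and treating higher scores as better with no further normalization is an assumption.

```python
import math


def selection_probabilities(scores, temperature):
    """Probability of picking each worker, given per-worker scores (higher is better)."""
    if not scores:
        raise ValueError("at least one worker score is required")
    if temperature <= 0.0:
        # Temperature 0: always pick the top candidate.
        best = max(range(len(scores)), key=scores.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

With scores `[0.9, 0.5, 0.1]`, a temperature of 0.0 puts all probability on the first worker, while raising the temperature moves probability toward the lower-scoring workers.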
Service Mesh Integration (Istio)
When running under a service mesh such as Istio, the mesh sidecar proxy may conflict with the EPP’s own TLS serving, causing connection failures (double-TLS). To avoid this, the mesh must be told how to connect to the EPP service via an Istio DestinationRule.
The Dynamo operator can generate this DestinationRule automatically. Enable it by setting the dynamo.serviceMesh parameters when installing or upgrading the Dynamo platform Helm chart:
Or equivalently in a custom values file:
Helm Parameters
The Istio CRDs (networking.istio.io) must be installed on the cluster before enabling this feature. The operator detects Istio availability at startup — if the CRDs are not present, DestinationRule reconciliation is skipped even when serviceMesh.enabled is true.
When enabled, the operator produces a DestinationRule for each EPP service equivalent to:
If you are not using the Dynamo operator’s Helm chart, you must create this DestinationRule manually for each EPP service. Without it, Istio’s default mTLS policy will conflict with the EPP’s gRPC TLS endpoint.
Inference-gateway Istio sidecar exclusion
When namespace-level Istio sidecar injection is enabled (istio-injection=enabled), the agentgateway-proxy pod also receives an Istio sidecar. This sidecar intercepts the ext_proc gRPC connection from agentgateway-proxy to EPP (port 9002) and routes it through PassthroughCluster, which breaks the connection and causes all inference requests to return HTTP 500 with an empty body.
The fix is to tell agentgateway to stamp sidecar.istio.io/inject: "false" on the proxy pod template so the Istio webhook skips that pod. EPP and worker pods still receive sidecars normally.
You have two options depending on how you set up the gateway:
Option A: Per-gateway AgentgatewayParameters (recommended)
This is what install_gaie_crd_agentgateway.sh does automatically. It only affects the inference-gateway proxy pods and leaves any other agentgateway-managed gateways untouched.
- Create an AgentgatewayParameters resource in the same namespace as the inference-gateway Gateway (e.g. dynamo-cloud). It must be co-located with the Gateway because the Gateway API spec.infrastructure.parametersRef is a LocalParametersReference, which has no namespace field.
  Apply it with server-side apply (recommended by agentgateway):
- Wire the existing Gateway to use it. If the Gateway already exists, patch it in place:
  Or include the infrastructure block directly in your Gateway manifest:
- agentgateway will roll the proxy pod. Verify the new pod no longer has an istio-proxy container:
Option B: Patch the default AgentgatewayParameters CR (cluster-wide)
The agentgateway controller creates a default AgentgatewayParameters resource named agentgateway in agentgateway-system. Any Gateway that does not set spec.infrastructure.parametersRef inherits this default. Patching it affects all agentgateway-managed proxies in the cluster.
Use Option A instead if you have multiple agentgateway-managed gateways in the cluster and only want the inference-gateway proxy to skip injection.
The annotation is a no-op on clusters where Istio is not installed, so it is safe to set unconditionally.
With both the DestinationRule (for EPP) and the AgentgatewayParameters sidecar exclusion (for agentgateway-proxy) in place, end-to-end GAIE inference works correctly under Istio namespace-level injection.
6. Verify Installation
Check that all resources are properly deployed:
Sample output:
7. Usage
The Inference Gateway provides HTTP endpoints for model inference.
1: Populate gateway URL for your k8s cluster
a. To test the integration in minikube, proceed as below:
Use minikube tunnel to expose the gateway to the host. This requires sudo access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b).
b. To test on a cluster use commands below:
Use port-forward to expose the gateway to the host.
2: Check models deployed to inference gateway
a. Query models:
Sample output:
b. Send inference request to gateway:
or
Sample inference output:
If you have more than one HTTPRoute running on the cluster
Add the host to your http-route.yaml and send the matching Host header with each request, for example
curl -H "Host: llama3-70b-agg.example.com" ... or curl -H "Host: llama3-70b-disagg.example.com" http://localhost:8000/v1/models
8. Deleting the installation
If you need to uninstall, run:
Gateway API Inference Extension Integration
This section documents the updated plugin implementation for Gateway API Inference Extension v1.5.0-rc.2.
Router bookkeeping operations
The EPP performs the Dynamo router bookkeeping operations so the FrontEnd’s Router does not have to sync its state.
Header Routing Hints
Since v1.5.0-rc.1, the EPP uses headers and body mutations for communicating routing decisions.
The plugins set HTTP headers for worker targeting and inject pre-computed token IDs
into the request body (nvext.token_data) so the frontend sidecar can skip redundant tokenization.
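A rough picture of such a mutation, written in Python for readability: the sketch sets a destination header and adds the token IDs under nvext.token_data in the JSON body. It is not the plugin's code. The function name is hypothetical, the x-gateway-destination-endpoint header is the one named earlier in this document, and the exact shape of the real header set and body mutation is an assumption.

```python
import json

# Header name taken from the InferencePool discussion in this document.
DESTINATION_HEADER = "x-gateway-destination-endpoint"


def apply_routing_hints(headers, body_bytes, endpoint, token_ids):
    """Return (headers, body) with the routing decision and token IDs attached.

    `endpoint` is a hypothetical "ip:port" string for the selected worker.
    The input mappings are not modified.
    """
    new_headers = dict(headers)
    new_headers[DESTINATION_HEADER] = endpoint

    body = json.loads(body_bytes)
    # Preserve any existing nvext fields and add the pre-computed token IDs.
    nvext = body.get("nvext") or {}
    nvext["token_data"] = list(token_ids)
    body["nvext"] = nvext
    return new_headers, json.dumps(body).encode()
```

A frontend sidecar that receives such a request can read the token IDs from the body instead of running the tokenizer again.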