Gateway API Inference Extension (GAIE)

Expose DynamoGraphDeployments through Kubernetes Gateway API and Dynamo EPP routing.

Dynamo supports two request routing topologies on Kubernetes:

  • Dynamo-native Frontend routing. The Dynamo Frontend receives HTTP requests and the integrated Dynamo Router selects workers.
  • Gateway API routing with GAIE. A Kubernetes Gateway receives HTTP requests, the Gateway API Inference Extension (GAIE) calls the Dynamo Endpoint Picker Plugin (EPP) for endpoint selection, and the selected worker’s Frontend sidecar forwards the request in direct mode.

This guide covers the Gateway API path for DynamoGraphDeployment resources managed by the Dynamo operator. Use it when your Kubernetes platform wants Gateway API to own traffic entry, policy, and observability while Dynamo owns the serving graph, discovery, event plane, and routing logic inside the EPP.

Components

The operator-managed GAIE path combines user-created Gateway API objects with resources created from the DynamoGraphDeployment.

| Component | Role | Created by |
| --- | --- | --- |
| Gateway | Receives external HTTP traffic for the namespace. | User or platform team |
| HTTPRoute | Attaches model traffic to the Gateway and points at the InferencePool. | User |
| DynamoGraphDeployment | Describes the serving graph, EPP component, workers, and Frontend sidecars. | User |
| Dynamo operator | Reconciles the DGD into Kubernetes resources. | Dynamo platform |
| InferencePool | Connects GAIE endpoint selection to the Dynamo EPP service. | Dynamo operator |
| Dynamo EPP | Scores endpoints and returns the selected worker to the gateway. | Dynamo operator |
| Frontend sidecar | Receives the already-selected request and forwards in direct mode. | Dynamo operator |
| Worker | Runs the model backend. | Dynamo operator |

Request Flow

Gateway API owns the external request path. Dynamo still owns the serving graph: the operator creates the EPP Service, worker pods, Frontend sidecars, and InferencePool that binds the route to the EPP. The EPP receives Dynamo routing state from the runtime event plane and returns the selected worker ID to the gateway. The gateway forwards the request to the selected worker’s Frontend sidecar, which runs in direct routing mode.

In this operator-managed path, the EPP consumes routing state through the Dynamo event plane using NATS/JetStream. Direct vLLM ZMQ KV-event subscriptions are used by other integration shapes, but not by this quickstart path.

Shared Prerequisites

  • A Kubernetes cluster with GPU nodes. For the baseline Gateway API environment, start with the upstream Gateway API getting started guide and the upstream GAIE guide.
  • kubectl, Helm, and jq configured for the cluster.
  • Gateway API and GAIE CRDs installed. The quickstart installs them explicitly from pinned upstream release manifests.
  • A Gateway API implementation that supports GAIE InferencePool resources and endpointPickerRef calls.
  • Dynamo platform installed with the operator. See the Kubernetes Quickstart and Installation Guide.
  • Model credentials and storage needed by the selected model. Hugging Face token secrets are a Dynamo model-serving prerequisite, not a GAIE-specific resource; see the Hugging Face token secret setup.

Gateway Implementation

GAIE requires a Gateway API implementation that can call an Endpoint Picker Plugin before forwarding the request to a backend. Dynamo is independent of the Gateway implementation: pick the gateway that matches your platform, then point its HTTPRoute and generated InferencePool at the Dynamo EPP.

The quickstart shows two verified paths: agentgateway and Istio. Other Gateway API implementations might work when they support the same GAIE InferencePool and endpointPickerRef EPP path; check the upstream GAIE gateway implementation list and your controller’s documentation before choosing another implementation.

Istio uses Envoy in its data plane. agentgateway is a Rust-based AI gateway. The requirement for this guide is not Envoy specifically; it is support for Gateway API plus the GAIE EndpointPicker flow.

Use agentgateway for a small Gateway API footprint or when the cluster does not already standardize on a service mesh. Install the agentgateway chart with inferenceExtension.enabled=true; the GatewayClass is agentgateway.

The following table compares the two verified implementation paths that the quickstart walks through:

| | agentgateway | Istio |
| --- | --- | --- |
| Good fit | New clusters or clusters without a mesh standard | Clusters that already standardize on Istio |
| Install footprint | agentgateway CRDs and controller in agentgateway-system | Istio control plane in istio-system or your chosen namespace |
| GatewayClass | agentgateway | istio |
| GAIE support | Enable inferenceExtension.enabled=true on the chart | Install Istio with ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true |
| Mesh interaction | Use AgentgatewayParameters to keep agentgateway-proxy out of sidecar injection | Configure EPP TLS with a DestinationRule when mesh policy applies |
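
As a minimal sketch of how the GatewayClass row translates into a manifest, the Gateway below selects the agentgateway class; use istio instead for the Istio path. The resource name, namespace, and listener port are hypothetical placeholders, not values taken from the quickstart.

```yaml
# Sketch only: name, namespace, and port are illustrative assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway        # hypothetical Gateway name
  namespace: my-model            # hypothetical namespace
spec:
  gatewayClassName: agentgateway # use "istio" for the Istio path
  listeners:
    - name: http
      protocol: HTTP
      port: 80                   # hypothetical listener port
```

Only gatewayClassName changes between the two paths; the rest of the Gateway uses the standard Gateway API schema.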

Gateway API Concepts

HTTPRoute.spec.parentRefs attaches a route to a Gateway. If the HTTPRoute and Gateway live in different namespaces, set parentRefs[].namespace to the Gateway namespace. rules[].backendRefs points at the InferencePool; the pool points at the EPP service through endpointPickerRef.
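
The relationship described above can be sketched as an HTTPRoute manifest. It assumes the GAIE v1 InferencePool API group (inference.networking.k8s.io); every resource name and namespace is a hypothetical placeholder, so substitute the names from your own Gateway and from the pool the operator generates.

```yaml
# Sketch only: all names and namespaces are illustrative assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route                # hypothetical route name
  namespace: my-model              # namespace of the DGD and InferencePool
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway      # hypothetical Gateway name
      namespace: gateway-infra     # needed only when the Gateway lives elsewhere
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: my-model-pool      # hypothetical; use the generated pool's name
```

When the route and the Gateway are in different namespaces, the Gateway listener must also permit routes from the route's namespace through its allowedRoutes setting; otherwise the route is not accepted.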

For the upstream API model, see the Gateway API HTTP routing guide and the cross-namespace routing guide.

Configure DynamoGraphDeployments for GAIE

In GAIE mode, the EPP chooses workers. The worker Frontend sidecar must run in direct routing mode so it honors the EPP selection instead of choosing a worker again.

    frontendSidecar: sidecar-frontend
    podTemplate:
      spec:
        containers:
          - name: sidecar-frontend
            args:
              - -m
              - dynamo.frontend
              - --router-mode
              - direct

The EPP component is part of the DynamoGraphDeployment. The operator creates the EPP Service and the matching InferencePool, so users apply the DGD and the route instead of hand-crafting the pool.
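
For orientation when reading the output of kubectl get inferencepool -o yaml, the sketch below shows the general shape of a pool under the upstream GAIE v1 InferencePool schema. It is an assumption about what a generated pool might look like, not the operator's actual output: the names, selector label, and port numbers are all hypothetical.

```yaml
# Sketch only: illustrates the upstream schema, not the operator's real output.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: my-model-pool              # hypothetical pool name
  namespace: my-model              # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: my-model-worker         # hypothetical worker pod label
  targetPorts:
    - number: 8000                 # hypothetical Frontend sidecar port
  endpointPickerRef:
    name: my-model-epp             # hypothetical EPP Service name
    port:
      number: 9002                 # hypothetical EPP port
```

The field to check when debugging is endpointPickerRef: it should name the EPP Service that the operator created from the same DGD.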

EPP Component Configuration

Start from the recipe EPP component and update the DynamoGraphDeployment for your cluster. Change deployment-level settings such as replicas and resources to fit gateway traffic volume. Change routing plugin settings only when you want different endpoint-selection behavior, then validate the result with production-like traffic.

| Setting | When to change it | Rule |
| --- | --- | --- |
| replicas and podTemplate.spec.containers[].resources | Scale or reserve capacity for EPP pods. | Keep EPP capacity aligned with gateway request volume. |
| DYN_MODEL_NAME | Change the served model. | Match the worker model name. |
| DYN_KV_CACHE_BLOCK_SIZE | Change the backend KV block size. | Match the backend --block-size value. |
| DYN_ENFORCE_DISAGG | Require strict prefill/decode routing separation. | Set "true" only for disaggregated deployments that should fail closed when topology labels are missing. |
| label-filter parameters | Change worker topology labels or component names. | Keep filter labels and values aligned with worker pod labels. |
| schedulingProfiles[].plugins[].weight | Adjust how much each scorer influences endpoint selection. | Tune scorer weights deliberately; keep required filters and the picker in the profile. |
| scorer and picker plugins | Change the routing strategy. | Treat this as advanced EPP tuning and validate with traffic. |

For upstream EPP configuration semantics, see the GAIE EPP YAML configuration guide and its Scheduling Profiles section for plugin weights. For label-based endpoint selection, see the upstream InferencePool configuration guide. The label-filter plugin shown here is Dynamo-specific; the component role label comes from the Dynamo ComponentType field.

The operator reconciles the EPP Deployment, EPP Service, and generated InferencePool from the DGD. Tune the DGD first; patch generated resources only for short-lived debugging.

    - name: Epp
      type: epp
      replicas: 1
      eppConfig:
        config:
          plugins:
            - type: disagg-profile-handler
            - name: decode-filter
              type: label-filter
              parameters:
                label: nvidia.com/dynamo-component-type
                validValues: [decode]
                allowsNoLabel: true # Aggregated recipes can route unlabeled decode pods.
            - name: dyn-decode
              type: dyn-decode-scorer
            - name: picker
              type: max-score-picker
          schedulingProfiles:
            - name: decode
              plugins:
                - pluginRef: decode-filter
                  weight: 1 # Keep topology filters aligned with the worker labels.
                - pluginRef: dyn-decode
                  weight: 1 # Tune scorer weights to change endpoint scoring.
                - pluginRef: picker
                  weight: 1

The DYN_* environment values are runtime contracts between the EPP router logic and the workers. Update them when the worker backend changes; do not use them to tune scoring.

    - name: Epp
      type: epp
      podTemplate:
        spec:
          containers:
            - name: main
              env:
                - name: DYN_MODEL_NAME
                  value: Qwen/Qwen3-0.6B # Match the worker model name.
                - name: DYN_KV_CACHE_BLOCK_SIZE
                  value: "16" # Match the worker backend's --block-size.
                - name: DYN_ENFORCE_DISAGG
                  value: "false" # Use "true" for disaggregated fail-closed behavior.

See the complete EPP examples in the source tree: recipes/qwen3-0.6b/vllm/agg/gaie/deploy.yaml for the Qwen3 0.6B aggregated recipe manifest, and examples/backends/vllm/deploy/gaie/disagg.yaml for the Qwen3 0.6B disaggregated example manifest.

Routing Behavior

GAIE does not mandate a single scoring strategy. Choose the routing behavior based on the routing state available to the EPP.

| Mode | What the EPP uses | When to use it |
| --- | --- | --- |
| KV cache aware routing | Worker-published KV cache events delivered through the Dynamo event plane. | Default path when workers publish KV events and you want cache locality to influence endpoint selection. |
| Approximate routing | Endpoint availability plus local bookkeeping from tokenized requests and request lifecycle. | Fallback path when precise worker-published KV events are unavailable, disabled, or not yet supported by the chosen backend or deployment shape. |

With operator-managed GAIE, NATS/JetStream backs routing-state delivery. The EPP can receive startup state and subsequent updates through the Dynamo event plane instead of rebuilding all state from new traffic after every EPP restart.

Compatibility and Defaults

The quickstart pins the Gateway API layer so manual setup is repeatable. Keep the Dynamo platform, EPP, and runtime images on the same Dynamo release line.

| Component | Default shown here | Notes |
| --- | --- | --- |
| Gateway API CRDs | v1.5.1 | Installed from the upstream Gateway API release. |
| GAIE CRDs | v1.2.1 | Installed from the upstream Gateway API Inference Extension release. |
| agentgateway | v1.0.0 | Installed with inferenceExtension.enabled=true. |
| Istio | 1.29.2 | Install with ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true. |
| Dynamo images | 1.2.1 | Use one Dynamo release line for the platform chart, EPP image, and runtime images. |

Troubleshooting Signals

| Symptom | Likely cause | Check |
| --- | --- | --- |
| HTTPRoute is not accepted | parentRefs points at the wrong Gateway name or namespace. | kubectl describe httproute -n <model-namespace> and compare spec.parentRefs with the Gateway. |
| Requests reach a model but EPP logs stay quiet | The route bypasses the InferencePool, or the pool points at the wrong EPP service. | Verify rules.backendRefs points at the InferencePool and endpointPickerRef points at the Dynamo EPP service. |
| EPP does not receive routing state | Dynamo event-plane components are not ready, or image tags do not match. | Check Dynamo platform pods, DGD status, EPP logs, and image tags against the compatibility table. |
| Istio path cannot call the EPP | Istio was installed without GAIE enabled, or mesh TLS policy blocks the EPP call. | Confirm ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true and configure the EPP DestinationRule. |
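
For the Istio row, an EPP DestinationRule might look like the sketch below. It assumes the EPP serves TLS with a certificate the mesh cannot verify, which is why verification is skipped; confirm how your EPP is configured before copying it. The resource name, namespace, and host are hypothetical placeholders.

```yaml
# Sketch only: assumes the EPP serves TLS with a self-signed certificate.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: epp-tls                    # hypothetical name
  namespace: my-model              # hypothetical namespace
spec:
  host: my-model-epp               # hypothetical EPP Service name
  trafficPolicy:
    tls:
      mode: SIMPLE                 # originate TLS from the gateway to the EPP
      insecureSkipVerify: true     # skips certificate verification; tighten for production
```

The host must match the EPP Service that the InferencePool references through endpointPickerRef.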

Next Step

Run the GAIE Quickstart to deploy a DynamoGraphDeployment, expose it through Gateway API, and verify an end-to-end request through the Dynamo EPP.

Use the GAIE Reference for resource contracts, routing knobs, and service mesh settings.