Inference Gateway (GAIE)
Inference Gateway Setup with Dynamo
When integrating Dynamo with the Inference Gateway you must use the custom Dynamo EPP image.
The custom Dynamo EPP image integrates the Dynamo router directly into the gateway’s endpoint picker. Using the dyn-kv plugin, it selects the optimal worker based on KV cache state and tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.
EPP’s default kv-routing approach is not token-aware because the prompt is not tokenized. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in helm/dynamo-gaie/epp-config-dynamo.yaml, per EPP convention.
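As a rough sketch, such a plugin configuration enables the Dynamo picker in the EPP's scheduling profile. The plugin name `dyn-kv` comes from this guide; the surrounding field names follow the GAIE EndpointPickerConfig convention and are illustrative only — consult the shipped epp-config-dynamo.yaml for the real schema:

```yaml
# Illustrative EPP plugin configuration enabling the Dynamo
# token-aware KV router; field layout is a sketch, not the
# authoritative schema.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: dyn-kv          # Dynamo token-aware KV routing plugin
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: dyn-kv
```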
Currently, these setups are only supported with the kGateway-based Inference Gateway.
Table of Contents
Prerequisites
- Kubernetes cluster with kubectl configured
- NVIDIA GPU drivers installed on worker nodes
Installation Steps
1. Install Dynamo Platform
See Quickstart Guide to install Dynamo Kubernetes Platform.
2. Deploy Inference Gateway
First, deploy an inference gateway service. In this example, we install the kGateway-based gateway implementation.
Note: The manifest at config/manifests/gateway/kgateway/gateway.yaml uses gatewayClassName: agentgateway, but kGateway’s helm chart creates a GatewayClass named kgateway. The patch command in the script fixes this mismatch.
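If you prefer to fix the manifest instead of patching it afterwards, the Gateway can simply reference the GatewayClass that the kGateway helm chart actually creates. Resource names below mirror the upstream example and may differ in your setup:

```yaml
# Gateway referencing the GatewayClass created by the kGateway
# helm chart ("kgateway" instead of "agentgateway").
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
```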
f. Verify the Gateway is running
3. Setup secrets
Create the Docker registry secret if your cluster needs one to pull images.
Do not forget to include the HuggingFace token.
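For example, the HuggingFace token can be supplied as an Opaque secret. The secret and key names below are illustrative; match whatever names your deployment references:

```yaml
# Illustrative HuggingFace token secret; name and key must match
# what your worker pods expect.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: my-model
type: Opaque
stringData:
  HF_TOKEN: <your-huggingface-token>
```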
Create a model configuration file similar to vllm_agg_qwen.yaml for your model. This file demonstrates the values needed for the vLLM aggregated setup in agg.yaml. Take note of the model’s block size, provided in the model card.
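A sketch of the kind of values such a file carries, modeled loosely on vllm_agg_qwen.yaml. The key names here are assumptions; check the real file for the actual schema:

```yaml
# Illustrative model configuration values (key names are
# assumptions; consult vllm_agg_qwen.yaml for the real layout).
model:
  name: Qwen/Qwen3-0.6B
  blockSize: 16   # must match the block size from the model card / engine
```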
4. Build EPP image (Optional)
You can either use the provided Dynamo FrontEnd image as the EPP image, or build your own custom Dynamo EPP image following the steps below.
All-in-one Targets
5. Deploy
We recommend deploying the Inference Gateway’s Endpoint Picker as a component managed by the Dynamo operator. Alternatively, you can deploy it as a standalone pod.
5.a. Deploy as a DGD component (recommended)
We provide an example for llama-3-70b vLLM below.
- When using GAIE, the FrontEnd does not choose the workers; routing is determined in the EPP.
- You must enable the flag in the FrontEnd CLI as shown below.
- The pre-selected workers (decode and prefill, in the case of disaggregated serving) are passed in the request headers.
- The flag ensures that routing respects this selection.
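The bullets above translate into a DGD fragment along these lines. The component layout is a sketch, and the `--direct-route` flag name is taken from the standalone-pod section later in this guide:

```yaml
# Sketch of a DGD Frontend component that defers worker selection
# to the EPP; structure is illustrative.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama3-70b-agg
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          args:
            - --direct-route   # respect the worker pre-selected by the EPP
```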
Startup Probe Timeout: The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the failureThreshold in the EPP’s startupProbe. For example,
to allow 60 minutes for startup:
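At the default 10-second period, 60 minutes works out to 360 allowed failures:

```yaml
# startupProbe allowing up to 60 minutes (10s × 360) for the EPP to start.
startupProbe:
  periodSeconds: 10
  failureThreshold: 360
```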
Gateway Namespace
Note that this assumes your gateway is installed in the my-model namespace (NAMESPACE=my-model, the examples’ default).
If you installed it into a different namespace, adjust the HttpRoute entry in http-route.yaml accordingly.
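Concretely, the parentRefs entry in http-route.yaml must point at the namespace the gateway actually lives in. Resource names below are illustrative:

```yaml
# HTTPRoute fragment: parentRefs.namespace must match the
# namespace the gateway was installed into.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-70b-route
  namespace: my-model
spec:
  parentRefs:
    - name: inference-gateway
      namespace: my-model   # change if the gateway lives elsewhere
```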
5.b. Deploy as a standalone pod
5.b.1 Deploy Your Model
We provide an example for Qwen vLLM below.
Before deploying you must enable the --direct-route flag in the FrontEnd cli in your Dynamo Graph.
Follow the steps in model deployment to deploy the Qwen/Qwen3-0.6B model in aggregated mode using agg.yaml in the my-model Kubernetes namespace.
Sample commands to deploy model:
5.b.2 Install Dynamo GIE helm chart
By default, the Kubernetes discovery mechanism is used. If you prefer etcd, add the --set epp.dynamo.useEtcd=true flag to the helm command below.
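In values-file terms, the etcd switch is the values.yaml equivalent of the `--set` form above:

```yaml
# values.yaml equivalent of --set epp.dynamo.useEtcd=true
epp:
  dynamo:
    useEtcd: true
```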
Key configurations include:
- An InferenceModel resource for the Qwen model
- A service for the inference gateway
- Required RBAC roles and bindings
- RBAC permissions
- dynamoGraphDeploymentName - the name of the Dynamo Graph where your model is deployed.
Configuration
You can configure the plugin by setting environment variables in the EPP component of your DGD (for an operator-managed installation) or in your values.yaml (for a standalone installation).
Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. The default is negative, meaning this check is disabled.
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce that every request is served in the disaggregated manner. The default is false, meaning that if no prefill worker is available the request is served in the aggregated manner.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores; the router samples workers via softmax at this temperature. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). (default: 0.0)
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing. (default: true)
- Set `DYN_ROUTER_REPLICA_SYNC` to enable replica synchronization. (default: false)
- Set `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` to track active blocks. (default: true)
- Set `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` to track output blocks during generation. (default: false)
- See the KV cache routing design for details.
Stand-Alone installation only:
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model’s Dynamo namespace.
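Putting the variables above together, an operator-managed EPP component might carry them as plain env entries. The surrounding component layout is omitted here; values shown are the documented defaults except `DYN_OVERLAP_SCORE_WEIGHT`, raised purely as an example:

```yaml
# Illustrative env block for the EPP component.
env:
  - name: DYN_OVERLAP_SCORE_WEIGHT
    value: "2"            # bias harder toward predicted KV cache hits
  - name: DYN_ROUTER_TEMPERATURE
    value: "0.0"          # deterministic top-candidate selection
  - name: DYN_USE_KV_EVENTS
    value: "true"
  - name: DYN_NAMESPACE   # standalone install only
    value: my-model       # match your model's Dynamo namespace
```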
6. Verify Installation
Check that all resources are properly deployed:
Sample output:
7. Usage
The Inference Gateway provides HTTP endpoints for model inference.
1: Populate gateway URL for your k8s cluster
To test the gateway in minikube, use the following command:
a. Use minikube tunnel to expose the gateway to the host
This requires sudo access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host, as shown in alternative (b).
b. Use port-forward to expose the gateway to the host
2: Check models deployed to inference gateway
a. Query models:
Sample output:
b. Send inference request to gateway:
Sample inference output:
If you have more than one HttpRoute running on the cluster
Add a hostname to your HttpRoute.yaml and pass the matching header (curl -H "Host: llama3-70b-agg.example.com" ...) with every request.
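The hostname side of that looks like the fragment below; the hostname is the example used above:

```yaml
# HTTPRoute fragment pinning a hostname so requests can be
# disambiguated with a Host header.
spec:
  hostnames:
    - llama3-70b-agg.example.com
```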
8. Deleting the installation
If you need to uninstall, run:
Gateway API Inference Extension Integration
This section documents the updated plugin implementation for Gateway API Inference Extension v1.2.1.
Router bookkeeping operations
The EPP performs Dynamo router bookkeeping operations so the FrontEnd’s router does not have to sync its state.
Header Routing Hints
Since v1.2.1, the EPP uses a header-only approach for communicating routing decisions. The plugins set HTTP headers that are forwarded to the backend workers.