GAIE Quickstart | NVIDIA Dynamo Documentation

This quickstart deploys a Dynamo operator-managed serving graph behind Gateway API. The Gateway receives requests, GAIE calls the Dynamo EPP for endpoint selection, and the selected worker’s Frontend sidecar forwards the request in direct mode.

What This Deploys

Prerequisites

Kubernetes cluster with GPU nodes.
kubectl, Helm, and jq.
Access to nvcr.io/nvidia/ai-dynamo images for the Dynamo release you use.
Hugging Face access to Qwen/Qwen3-0.6B.
Shared RWX storage for the recipe’s model-cache PVC.

Set the common variables:

$ export DYNAMO_VERSION=1.2.1
$ export NAMESPACE=gaie-dynamo
$ export DYNAMO_SYSTEM_NAMESPACE=dynamo-system
$ export AGW_NAMESPACE=agentgateway-system
$ export ISTIO_NAMESPACE=istio-system
$ 
$ kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

Clone the Dynamo source tree that contains the recipe manifests used below:

$ git clone https://github.com/ai-dynamo/dynamo.git
$ cd dynamo

Install Dynamo Platform

Install the Dynamo platform and operator with the Installation Guide. Use the same Dynamo release line for the platform chart, EPP image, and runtime images.

After installation, verify that the Helm release exists in the platform namespace:

$ helm status dynamo-platform --namespace "$DYNAMO_SYSTEM_NAMESPACE"

Create Model Credentials

Create model credentials if the model requires them. This is a Dynamo model-serving prerequisite, not a GAIE-specific resource. The general Kubernetes quickstart explains the Hugging Face token secret pattern.

$ export HF_TOKEN='your-hf-token'
$ 
$ kubectl create secret generic hf-token-secret \
>   -n "$NAMESPACE" \
>   --from-literal=HF_TOKEN="$HF_TOKEN"

Install Gateway API and GAIE CRDs

Install the Gateway API layer explicitly. If your platform team already installed Gateway API, GAIE, and a compatible Gateway implementation, skip to Create the Gateway.

$ kubectl apply --server-side --force-conflicts \
>   -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml
$ 
$ kubectl apply \
>   -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.2.1/manifests.yaml

Create the Gateway

Choose the Gateway implementation for this namespace.

agentgateway

Istio

$ helm upgrade -i --create-namespace --namespace "$AGW_NAMESPACE" --version v1.0.0 \
>   agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds
$ 
$ helm upgrade -i --namespace "$AGW_NAMESPACE" --version v1.0.0 \
>   agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
>   --set inferenceExtension.enabled=true \
>   --wait
$ 
$ kubectl get gatewayclass agentgateway

Create the AgentgatewayParameters resource in the model namespace. The parameters resource excludes Istio sidecar injection from agentgateway-proxy pods when the namespace has istio-injection=enabled.

$ kubectl apply --server-side -n "$NAMESPACE" -f - <<'YAML'
$ apiVersion: agentgateway.dev/v1alpha1
$ kind: AgentgatewayParameters
$ metadata:
$   name: inference-gateway-params
$ spec:
$   deployment:
$     spec:
$       template:
$         metadata:
$           annotations:
$             sidecar.istio.io/inject: "false"
$ YAML

Create the Gateway that uses those parameters.

$ kubectl apply -n "$NAMESPACE" -f - <<'YAML'
$ apiVersion: gateway.networking.k8s.io/v1
$ kind: Gateway
$ metadata:
$   name: inference-gateway
$ spec:
$   gatewayClassName: agentgateway
$   infrastructure:
$     parametersRef:
$       group: agentgateway.dev
$       kind: AgentgatewayParameters
$       name: inference-gateway-params
$   listeners:
$     - name: http
$       port: 80
$       protocol: HTTP
$ YAML

Wait for the gateway controller to program the gateway.

$ kubectl wait gateway/inference-gateway -n "$NAMESPACE" \
>   --for=condition=Programmed --timeout=180s

Keeping the Gateway and HTTPRoute in the same namespace avoids a cross-namespace parentRefs[].namespace field in the route.

Prepare the Model Cache

The Qwen recipe mounts a shared model-cache PVC. Edit recipes/qwen3-0.6b/model-cache/model-cache.yaml first and set storageClassName to a RWX storage class available in your cluster. For the general pattern, see Model Caching.

$ kubectl apply -n "$NAMESPACE" -f recipes/qwen3-0.6b/model-cache/
$ 
$ kubectl wait --for=condition=Complete job/model-download \
>   -n "$NAMESPACE" --timeout=3600s

Deploy the Serving Graph

Deploy the Qwen 0.6B aggregated recipe and its route:

$ kubectl apply -n "$NAMESPACE" \
>   -f recipes/qwen3-0.6b/vllm/agg/gaie/deploy.yaml
$ 
$ kubectl apply -n "$NAMESPACE" \
>   -f recipes/qwen3-0.6b/vllm/agg/gaie/httproute.yaml

Wait for the operator-created resources:

$ export ROUTE_MODEL=Qwen/Qwen3-0.6B
$ 
$ kubectl wait -n "$NAMESPACE" dynamographdeployment/qwen3-0-6b-agg \
>   --for=condition=Ready --timeout=1800s
$ 
$ kubectl get inferencepool qwen3-0-6b-agg-pool -n "$NAMESPACE"
$ kubectl get httproute qwen3-0-6b-agg -n "$NAMESPACE"

Verify End-to-End

Use one access mode to set GATEWAY_URL, then send a request through the Gateway and EPP. Keep ROUTE_MODEL set to the model name from the route manifest you applied.

Port-forward

LoadBalancer or tunnel

$ export GATEWAY_SERVICE=$(kubectl get svc -n "$NAMESPACE" \
>   -l gateway.networking.k8s.io/gateway-name=inference-gateway \
>   -o jsonpath='{.items[0].metadata.name}')
$ 
$ kubectl -n "$NAMESPACE" port-forward "svc/$GATEWAY_SERVICE" 8000:80

In another terminal:

$ export GATEWAY_URL=http://localhost:8000

$ curl --max-time 20 -sS "$GATEWAY_URL/v1/models" \
>   -H "X-Gateway-Model-Name: $ROUTE_MODEL" | jq .

Finish by checking that the EPP path handled the request. A successful smoke test should show the EPP receiving endpoint-picker traffic and selecting a worker near the time of your request; that proves the request flowed through Gateway API and the Dynamo EPP before it reached the Frontend sidecar.

$ kubectl logs -n "$NAMESPACE" -l nvidia.com/dynamo-component-type=epp --tail=200

If the log output is quiet, run the chat request again while tailing the EPP logs in another terminal.

Troubleshooting

agentgateway

Istio

$ kubectl describe gateway inference-gateway -n "$NAMESPACE"
$ kubectl get pods -n "$AGW_NAMESPACE"
$ kubectl logs -n "$AGW_NAMESPACE" deployment/agentgateway --tail=50
$ kubectl get gatewayclass agentgateway
$ kubectl get inferencepool -n "$NAMESPACE"
$ kubectl describe httproute -n "$NAMESPACE"

If requests return HTTP 500 and the namespace has istio-injection=enabled, verify the agentgateway-proxy pod does not have an istio-proxy sidecar:

$ kubectl get pods -n "$NAMESPACE" \
>   -l gateway.networking.k8s.io/gateway-name=inference-gateway \
>   -o jsonpath='{.items[*].spec.containers[*].name}'

See GAIE Reference for the sidecar injection contract.

If model pods restart while loading, inspect the pod events. When events show startup probe failures and the model load time is expected, increase startupProbe.failureThreshold on the affected DGD component. This is general Kubernetes probe tuning, not a GAIE-specific setting.

Clean Up

If this namespace is only for the quickstart, delete it:

$ kubectl delete namespace "$NAMESPACE"