GAIE Reference

Runtime contracts, routing controls, and mesh integration for Dynamo with GAIE.

Use this reference after the GAIE Quickstart when you need to inspect generated resources, tune routing behavior, or adapt the Gateway API path to a cluster policy.

This page is a user-facing runtime reference. It does not cover building custom EPP images, local development loops, minikube-specific setup, or full uninstall procedures.

Resource Contract

Operator-managed GAIE connects Gateway API resources to the Dynamo serving graph through an operator-generated InferencePool.

| Resource | Contract |
| --- | --- |
| Gateway | Owns listeners, addresses, and Gateway implementation behavior. |
| HTTPRoute | Attaches traffic to the Gateway through spec.parentRefs and points rules[].backendRefs at the InferencePool. |
| InferencePool | Defines the eligible backend pod set and the EPP endpoint the gateway calls before forwarding. |
| EPP Service | Exposes the Dynamo EPP on gRPC port 9002. |
| Frontend sidecar | Receives the selected request on the pod’s http port and runs with --router-mode direct. |

The Dynamo operator creates the InferencePool for a DynamoGraphDeployment that contains an EPP component. The pool name is <dgd-name>-pool, it lives in the DGD namespace, and its endpointPickerRef points to the generated EPP Service.
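
As a minimal sketch of how an HTTPRoute could target the generated pool: this assumes the GA InferencePool API group (inference.networking.k8s.io) and a Gateway named inference-gateway in the same namespace as the DGD. The route name and the path match are hypothetical; substitute your own DGD name and namespace for the placeholders.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <dgd-name>-route            # hypothetical route name
  namespace: <dgd-namespace>        # assumed: same namespace as the DGD and its pool
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway       # assumed Gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /                # hypothetical: send all paths to the pool
      backendRefs:
        - group: inference.networking.k8s.io   # assumed InferencePool API group
          kind: InferencePool
          name: <dgd-name>-pool     # operator-generated pool name
```

The backendRef names the pool, not a Service, so the gateway consults the pool's endpointPickerRef before it forwards the request.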

The generated selector matches worker pods by Dynamo labels:

    spec:
      selector:
        matchLabels:
          nvidia.com/dynamo-component-class: worker
          nvidia.com/dynamo-namespace: <dynamo-namespace>
      endpointPickerRef:
        kind: Service
        name: <dgd-name>-epp
        port:
          number: 9002
      targetPorts:
        - number: 8000

Do not hand-edit the generated InferencePool unless you also keep its selector aligned with the operator’s worker-pod labels. If the pool selector and Dynamo discovery disagree, the EPP can select a worker that the gateway data plane refuses to forward to.

For the upstream Gateway API model, see the HTTP routing guide and cross-namespace routing guide.
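
When the Gateway and the DGD live in different namespaces, the upstream cross-namespace model relies on the listener's allowedRoutes field and on a namespace in the route's parentRefs. The following is a sketch under stated assumptions, not a tested configuration: the namespace names and the shared-gateway-access label are hypothetical, and the InferencePool group is assumed to be inference.networking.k8s.io.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: gateway-infra               # hypothetical Gateway namespace
spec:
  gatewayClassName: agentgateway
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              shared-gateway-access: "true"   # hypothetical label on the DGD namespace
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <dgd-name>-route                 # hypothetical route name
  namespace: <dgd-namespace>             # route stays next to the generated pool
spec:
  parentRefs:
    - name: inference-gateway
      namespace: gateway-infra           # points at the Gateway's namespace
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io   # assumed InferencePool API group
          kind: InferencePool
          name: <dgd-name>-pool
```

Keeping the HTTPRoute in the DGD namespace means the backendRef to the pool stays namespace-local; only the attachment to the Gateway crosses namespaces.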

Request Contract

With GAIE, worker selection happens in the EPP before the request reaches the worker sidecar. The sidecar must run in direct mode so it honors the EPP decision instead of routing again.

    frontendSidecar: sidecar-frontend
    podTemplate:
      spec:
        containers:
          - name: sidecar-frontend
            args:
              - -m
              - dynamo.frontend
              - --router-mode
              - direct

The EPP sends routing decisions to the selected sidecar through request headers.

| Header | Meaning |
| --- | --- |
| x-dynamo-worker-instance-id | Decode or aggregated worker selected for the request. |
| x-dynamo-dp-rank | Data-parallel rank for the selected decode or aggregated worker. |
| x-dynamo-routing-mode | aggregated or disaggregated. |
| x-dynamo-prefill-instance-id | Prefill worker selected for disaggregated requests. |
| x-dynamo-prefill-dp-rank | Data-parallel rank for the selected prefill worker, when present. |
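
As an illustration of the header table, a disaggregated request that arrives at the selected sidecar might carry a header set like the one below. The values are placeholders rather than real identifiers, and the YAML mapping is only a readable notation for the header set, not a wire format.

```yaml
# Placeholder values; shown as a YAML mapping for readability only.
x-dynamo-routing-mode: disaggregated
x-dynamo-worker-instance-id: "<decode-instance-id>"    # decode worker chosen by the EPP
x-dynamo-dp-rank: "<decode-dp-rank>"
x-dynamo-prefill-instance-id: "<prefill-instance-id>"  # disaggregated requests only
x-dynamo-prefill-dp-rank: "<prefill-dp-rank>"          # only when present
```

For an aggregated request, the two prefill headers would not apply and x-dynamo-routing-mode would read aggregated.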

For body-bearing OpenAI requests, the EPP also tokenizes the request and injects token data into the request body so the sidecar can avoid repeating the same tokenization work.

Routing Modes

The same Dynamo router logic can run behind the Dynamo-native Frontend entry path or inside the GAIE EPP. In the Gateway API path, the EPP owns endpoint selection and the worker sidecar owns request forwarding.

| Mode | EPP input | When to use |
| --- | --- | --- |
| KV cache aware routing | Worker KV cache events plus local request bookkeeping. | Use when workers publish KV events and cache locality should influence endpoint selection. |
| Approximate routing | Tokenized requests, request lifecycle, and local predicted state. | Use when KV events are unavailable, disabled, or not supported by the selected deployment shape. |

In the operator-managed GAIE path, KV events reach the EPP through the Dynamo event plane using NATS/JetStream. vLLM can also publish KV events through ZMQ in other integration shapes; the operator-managed DynamoGraphDeployment path does not use ZMQ for the EPP.

To use KV cache aware routing:

  1. Enable worker prefix caching and KV event publishing for your backend.
  2. Keep EPP KV events enabled.
  3. Keep the worker KV block size aligned with the EPP block size.

Backend examples:

| Backend | Worker setting |
| --- | --- |
| vLLM | Pass --enable-prefix-caching and --kv-events-config '{"enable_kv_cache_events":true}'. |
| SGLang | Pass the backend’s supported --kv-events-config. |
| TensorRT-LLM | Pass --publish-events-and-metrics. |

Set DYN_KV_CACHE_BLOCK_SIZE on the EPP only when discovery does not already provide the backend’s block size. It must match the workers’ --block-size. A mismatch changes the block hashes used for prefix overlap and produces incorrect routing scores.
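
A minimal sketch of keeping the two block sizes aligned for a vLLM worker. The value 16 is hypothetical, and where the args and env lists sit in your manifest depends on how your DynamoGraphDeployment is written; only the flag and variable names come from this page.

```yaml
# Fragment 1: args on the vLLM worker container.
args:
  - --enable-prefix-caching
  - --kv-events-config
  - '{"enable_kv_cache_events":true}'
  - --block-size
  - "16"          # hypothetical value
---
# Fragment 2: env on the EPP component.
# Needed only if discovery does not already report the block size.
env:
  - name: DYN_KV_CACHE_BLOCK_SIZE
    value: "16"   # must equal the worker --block-size above
```

If the worker ran with a different --block-size than the EPP value, the two sides would split the same prompt into different blocks and their hashes would not line up, which is the mismatch described above.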

To use approximate routing, disable worker KV events and configure the EPP to use predicted local state:

    env:
      - name: DYN_USE_KV_EVENTS
        value: "false"
      - name: DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT
        value: "0"

Router Tuning

Set these values on the EPP component unless the deployment manifest says otherwise.

| Setting | Default | Effect |
| --- | --- | --- |
| DYN_ENFORCE_DISAGG | false | When true, fail requests if prefill routing is unavailable. When false, fall back to aggregated routing until prefill workers appear. |
| DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT | 1.0 | Controls device-local prefix-overlap credit. Higher values prefer workers with cached prompt prefixes. |
| DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT_DECAY | 0.0 | Reduces prefix-overlap credit as active prefill load rises above the least-loaded eligible worker. |
| DYN_ROUTER_PREFILL_LOAD_SCALE | 1.0 | Scales prompt-side prefill load after cache-hit credits are applied. |
| DYN_ROUTER_TEMPERATURE | 0.0 | 0.0 selects deterministically. Higher values allow more worker exploration through softmax sampling. |
| DYN_ROUTER_REPLICA_SYNC | false | Publishes and subscribes router state across router replicas. |
| DYN_ROUTER_TRACK_ACTIVE_BLOCKS | true | Tracks active decode blocks for load balancing. |
| DYN_ROUTER_TRACK_OUTPUT_BLOCKS | false | Predicts output blocks during generation and decays them by progress toward expected output length. |
| DYN_ROUTER_TRACK_PREFILL_TOKENS | true | Includes active prompt-side prefill tokens in load accounting. |
| DYN_ROUTER_PREDICTED_TTL_SECS | unset | Enables predicted entries in the local indexer for this TTL when KV events are enabled. |
| DYN_ADMISSION_CONTROL | none | Set to token-capacity to skip workers that exceed active decode or prefill thresholds. |
| DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD | unset | Decode worker is busy above this active-block fraction. Setting a numeric value enables token-capacity admission. |
| DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD | unset | Worker is busy above this absolute active-prefill-token count. |
| DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC | unset | Worker is busy above this fraction of max_num_batched_tokens. |
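
As a sketch of how several of these settings could be combined on the EPP component: the variable names come from the table, while the numeric thresholds are hypothetical starting points rather than recommended values.

```yaml
env:
  - name: DYN_ADMISSION_CONTROL
    value: "token-capacity"   # skip workers that exceed the thresholds below
  - name: DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD
    value: "0.85"             # hypothetical: busy above 85% active decode blocks
  - name: DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC
    value: "0.9"              # hypothetical: fraction of max_num_batched_tokens
  - name: DYN_ENFORCE_DISAGG
    value: "true"             # fail requests instead of falling back to aggregated routing
  - name: DYN_ROUTER_TEMPERATURE
    value: "0.0"              # default: deterministic selection
```

Environment variable values are quoted because Kubernetes expects them as strings.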

For the broader router configuration surface, see Router Configuration.

Service Mesh Integration

The EPP serves gRPC on port 9002. When an Istio sidecar mediates traffic from the gateway proxy to the EPP service, configure mesh TLS explicitly so the proxy connects with the TLS mode that the EPP serves.

Enable operator-managed Istio DestinationRule generation when installing or upgrading the Dynamo platform chart:

    helm upgrade -i dynamo-platform \
      oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform \
      --version "$DYNAMO_VERSION" \
      --namespace "$DYNAMO_SYSTEM_NAMESPACE" \
      --reuse-values \
      --set dynamo.serviceMesh.enabled=true \
      --set dynamo.serviceMesh.provider=istio

The platform values are:

| Value | Default | Meaning |
| --- | --- | --- |
| dynamo.serviceMesh.enabled | false | Generate service-mesh resources for EPP services. |
| dynamo.serviceMesh.provider | istio | Mesh provider. Only Istio is supported. |
| dynamo.serviceMesh.istio.tlsMode | SIMPLE | TLS mode for generated DestinationRule resources. |
| dynamo.serviceMesh.istio.insecureSkipVerify | true | Skip server certificate verification for the EPP’s self-signed certificate. |
| dynamo.serviceMesh.istio.clientCertificate | "" | Client certificate path for MUTUAL TLS mode. |
| dynamo.serviceMesh.istio.privateKey | "" | Client private key path for MUTUAL TLS mode. |
| dynamo.serviceMesh.istio.caCertificates | "" | CA certificate path for MUTUAL TLS mode. |
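
The same values can be kept in a values file and passed to helm upgrade with -f. The sketch below shows a MUTUAL TLS configuration; the key names follow the table, but the certificate paths are hypothetical, and turning off insecureSkipVerify here is an assumption about what you would want once a CA is supplied.

```yaml
dynamo:
  serviceMesh:
    enabled: true
    provider: istio
    istio:
      tlsMode: MUTUAL
      insecureSkipVerify: false                   # assumption: verify the EPP certificate against the CA below
      clientCertificate: /etc/certs/client.pem    # hypothetical path
      privateKey: /etc/certs/client-key.pem       # hypothetical path
      caCertificates: /etc/certs/ca.pem           # hypothetical path
```

The paths must be readable by the proxy that originates the connection, since Istio resolves them on the client side of the DestinationRule.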

When enabled and Istio CRDs are installed, the operator creates a DestinationRule for each EPP service:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: <epp-service-name>
    spec:
      host: <epp-service-name>.<namespace>.svc.cluster.local
      trafficPolicy:
        tls:
          mode: SIMPLE
          insecureSkipVerify: true

If you install without the Dynamo operator Helm chart or leave dynamo.serviceMesh.enabled=false, create an equivalent DestinationRule for each EPP service used through Istio.

agentgateway and Istio Injection

When namespace-level Istio injection is enabled, the agentgateway-proxy pod can receive an Istio sidecar. That sidecar can intercept the ext_proc gRPC connection from agentgateway to the EPP and cause HTTP 500 responses from the gateway.

Use a per-Gateway AgentgatewayParameters resource in the same namespace as the Gateway:

    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayParameters
    metadata:
      name: inference-gateway-params
    spec:
      deployment:
        spec:
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"

Reference that parameters resource from the Gateway:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: agentgateway
      infrastructure:
        parametersRef:
          group: agentgateway.dev
          kind: AgentgatewayParameters
          name: inference-gateway-params
      listeners:
        - name: http
          port: 80
          protocol: HTTP

Verify that the proxy pod does not contain istio-proxy:

    kubectl get pods -n "$NAMESPACE" \
      -l gateway.networking.k8s.io/gateway-name=inference-gateway \
      -o jsonpath='{.items[*].spec.containers[*].name}{"\n"}'

[!WARNING] Patch the default AgentgatewayParameters resource in agentgateway-system only as a cluster-wide policy decision. Gateways without spec.infrastructure.parametersRef inherit that default.

Developer References

Image build commands belong with the component source, not in this user reference. Use this source location when developing or replacing the standard EPP image: