Deploying Dynamo on Kubernetes


High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

Important Terminology

Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.

  • Used for: Resource isolation, RBAC, organizing deployments
  • Example: dynamo-system, team-a-namespace

Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.

  • Used for: Runtime component communication, service discovery
  • Specified in: .spec.services.<ServiceName>.dynamoNamespace field
  • Example: my-llm, production-model, dynamo-dev

These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
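As a minimal sketch (names are illustrative, reusing the examples above), the Kubernetes namespace appears in the resource metadata while the Dynamo namespace is set per service:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: team-a-namespace      # Kubernetes namespace: isolation, RBAC, kubectl -n scope
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm      # Dynamo namespace: runtime service discovery between components
      componentType: frontend
      replicas: 1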

Prerequisites

Before you begin, ensure you have the following tools installed:

| Tool    | Minimum Version | Installation Guide |
|---------|-----------------|--------------------|
| kubectl | v1.24+          | Install kubectl    |
| Helm    | v3.0+           | Install Helm       |

Verify your installation:

$kubectl version --client # Should show v1.24+
$helm version # Should show v3.0+

For detailed installation instructions, see the Prerequisites section in the Installation Guide.

Pre-deployment Checks

Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:

$./deploy/pre-deployment/pre-deployment-check.sh

This validates kubectl connectivity, StorageClass configuration, and GPU availability. See pre-deployment checks for more details.
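If you want to spot-check the same things by hand, the rough equivalents are ordinary kubectl queries (an approximation only; the script is the authoritative check):

$kubectl cluster-info        # kubectl can reach the cluster
$kubectl get storageclass    # a (default) StorageClass exists
$kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'   # GPUs are allocatable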

1. Install Platform First

$# 1. Set environment
$export NAMESPACE=dynamo-system
$export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
$
$# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
$helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
$helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
$
$# 3. Install Platform
$helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
$helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

For Shared/Multi-Tenant Clusters:

If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:

$--set dynamo-operator.namespaceRestriction.enabled=true
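Combined with the platform install command from step 3, the full invocation looks like this:

$helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace \
>   --set dynamo-operator.namespaceRestriction.enabled=true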

For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.

2. Choose Your Backend

Each backend has deployment examples and configuration options:

| Backend      | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|------------|---------------------|---------------|------------------------|-------------------------|--------------------------|
| SGLang       | βœ…         | βœ…                  | βœ…            | βœ…                     | βœ…                      | βœ…                       |
| TensorRT-LLM | βœ…         | βœ…                  | βœ…            | βœ…                     | πŸš§                      | βœ…                       |
| vLLM         | βœ…         | βœ…                  | βœ…            | βœ…                     | βœ…                      | βœ…                       |

3. Deploy Your First Model

$export NAMESPACE=dynamo-system
$kubectl create namespace ${NAMESPACE}
$
$# to pull model from HF
$export HF_TOKEN=<Token-Here>
$kubectl create secret generic hf-token-secret \
> --from-literal=HF_TOKEN="$HF_TOKEN" \
> -n ${NAMESPACE};
$
$# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
$kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
$
$# Check status
$kubectl get dynamoGraphDeployment -n ${NAMESPACE}
$
$# Test it (port-forward blocks this terminal; run curl from another one)
$kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
$curl http://localhost:8000/v1/models

For SLA-based autoscaling, see SLA Planner Guide.

Understanding Dynamo’s Custom Resources

Dynamo provides two main Kubernetes Custom Resources for deploying models:

DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration

The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:

  • Model name and backend framework
  • SLA targets (latency requirements)
  • GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:

  • SLA-driven configuration generation
  • Automated resource optimization
  • Users who want simplicity over control

Note: DGDR generates a DGD spec, which you can then deploy.
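A DGDR manifest might look roughly like the sketch below. The kind and API group follow the DGD example later in this guide; the field names under spec are illustrative placeholders for the inputs listed above (model, backend, SLA targets, optional GPU type), so consult the API Reference for the authoritative schema:

apiVersion: nvidia.com/v1alpha1            # same API group as DGD; version assumed
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-llm-request
spec:
  # Illustrative field names only -- not the authoritative schema:
  model: Qwen/Qwen3-0.6B                   # model name
  backend: vllm                            # backend framework
  sla:                                     # latency targets, e.g. time-to-first-token / inter-token latency
    ttft: 200ms
    itl: 20ms
  gpu: H100                                # optional GPU type

After you apply it, inspect the generated DGD spec in the resource's status, for example with kubectl get dynamoGraphDeploymentRequest my-llm-request -n ${NAMESPACE} -o yaml.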

DynamoGraphDeployment (DGD) - Direct Configuration

A lower-level interface that defines your complete inference pipeline:

  • Model configuration
  • Resource allocation (GPUs, memory)
  • Scaling policies
  • Frontend/backend connections

Use this when you need fine-grained control or have already completed profiling.

Refer to the API Reference and Documentation for more details.

πŸ“– API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources, see the API Reference for DynamoGraphDeployment (DGD) and DynamoGraphDeploymentRequest (DGDR).

Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

  • Development / Testing - Use agg.yaml as the base configuration
  • Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference
  • High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability
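For instance, to move the step 3 deployment from the development pattern to load-balanced inference, apply the router variant instead (assuming the router manifest lives alongside agg.yaml in your backend's deploy directory):

$kubectl apply -f examples/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}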

Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

  • Provides OpenAI-compatible /v1/chat/completions endpoint
  • Auto-discovers backend workers via service discovery (Kubernetes-native by default)
  • Routes requests and handles load balancing
  • Validates and preprocesses requests
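For example, with the port-forward from step 3 active, you can exercise the OpenAI-compatible endpoint directly (the model field should match what your worker serves, here the Qwen model from the earlier examples):

$curl http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'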

Customizing Your Deployment

Example structure:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml

Key customization points include (the last two are illustrated in the sketch after this list):

  • Model Configuration: Specify model in the args command
  • Resource Allocation: Configure GPU requirements under resources.limits
  • Scaling: Set replicas for number of worker instances
  • Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in Frontend envs
  • Worker Specialization: Add --is-prefill-worker flag for disaggregated prefill workers
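As a sketch of those last two points, reusing the service layout from the example above: the Frontend gets DYN_ROUTER_MODE=kv in its environment and a second worker runs the same backend command with --is-prefill-worker. The envs list structure and the VllmPrefillWorker service name are assumptions here; verify the exact field names against the API Reference.

services:
  Frontend:
    dynamoNamespace: my-llm
    componentType: frontend
    replicas: 1
    envs:                                # assumed structure; enables KV-cache routing
      - name: DYN_ROUTER_MODE
        value: kv
  VllmPrefillWorker:                     # illustrative name for a disaggregated prefill worker
    dynamoNamespace: my-llm
    componentType: worker
    replicas: 1
    resources:
      limits:
        gpu: "1"
    extraPodSpec:
      mainContainer:
        image: your-image
        command: ["/bin/sh", "-c"]
        args:
          - python3 -m dynamo.vllm --model YOUR_MODEL --is-prefill-worker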

Additional Resources