Deploying Dynamo on Kubernetes
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
Important Terminology
Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
- Examples: `dynamo-system`, `team-a-namespace`
Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.
- Used for: Runtime component communication, service discovery
- Specified in: the `.spec.services.<ServiceName>.dynamoNamespace` field
- Examples: `my-llm`, `production-model`, `dynamo-dev`
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
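For example, a deployment created in the `dynamo-system` Kubernetes namespace can place its services in a `my-llm` Dynamo namespace. A minimal sketch (the `apiVersion` is illustrative; confirm it against the API Reference):

```yaml
apiVersion: nvidia.com/v1alpha1   # illustrative; check the API Reference
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  namespace: dynamo-system        # Kubernetes namespace (isolation, RBAC)
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm     # Dynamo namespace (service discovery)
    Worker:
      dynamoNamespace: my-llm
```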
Prerequisites
Before you begin, ensure you have the following tools installed:
Verify your installation:
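A typical verification, assuming `kubectl` and `helm` are among the required tools (the Installation Guide has the authoritative list):

```shell
# Confirm the CLIs are installed and on your PATH
kubectl version --client
helm version
```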
For detailed installation instructions, see the Prerequisites section in the Installation Guide.
Pre-deployment Checks
Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
This validates kubectl connectivity, StorageClass configuration, and GPU availability. See pre-deployment checks for more details.
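To spot-check these items manually, the equivalent `kubectl` commands are roughly:

```shell
kubectl cluster-info       # kubectl connectivity to the cluster
kubectl get storageclass   # StorageClass configuration
# GPU availability: look for nvidia.com/gpu in node allocatable resources
kubectl describe nodes | grep -i "nvidia.com/gpu"
```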
1. Install Platform First
For Shared/Multi-Tenant Clusters:
If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.
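A minimal installation sketch; the chart references are placeholders, so substitute the names, versions, and values from the Installation Guide:

```shell
# 1. Install the CRDs (chart reference is a placeholder)
helm install dynamo-crds <crds-chart>
# 2. Create the platform namespace
kubectl create namespace dynamo-system
# 3. Install the platform (on shared clusters, add the namespace-restriction flag here)
helm install dynamo-platform <platform-chart> --namespace dynamo-system
```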
2. Choose Your Backend
Each backend has deployment examples and configuration options:
3. Deploy Your First Model
For SLA-based autoscaling, see SLA Planner Guide.
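Deployment itself is a standard `kubectl apply` of the chosen manifest (the file and namespace names here are placeholders):

```shell
kubectl apply -f agg.yaml -n dynamo-system
# Watch the custom resource and its pods come up
kubectl get dynamographdeployment -n dynamo-system
kubectl get pods -n dynamo-system -w
```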
Understanding Dynamo's Custom Resources
Dynamo provides two main Kubernetes Custom Resources for deploying models:
DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)
Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control
Note: DGDR generates a DGD spec which you can then use to deploy.
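A DGDR sketch with hypothetical field names (the real schema is in the API Reference): you state the model, backend, and SLA targets, optionally a GPU type, and read the generated DGD spec back from the resource's status.

```yaml
apiVersion: nvidia.com/v1alpha1   # illustrative
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-llm-request
spec:
  model: Qwen/Qwen3-0.6B          # hypothetical model name
  backend: vllm
  slaTargets:                     # hypothetical field names
    ttft: 200ms                   # time to first token
    itl: 20ms                     # inter-token latency
  gpuType: H100                   # optional
```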
DynamoGraphDeployment (DGD) - Direct Configuration
A lower-level interface that defines your complete inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
Use this when you need fine-grained control or have already completed profiling.
Refer to the API Reference and Documentation for more details.
API Reference & Documentation
For detailed technical specifications of Dynamo's Kubernetes resources:
- API Reference - Complete CRD field specifications for all Dynamo resources
- Create Deployment - Step-by-step deployment creation with DynamoGraphDeployment
- Operator Guide - Dynamo operator configuration and management
Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
- Development / Testing - Use `agg.yaml` as the base configuration
- Production with Load Balancing - Use `agg_router.yaml` to enable scalable, load-balanced inference
- High Performance / Disaggregated - Use `disagg_router.yaml` for maximum throughput and modular scalability
Frontend and Worker Components
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via service discovery (Kubernetes-native by default)
- Routes requests and handles load balancing
- Validates and preprocesses requests
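Once the Frontend is reachable, for example via `kubectl port-forward` to its Service (service name and port below are placeholders), it accepts standard OpenAI-style requests:

```shell
kubectl port-forward svc/<frontend-service> 8000:8000 -n dynamo-system
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```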
Customizing Your Deployment
Example structure:
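A sketch of the DGD shape; field names beyond those documented in this guide are assumptions, so verify them against the API Reference:

```yaml
apiVersion: nvidia.com/v1alpha1   # illustrative
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE   # enable KV-cache routing
          value: kv
    Worker:
      dynamoNamespace: my-llm
      replicas: 2                 # number of worker instances
      resources:
        limits:
          gpu: "1"                # GPU requirement
```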
Worker command examples per backend:
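The launch command differs per backend; the module and flag names below are illustrative and should be checked against each backend's deployment guide:

```yaml
# vLLM worker (args are illustrative)
args: ["python3", "-m", "dynamo.vllm", "--model", "Qwen/Qwen3-0.6B"]
# SGLang worker
args: ["python3", "-m", "dynamo.sglang", "--model-path", "Qwen/Qwen3-0.6B"]
# TensorRT-LLM worker
args: ["python3", "-m", "dynamo.trtllm", "--model-path", "Qwen/Qwen3-0.6B"]
```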
Key customization points include:
- Model Configuration: Specify the model in the `args` command
- Resource Allocation: Configure GPU requirements under `resources.limits`
- Scaling: Set `replicas` for the number of worker instances
- Routing Mode: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend `envs`
- Worker Specialization: Add the `--is-prefill-worker` flag for disaggregated prefill workers
Additional Resources
- Examples - Complete working examples
- Create Custom Deployments - Build your own CRDs
- Managing Models with DynamoModel - Deploy LoRA adapters and manage models
- Operator Documentation - How the platform works
- Service Discovery - Discovery backends and configuration
- Helm Charts - For advanced users
- GitOps Deployment with FluxCD - For advanced users
- Logging - For logging setup
- Multinode Deployment - For multinode deployment
- Grove - For Grove details and custom installation
- Monitoring - For monitoring setup
- Model Caching with Fluid - For model caching with Fluid