Dynamo Operator
Overview
The Dynamo Operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster's actual state converges on the desired state you declare. It is aimed at users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
Architecture
- Operator Deployment: Deployed as a Kubernetes Deployment in a specific namespace.
- Controllers:
  - DynamoGraphDeploymentController: Watches DynamoGraphDeployment CRs and orchestrates graph deployments.
  - DynamoComponentDeploymentController: Watches DynamoComponentDeployment CRs and handles individual component deployments.
  - DynamoModelController: Watches DynamoModel CRs and manages model lifecycle (e.g., loading LoRA adapters).
- Workflow:
  - A custom resource is created by the user or API server.
  - The corresponding controller detects the change and runs reconciliation.
  - Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
  - Status fields are updated to reflect the current state.
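The workflow above can be illustrated with a small, self-contained simulation. This is only a sketch of the general reconcile pattern, not the operator's actual code: the `CustomResource` class, the `reconcile` function, the `"Kind/name"` keys, and the status field names are all hypothetical, and the "cluster" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class CustomResource:
    """Hypothetical stand-in for a CR: a name, a desired spec, and a status."""
    name: str
    spec: dict                       # desired child resources, keyed by "Kind/name"
    status: dict = field(default_factory=dict)


def reconcile(cr, cluster):
    """Bring `cluster` (actual state) in line with `cr.spec` (desired state).

    Returns the list of (action, key) pairs that were applied.
    """
    actions = []
    # Create or update every child resource the spec asks for.
    for key, desired in cr.spec.items():
        if key not in cluster:
            cluster[key] = dict(desired)
            actions.append(("create", key))
        elif cluster[key] != desired:
            cluster[key] = dict(desired)
            actions.append(("update", key))
    # Write the observed state back to the CR's status.
    cr.status = {
        "ready": all(cluster.get(k) == v for k, v in cr.spec.items()),
        "managedResources": sorted(cr.spec),
    }
    return actions


# Demo: the first pass creates resources, the second pass changes nothing.
cluster = {}
cr = CustomResource(
    name="demo",
    spec={"Deployment/frontend": {"replicas": 2}, "Service/frontend": {"port": 8000}},
)
print(reconcile(cr, cluster))
print(reconcile(cr, cluster))
print(cr.status)
```

The point of the sketch is idempotence: running reconciliation again on an already-converged state produces no actions, which is what lets a controller safely re-run on every change event.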
Deployment Modes
The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:
1. Cluster-Wide Mode (Default)
The operator monitors and manages DynamoGraph resources across all namespaces in the cluster.
When to Use:
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
2. Namespace-Scoped Mode
The operator monitors and manages DynamoGraph resources only in a specific namespace. A lease marker is created to signal the operator’s presence to any cluster-wide operators.
When to Use:
- You’re on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
Installation:
3. Hybrid Mode
A cluster-wide operator manages most namespaces, while one or more namespace-scoped operators run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
When to Use:
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
How It Works:
- Namespace-scoped operator creates a lease named dynamo-operator-namespace-scope in its namespace
- Cluster-wide operator watches for these lease markers across all namespaces
- Cluster-wide operator automatically excludes any namespace with a lease marker
- If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
- Cluster-wide operator automatically resumes managing that namespace
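The lease-based hand-off described above can be modelled in a few lines. The following is a simplified sketch, not the operator's implementation: the `Lease` class and both helper functions are hypothetical, time is passed in explicitly as a number of seconds, and a lease is assumed to count as expired once a full TTL has elapsed since its last renewal. Only the lease name and the 30-second default come from the text above.

```python
from dataclasses import dataclass

LEASE_NAME = "dynamo-operator-namespace-scope"  # lease name from the text above
LEASE_TTL_SECONDS = 30.0                         # default TTL from the text above


@dataclass
class Lease:
    """Hypothetical, minimal view of a lease marker."""
    namespace: str
    name: str
    renewed_at: float  # seconds; time of the holder's last renewal


def excluded_namespaces(leases, now, ttl=LEASE_TTL_SECONDS):
    """Namespaces that hold a live marker lease at time `now`."""
    return {
        lease.namespace
        for lease in leases
        if lease.name == LEASE_NAME and now - lease.renewed_at < ttl
    }


def should_reconcile(namespace, leases, now):
    """True if the cluster-wide operator should manage `namespace`."""
    return namespace not in excluded_namespaces(leases, now)


# Demo: "staging" is skipped while its lease is fresh and picked up again
# after the namespace-scoped operator stops renewing it.
leases = [Lease("staging", LEASE_NAME, renewed_at=100.0)]
print(should_reconcile("staging", leases, now=110.0))
print(should_reconcile("prod", leases, now=110.0))
print(should_reconcile("staging", leases, now=131.0))
```

Because the cluster-wide operator only needs to read the marker, the namespace-scoped operator never has to coordinate with it directly; stopping renewals is enough to hand the namespace back.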
Setup Example:
Observability:
Custom Resource Definitions (CRDs)
Dynamo provides the following Custom Resources:
- DynamoGraphDeployment (DGD): Deploys complete inference pipelines
- DynamoComponentDeployment (DCD): Deploys individual components
- DynamoModel: Manages model lifecycle (e.g., loading LoRA adapters)
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
For a user-focused guide on deploying and managing models with DynamoModel, see:
📖 Managing Models with DynamoModel Guide
Webhooks
The Dynamo Operator uses Kubernetes admission webhooks for real-time validation of custom resources before they are persisted to the cluster. Webhooks are enabled by default and ensure that invalid configurations are rejected immediately at the API server level.
Key Features:
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation (for testing/development)
- ✅ cert-manager integration (for production)
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
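To make the immutability check concrete, here is a minimal sketch of the kind of comparison a validating webhook performs on an update request. It is an illustration only: the function names, the dotted-path convention, and the example field `model.name` are hypothetical and do not reflect the operator's actual list of immutable fields.

```python
def get_path(obj, dotted_path):
    """Look up 'a.b.c' in nested dicts; missing keys yield None."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def validate_update(old_spec, new_spec, immutable_fields):
    """Return denial messages; an empty list means the update is admitted."""
    errors = []
    for path in immutable_fields:
        if get_path(old_spec, path) != get_path(new_spec, path):
            errors.append(f"spec.{path} is immutable")
    return errors


# Demo with a hypothetical immutable field "model.name".
old = {"model": {"name": "base"}, "replicas": 1}
allowed = {"model": {"name": "base"}, "replicas": 3}   # only a mutable field changed
denied = {"model": {"name": "other"}, "replicas": 1}   # immutable field changed

print(validate_update(old, allowed, ["model.name"]))
print(validate_update(old, denied, ["model.name"]))
```

Rejecting the request at admission time means the invalid object is never persisted, so controllers never have to reconcile a spec that violates these constraints.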
For complete documentation on webhooks, certificate management, and troubleshooting, see:
Observability
The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. This allows you to monitor:
- Controller Performance: Reconciliation loop duration, success rates, and error rates by resource type
- Webhook Activity: Validation performance, admission rates, and denial patterns
- Resource Inventory: Current count of managed resources by state and namespace
- Operational Health: Success rates and health indicators for controllers and webhooks
Metrics Collection
Metrics are automatically exposed on the operator’s /metrics endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by metricsService.enabled, which defaults to true).
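As a rough illustration of what can be derived from the scraped data, the sketch below parses Prometheus text-format samples and computes a per-controller reconcile error ratio. It assumes the operator exposes the standard controller-runtime counter `controller_runtime_reconcile_total` with `controller` and `result` labels, which is typical for Kubebuilder-based operators but is not confirmed here; the sample values and controller label values are invented for the example, and the parser handles only simple labelled samples.

```python
import re
from collections import defaultdict

# Invented sample in Prometheus text exposition format (assumed metric name).
SAMPLE = """\
# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="dynamographdeployment",result="success"} 95
controller_runtime_reconcile_total{controller="dynamographdeployment",result="error"} 5
controller_runtime_reconcile_total{controller="dynamomodel",result="success"} 10
"""

# metric_name{label="value",...} <number>
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def error_ratio_by_controller(text, metric="controller_runtime_reconcile_total"):
    """Return {controller: errors / total} for the given counter."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match.group(1) != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(LABEL_PAIR.findall(match.group(2)))
        value = float(match.group(3))
        controller = labels.get("controller", "")
        totals[controller] += value
        if labels.get("result") == "error":
            errors[controller] += value
    return {c: errors[c] / totals[c] for c in totals if totals[c] > 0}


print(error_ratio_by_controller(SAMPLE))
```

In practice this kind of ratio would normally be computed in Prometheus or Grafana over a time window rather than from raw counter totals; the script only shows how the labels map onto the "error rates by resource type" view.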
Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes:
- Reconciliation Metrics: Rate, duration (P95), and errors by resource type
- Webhook Metrics: Request rate, duration (P95), and denials by resource type and operation
- Resource Inventory: Count of DynamoGraphDeployments by state and namespace
- Operational Health: Success rate gauges for controllers and webhooks
For complete setup instructions and metrics reference, see:
Installation
Quick Install with Helm
Note: For shared/multi-tenant clusters or testing scenarios, see Deployment Modes above for namespace-scoped and hybrid configurations.
Building from Source
For detailed installation options, see the Installation Guide.
Development
- Code Structure:
  The operator is built using Kubebuilder and the operator-sdk, with the following structure:
  - controllers/: Reconciliation logic
  - api/v1alpha1/: CRD types
  - config/: Manifests and Helm charts