---
title: Webhooks
---
This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
  - [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
  - [Certificate Management Options](#certificate-management-options)
  - [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
  - [Automatic Certificates (Default)](#automatic-certificates-default)
  - [cert-manager Integration](#cert-manager-integration)
  - [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
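- [Troubleshooting](#troubleshooting)
- [Best Practices](#best-practices)
- [Additional Resources](#additional-resources)
- [Support](#support)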
---
## Overview
The Dynamo Operator uses **Kubernetes admission webhooks** to validate custom resources in real time. The operator currently implements **validation webhooks** only: invalid configurations are rejected at the API server before they are persisted, which gives users faster feedback than controller-based validation.
All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features
- ✅ **Enabled by default** - Zero-touch validation out of the box
- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- ✅ **Automatic certificate generation** - No manual certificate management required
- ✅ **Defense in depth** - Controllers validate when webhooks are disabled
- ✅ **cert-manager integration** - Optional integration for automated certificate lifecycle
- ✅ **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules
### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
  - `DynamoComponentDeployment` validation
  - `DynamoGraphDeployment` validation
  - `DynamoModel` validation
Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│  API Server                                                     │
│  1. User submits CR (kubectl apply)                             │
│  2. API server calls ValidatingWebhookConfiguration             │
└────────────────────────┬────────────────────────────────────────┘
                         │ HTTPS (TLS required)
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  Webhook Server (in Operator Pod)                               │
│  3. Validates CR against business rules                         │
│  4. Returns admit/deny decision + warnings                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  API Server                                                     │
│  5. If admitted: Persist CR to etcd                             │
│  6. If denied: Return error to user                             │
└─────────────────────────────────────────────────────────────────┘
```
### Validation Flow
1. **Webhook validation** (if enabled): Validates at API server level
2. **CEL validation**: Kubernetes-native immutability checks (always active)
3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
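One way to exercise the API-server-side stages of this flow without persisting anything is a server-side dry run. The sketch below assumes the operator's webhooks are registered as free of side effects (the usual setting for validating webhooks), so the API server still calls them for dry-run requests. The function name and the manifest file name are illustrative, not part of the operator.

```bash
#!/usr/bin/env bash
# Submit a manifest as a server-side dry run: the API server runs schema,
# CEL, and admission webhook validation, but does not persist the object.
# Assumption: the webhook declares no side effects; otherwise the API server
# rejects dry-run requests instead of calling the webhook.
dry_run_validate() {
  local manifest="$1"
  kubectl apply --dry-run=server -f "${manifest}"
}

# Usage (hypothetical file name):
#   dry_run_validate my-graph-deployment.yaml
```

A rejection printed by this command comes from the same admission path that a real `kubectl apply` would hit.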
---
## Configuration
### Enabling/Disabling Webhooks
Webhooks are **enabled by default**. To disable them:
```yaml
# Platform-level values.yaml
dynamo-operator:
  webhook:
    enabled: false
```
**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
---
### Certificate Management Options
The operator supports three certificate management modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
| **External** | Bring your own certificates | Production deployments with custom PKI |
---
### Advanced Configuration
#### Complete Configuration Reference
```yaml
dynamo-operator:
  webhook:
    # Enable/disable validation webhooks
    enabled: true
    # Certificate management
    certManager:
      enabled: false
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
    # Certificate secret configuration
    certificateSecret:
      name: webhook-server-cert
      external: false
    # Certificate validity period (automatic generation only)
    certificateValidity: 3650  # 3650 days = 10 years
    # Certificate generator image (automatic generation only)
    certGenerator:
      image:
        repository: bitnami/kubectl
        tag: latest
    # Webhook behavior configuration
    failurePolicy: Fail   # Fail (reject on error) or Ignore (allow on error)
    timeoutSeconds: 10    # Webhook timeout
    # Namespace filtering (advanced)
    namespaceSelector: {} # Kubernetes label selector for namespaces
```
#### Failure Policy
```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail

# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```
**Recommendation:** Use `Fail` in production so that validation is always enforced. Use `Ignore` only when keeping the API available matters more than strict validation and you can tolerate an occasional invalid resource being admitted while the webhook is unreachable.
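To confirm which policy is actually in effect on a cluster, the live configuration can be read back. This is a minimal sketch: the function name is illustrative, and the configuration name is passed in rather than assumed.

```bash
#!/usr/bin/env bash
# Print each webhook entry's name and failurePolicy, one per line,
# for the given ValidatingWebhookConfiguration.
show_failure_policy() {
  local config_name="$1"
  kubectl get validatingwebhookconfiguration "${config_name}" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
}

# Usage: find the configuration name first, then inspect it.
#   kubectl get validatingwebhookconfiguration | grep dynamo
#   show_failure_policy "name-from-the-previous-command"
```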
#### Namespace Filtering
Control which namespaces are validated (applies to **cluster-wide operator** only):
```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled

# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```
For **namespace-restricted operators**, the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
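For the opt-in (`matchLabels`) variant of the selector, a namespace only becomes subject to validation once it carries the label. A small sketch of labeling a namespace and listing the ones the selector would match; the label key and value are taken from the example above, and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Opt a namespace in to validation for the matchLabels selector example.
enable_validation_for_namespace() {
  local namespace="$1"
  kubectl label namespace "${namespace}" dynamo-validation=enabled --overwrite
}

# List the namespaces that the matchLabels selector would currently match.
list_validated_namespaces() {
  kubectl get namespaces -l dynamo-validation=enabled
}

# Usage:
#   enable_validation_for_namespace team-a
#   list_validated_namespaces
```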
---
## Certificate Management
### Automatic Certificates (Default)
**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
#### How It Works
1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
   - Root CA (valid 10 years)
   - Server certificate (valid 10 years)
   - Stores in Secret: `<release>-webhook-server-cert`
2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
   - Reads `ca.crt` from Secret
   - Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
- **Automatic rotation**: Each `helm upgrade` checks the existing certificates and regenerates them only when they are missing, expire within 30 days, or have incorrect SANs
#### Smart Certificate Generation
The certificate generation hook avoids unnecessary regeneration:
- ✅ **Checks existing certificates** before generating new ones
- ✅ **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
- ✅ **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
This means:
- Fast `helm upgrade` operations (no unnecessary cert generation)
- Safe to run `helm upgrade` frequently
- Certificates persist across reinstalls (stored in Secret)
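The reuse-or-regenerate decision described above can be approximated locally with `openssl`. This is a sketch of the idea, not the hook's actual script: the function name is hypothetical, and the 30-day threshold and SAN check simply mirror the criteria listed in this section.

```bash
#!/usr/bin/env bash
# Return 0 if the certificate can be reused, non-zero if it should be
# regenerated (missing file, less than 30 days left, or expected SAN absent).
cert_is_reusable() {
  local cert_file="$1" expected_san="$2"
  local min_seconds=$((30 * 24 * 3600))

  # A missing file always means "regenerate".
  [ -f "${cert_file}" ] || return 1

  # -checkend exits 0 only if the cert is still valid min_seconds from now.
  openssl x509 -in "${cert_file}" -noout -checkend "${min_seconds}" >/dev/null || return 1

  # Look for the expected DNS entry in the certificate text.
  # Note: this is a substring match, so a longer name that starts with the
  # expected one also passes.
  openssl x509 -in "${cert_file}" -noout -text | grep -Fq "DNS:${expected_san}"
}

# Usage (file and service names are placeholders):
#   cert_is_reusable tls.crt my-webhook-service.my-namespace.svc && echo "reuse"
```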
#### Manual Certificate Rotation
If you need to rotate certificates manually:
```bash
# Delete the certificate secret
kubectl delete secret <release>-webhook-server-cert -n <namespace>
# Upgrade the release to regenerate certificates
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### cert-manager Integration
For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
#### Prerequisites
1. **cert-manager installed** (v1.0+)
2. **CA issuer configured** (e.g., `selfsigned-issuer`)
#### Configuration
```yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer              # Or ClusterIssuer
        name: selfsigned-issuer   # Your issuer name
```
#### How It Works
1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
2. **cert-manager generates certificate**: Based on configured issuer
3. **cert-manager stores in Secret**: `<release>-webhook-server-cert`
4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
5. **Operator pod**: Mounts certificate secret and serves webhook
#### Benefits Over Automatic Mode
- ✅ **Automated rotation**: cert-manager renews certificates before expiration
- ✅ **Custom validity periods**: Configure certificate lifetime
- ✅ **CA rotation support**: ca-injector handles CA updates automatically
- ✅ **Integration with existing PKI**: Use your organization's certificate infrastructure
#### Certificate Rotation
With cert-manager, certificate rotation is **fully automated**:
1. **Leaf certificate rotation** (default: every year)
   - cert-manager auto-renews before expiration
   - controller-runtime auto-reloads new certificate
   - **No pod restart required**
   - **No caBundle update required** (same Root CA)
2. **Root CA rotation** (every 10 years)
   - cert-manager rotates Root CA
   - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
   - **No manual intervention required**
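To see where a certificate stands in this rotation cycle, the status of the cert-manager `Certificate` resource can be queried. A minimal sketch, assuming cert-manager's standard `status.notAfter` and `status.renewalTime` fields; the function name is illustrative and the Certificate name is passed in because this document does not state what the chart calls it.

```bash
#!/usr/bin/env bash
# Print the expiry time and the scheduled renewal time of a cert-manager
# Certificate, separated by a tab.
show_cert_renewal() {
  local cert_name="$1" namespace="$2"
  kubectl get certificate "${cert_name}" -n "${namespace}" \
    -o jsonpath='{.status.notAfter}{"\t"}{.status.renewalTime}{"\n"}'
}

# Usage: list Certificates to find the name, then inspect it.
#   kubectl get certificate -n dynamo-system
#   show_cert_renewal "name-from-the-previous-command" dynamo-system
```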
#### Example: Self-Signed Issuer
```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: dynamo-system
spec:
  selfSigned: {}
---
# Enable in platform values.yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
```
---
### External Certificates
Bring your own certificates for custom PKI requirements.
#### Steps
1. **Create certificate secret manually**:
```bash
kubectl create secret tls <release>-webhook-server-cert \
  --cert=tls.crt \
  --key=tls.key \
  -n <namespace>
# Also add ca.crt to the secret
kubectl patch secret <release>-webhook-server-cert -n <namespace> \
  --type='json' \
  -p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
```
2. **Configure operator to use external secret**:
```yaml
dynamo-operator:
  webhook:
    certificateSecret:
      external: true
    caBundle: <base64-encoded-ca-cert>  # Must manually specify
```
3. **Deploy operator**:
```bash
helm install dynamo-platform . -n <namespace> -f values.yaml
```
#### Certificate Requirements
- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
  - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
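Before deploying with an external secret, it can help to confirm that all three required keys are present. The sketch below is one way to do that; the function name is hypothetical, and the secret name and namespace are passed in rather than assumed.

```bash
#!/usr/bin/env bash
# Report any required key that is missing or empty in the secret.
# Returns 0 only when tls.crt, tls.key, and ca.crt are all present.
check_webhook_secret_keys() {
  local secret_name="$1" namespace="$2" missing=0 key value
  # Dots inside a key name must be escaped in kubectl's JSONPath syntax.
  for key in 'tls\.crt' 'tls\.key' 'ca\.crt'; do
    value="$(kubectl get secret "${secret_name}" -n "${namespace}" \
      -o jsonpath="{.data.${key}}")"
    if [ -z "${value}" ]; then
      # Strip the escaping backslash before printing the key name.
      echo "missing key: ${key//\\/}"
      missing=1
    fi
  done
  return "${missing}"
}

# Usage (names are placeholders):
#   check_webhook_secret_keys webhook-server-cert dynamo-system
```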
---
## Multi-Operator Deployments
The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
### Scenario
```
Cluster:
├─ Operator A (cluster-wide, namespace: platform-system)
│ └─ Validates all namespaces EXCEPT team-a
└─ Operator B (namespace-restricted, namespace: team-a)
└─ Validates only team-a namespace
```
### How It Works
1. **Namespace-restricted operator** creates a Lease in its namespace
2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
3. **Cluster-wide operator** skips validation for namespaces with active Leases
4. **Namespace-restricted operator** validates resources in its namespace
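The Lease itself can be inspected to see which operator currently holds a namespace. A minimal sketch: the Lease name comes from the steps above, the field names are the standard `coordination.k8s.io/v1` Lease spec fields, and the function name is illustrative.

```bash
#!/usr/bin/env bash
# Show the holder, last renewal time, and duration (seconds) of the
# coordination Lease in a namespace, separated by tabs.
show_namespace_lock() {
  local namespace="$1"
  kubectl get lease dynamo-operator-ns-lock -n "${namespace}" \
    -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.renewTime}{"\t"}{.spec.leaseDurationSeconds}{"\n"}'
}

# Usage:
#   show_namespace_lock team-a
```

If the command reports that the Lease is not found, no namespace-restricted operator is claiming that namespace.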
### Lease Configuration
The lease mechanism is **automatically configured** based on deployment mode:
```yaml
# Cluster-wide operator (default)
namespaceRestriction:
  enabled: false
# → Watches for leases in all namespaces
# → Skips validation for namespaces with active leases

# Namespace-restricted operator
namespaceRestriction:
  enabled: true
  namespace: team-a
# → Creates lease in team-a namespace
# → Does NOT check for leases (no cluster permissions)
```
### Deployment Example
```bash
# 1. Deploy cluster-wide operator
helm install platform-operator dynamo-platform \
-n platform-system \
--set namespaceRestriction.enabled=false
# 2. Deploy namespace-restricted operator for team-a
helm install team-a-operator dynamo-platform \
-n team-a \
--set namespaceRestriction.enabled=true \
--set namespaceRestriction.namespace=team-a
```
### ValidatingWebhookConfiguration Naming
The webhook configuration name reflects the deployment mode:
- **Cluster-wide**: `<release>-validating`
- **Namespace-restricted**: `<release>-validating-<namespace>`
Example:
```bash
# Cluster-wide
platform-operator-validating
# Namespace-restricted (team-a)
team-a-operator-validating-team-a
```
This allows multiple webhook configurations to coexist without conflicts.
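Judging from the examples above, the name appears to be the release name plus `-validating`, with the namespace appended in namespace-restricted mode. The helper below encodes that inferred pattern so scripts can look up the right configuration; the function is hypothetical and not part of the chart.

```bash
#!/usr/bin/env bash
# Build the ValidatingWebhookConfiguration name for a release.
# Pass a namespace only for namespace-restricted deployments.
webhook_config_name() {
  local release="$1" namespace="${2:-}"
  if [ -n "${namespace}" ]; then
    printf '%s-validating-%s\n' "${release}" "${namespace}"
  else
    printf '%s-validating\n' "${release}"
  fi
}

webhook_config_name platform-operator        # platform-operator-validating
webhook_config_name team-a-operator team-a   # team-a-operator-validating-team-a
```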
### Lease Health
If the namespace-restricted operator is deleted or becomes unhealthy:
- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
- Cluster-wide operator automatically resumes validation for that namespace
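The expiry rule is simple arithmetic on timestamps. The sketch below illustrates it with made-up numbers; the function name, the 15-second duration, and the 15-second grace period are assumptions chosen for the example, not values read from the operator.

```bash
#!/usr/bin/env bash
# Return 0 if a lease should be treated as expired at time now_epoch.
# All arguments are in seconds; renew_epoch and now_epoch are Unix timestamps.
lease_expired() {
  local renew_epoch="$1" lease_duration="$2" grace_period="$3" now_epoch="$4"
  [ "${now_epoch}" -gt $((renew_epoch + lease_duration + grace_period)) ]
}

# Illustrative values: renewed at t=1000, 15s duration, 15s grace period.
lease_expired 1000 15 15 1020 || echo "lease still active at t=1020"
lease_expired 1000 15 15 1031 && echo "lease expired at t=1031"
```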
---
## Troubleshooting
### Webhook Not Called
**Symptoms:**
- Invalid resources are accepted
- No validation errors in logs
**Checks:**
1. **Verify webhook is enabled**:
```bash
kubectl get validatingwebhookconfiguration | grep dynamo
```
2. **Check webhook configuration**:
```bash
kubectl get validatingwebhookconfiguration -o yaml
# Verify:
# - caBundle is present and non-empty
# - clientConfig.service points to correct service
# - webhooks[].namespaceSelector matches your namespace
```
3. **Verify webhook service exists**:
```bash
kubectl get service -n <namespace> | grep webhook
```
4. **Check operator logs for webhook startup**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
# Should see: "Starting webhook server"
```
---
### Connection Refused Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
```
**Checks:**
1. **Verify operator pod is running**:
```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
```
2. **Check webhook server is listening**:
```bash
# Port-forward to pod
kubectl port-forward -n <namespace> pod/<operator-pod-name> 9443:9443
# In another terminal, test connection
curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
# Should NOT get "connection refused"
```
3. **Verify webhook port in deployment**:
```bash
kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
```
4. **Check for webhook initialization errors**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
```
---
### Certificate Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
x509: certificate signed by unknown authority
```
**Checks:**
1. **Verify caBundle is present**:
```bash
kubectl get validatingwebhookconfiguration <webhook-config-name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
# Should output a valid PEM certificate
```
2. **Verify certificate secret exists**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
3. **Check certificate validity**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
# Check:
# - Not expired
# - SAN includes: <service-name>.<namespace>.svc
```
4. **Check CA injection job logs**:
```bash
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<suffix>
```
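An "unknown authority" error often means the CA bundle registered with the API server no longer matches the CA in the certificate secret. The sketch below compares the two directly. It assumes the bundle is injected verbatim from the secret's `ca.crt` key, as described for the automatic mode; the function name is illustrative and all resource names are passed in.

```bash
#!/usr/bin/env bash
# Compare the CA bundle the API server trusts with the CA stored in the secret.
# Both values are base64-encoded PEM, so they can be compared as strings.
# Returns 0 only when the bundle is non-empty and identical to the secret's CA.
ca_bundle_matches_secret() {
  local config_name="$1" secret_name="$2" namespace="$3" bundle secret_ca
  bundle="$(kubectl get validatingwebhookconfiguration "${config_name}" \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}')"
  secret_ca="$(kubectl get secret "${secret_name}" -n "${namespace}" \
    -o jsonpath='{.data.ca\.crt}')"
  [ -n "${bundle}" ] && [ "${bundle}" = "${secret_ca}" ]
}

# Usage (names are placeholders):
#   ca_bundle_matches_secret my-config my-secret my-namespace \
#     || echo "caBundle and secret CA differ"
```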
---
### Helm Hook Job Failures
**Symptoms:**
- `helm install` or `helm upgrade` hangs or fails
- Certificate generation errors
**Checks:**
1. **List hook jobs**:
```bash
kubectl get jobs -n <namespace> | grep webhook
```
2. **Check job logs**:
```bash
# Certificate generation
kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<suffix>
# CA injection
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<suffix>
```
3. **Check RBAC permissions**:
```bash
# Verify ServiceAccount exists
kubectl get sa -n <namespace> <release>-webhook-ca-inject
# Verify ClusterRole and ClusterRoleBinding exist
kubectl get clusterrole <release>-webhook-ca-inject
kubectl get clusterrolebinding <release>-webhook-ca-inject
```
4. **Manual cleanup**:
```bash
# Delete failed jobs
kubectl delete job -n <namespace> <release>-webhook-cert-gen-<suffix>
kubectl delete job -n <namespace> <release>-webhook-ca-inject-<suffix>
# Retry helm upgrade
helm upgrade <release> dynamo-platform -n <namespace>
```
---
### Validation Errors Not Clear
**Symptoms:**
- Webhook rejects resource but error message is unclear
**Solution:**
Check operator logs for detailed validation errors:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
```
Webhook logs include:
- Resource name and namespace
- Validation errors with context
- Warnings for immutable field changes
---
### Stuck Deleting Resources
**Symptoms:**
- Resource stuck in "Terminating" state
- Webhook blocks finalizer removal
**Solution:**
The webhook automatically skips validation for resources being deleted. If stuck:
1. **Check if webhook is blocking**:
```bash
kubectl describe <resource-type> <resource-name> -n <namespace>
# Look for events mentioning webhook errors
```
2. **Temporarily disable webhook**:
```bash
# Option 1: Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <webhook-config-name>
# Option 2: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <webhook-config-name> \
--type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
```
3. **Delete resource again**:
```bash
kubectl delete <resource-type> <resource-name> -n <namespace>
```
4. **Re-enable webhook**:
```bash
helm upgrade <release> dynamo-platform -n <namespace>
```
---
## Best Practices
### Production Deployments
1. ✅ **Keep webhooks enabled** (default) for real-time validation
2. ✅ **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
3. ✅ **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
4. ✅ **Use cert-manager** for automated certificate lifecycle in large deployments
5. ✅ **Test webhook configuration** in staging before production
### Development Deployments
1. ✅ **Disable webhooks** for rapid iteration if needed
2. ✅ **Use `failurePolicy: Ignore`** if webhook availability is problematic
3. ✅ **Keep automatic certificates** (simpler than cert-manager for dev)
### Multi-Tenant Deployments
1. ✅ **Deploy one cluster-wide operator** for platform-wide validation
2. ✅ **Deploy namespace-restricted operators** for tenant-specific namespaces
3. ✅ **Monitor lease health** to ensure coordination works correctly
4. ✅ **Use unique release names** per namespace to avoid naming conflicts
---
## Additional Resources
- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
---
## Support
For issues or questions:
- Check [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
- Open an issue on GitHub