Operator Metrics | NVIDIA Dynamo Documentation

Overview

The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:

Controller Reconciliation: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
Webhook Validation: Performance and outcomes of admission webhook requests
Resource Inventory: Current count of managed resources by state and namespace

Prerequisites

The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the Kubernetes Metrics Guide.

Quick checklist:

✅ kube-prometheus-stack installed (for ServiceMonitor support)
✅ Prometheus and Grafana running
✅ Dynamo Operator installed via Helm

Metrics Collection

ServiceMonitor

Operator metrics are collected automatically through a ServiceMonitor, which the Helm chart creates when metricsService.enabled is true (the default).

Unlike application metrics (which use PodMonitor), the operator uses a ServiceMonitor and requires no manual RBAC configuration. The operator’s kube-rbac-proxy sidecar is configured with --ignore-paths=/metrics, so Prometheus can scrape that path without an authentication or authorization check.

To verify the ServiceMonitor is created:

$ kubectl get servicemonitor -n dynamo-system

Disabling Metrics Collection

To disable operator metrics collection:

$ helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
>   --namespace dynamo-system \
>   --set dynamo-operator.metricsService.enabled=false

Available Metrics

All metrics use the dynamo_operator namespace prefix.

Reconciliation Metrics

Metric	Type	Labels	Description
`dynamo_operator_reconcile_duration_seconds`	Histogram	`resource_type`, `namespace`, `result`	Duration of reconciliation loops
`dynamo_operator_reconcile_total`	Counter	`resource_type`, `namespace`, `result`	Total number of reconciliations
`dynamo_operator_reconcile_errors_total`	Counter	`resource_type`, `namespace`, `error_type`	Total reconciliation errors by type

Labels:

resource_type: DynamoGraphDeployment, DynamoComponentDeployment, DynamoModel, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter
namespace: Target namespace of the resource
result: success, error, requeue
error_type: not_found, already_exists, conflict, validation, bad_request, unauthorized, forbidden, timeout, server_timeout, unavailable, rate_limited, internal
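To make the label sets above concrete, the following minimal sketch parses a few samples in the Prometheus text exposition format. The sample values and the `team-a` namespace are made up for illustration, and `parse_samples` is a hypothetical helper, not part of the operator. It handles only simple labeled lines without escaped quotes.

```python
import re

# Hypothetical scrape output. Label names and values follow the lists above;
# the counts and the "team-a" namespace are invented for illustration.
SAMPLE = """\
# TYPE dynamo_operator_reconcile_total counter
dynamo_operator_reconcile_total{resource_type="DynamoGraphDeployment",namespace="team-a",result="success"} 42
dynamo_operator_reconcile_total{resource_type="DynamoGraphDeployment",namespace="team-a",result="error"} 3
dynamo_operator_reconcile_errors_total{resource_type="DynamoGraphDeployment",namespace="team-a",error_type="conflict"} 3
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple labeled samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # this sketch ignores unlabeled or unusual lines
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


samples = parse_samples(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```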

Webhook Metrics

Metric	Type	Labels	Description
`dynamo_operator_webhook_duration_seconds`	Histogram	`resource_type`, `operation`	Duration of webhook validation requests
`dynamo_operator_webhook_requests_total`	Counter	`resource_type`, `operation`, `result`	Total webhook admission requests
`dynamo_operator_webhook_denials_total`	Counter	`resource_type`, `operation`, `reason`	Total webhook denials with reasons

Labels:

resource_type: Same as reconciliation metrics
operation: CREATE, UPDATE, DELETE
result: allowed, denied
reason: Validation failure reason (e.g., immutable_field_changed, invalid_config)

Resource Inventory Metrics

Metric	Type	Labels	Description
`dynamo_operator_resources_total`	Gauge	`resource_type`, `namespace`, `status`	Current count of resources by state

Labels:

resource_type: DynamoGraphDeployment, DynamoComponentDeployment, DynamoModel, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter
namespace: Resource namespace
status: Resource state derived from each CRD’s status. Common values:
- "ready" - Resource is healthy and operational (DCD, DM, DGDSA)
- "not_ready" - Resource exists but is not operational (DCD, DM, DGDSA)
- "unknown" - State cannot be determined (default for empty status)
- DGD uses: "pending", "successful", "failed" from .status.state
- DGDR uses: "Pending", "Profiling", "Deploying", "Ready", "DeploymentDeleted", "Failed" from .status.state

Example Queries

Reconciliation Performance

# P95 reconciliation duration by resource type
histogram_quantile(0.95,
  sum by (resource_type, le) (
    rate(dynamo_operator_reconcile_duration_seconds_bucket[5m])
  )
)

# Reconciliation rate by result
sum by (resource_type, result) (
  rate(dynamo_operator_reconcile_total[5m])
)

# Error rate by type
sum by (resource_type, error_type) (
  rate(dynamo_operator_reconcile_errors_total[5m])
)
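The P95 query above estimates a quantile from cumulative histogram buckets. The sketch below is a simplified illustration of that idea: find the bucket that contains the target rank, then interpolate linearly inside it. It is not the Prometheus implementation, and the bucket boundaries and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: locate the bucket containing the target rank and
    interpolate linearly between its lower and upper bounds.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # fall back to the last known lower bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for reconcile durations (seconds).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
print(p95)
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the real latencies.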

Webhook Performance

# Webhook P95 latency
histogram_quantile(0.95,
  sum by (resource_type, le) (
    rate(dynamo_operator_webhook_duration_seconds_bucket[5m])
  )
)

# Webhook denial rate
sum by (resource_type, operation, reason) (
  rate(dynamo_operator_webhook_denials_total[5m])
)

Resource Inventory

# Total resources by type and state
sum by (resource_type, status) (
  dynamo_operator_resources_total
)

# DynamoGraphDeployments by state
sum by (status) (
  dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"}
)

# All resources by namespace and state
sum by (resource_type, namespace, status) (
  dynamo_operator_resources_total
)
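The `sum by (...)` aggregations above keep only the listed labels and add up every series that shares them. As a rough illustration, the sketch below applies the same grouping to a handful of gauge samples in plain Python. The sample values and namespaces are invented, and `sum_by` is a hypothetical helper.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum sample values grouped by the labels in `keep`, dropping the rest."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical dynamo_operator_resources_total samples (labels, value).
inventory = [
    ({"resource_type": "DynamoGraphDeployment", "namespace": "team-a", "status": "successful"}, 2),
    ({"resource_type": "DynamoGraphDeployment", "namespace": "team-b", "status": "successful"}, 1),
    ({"resource_type": "DynamoGraphDeployment", "namespace": "team-b", "status": "failed"}, 1),
    ({"resource_type": "DynamoModel", "namespace": "team-a", "status": "ready"}, 4),
]

# Equivalent in spirit to: sum by (resource_type, status) (dynamo_operator_resources_total)
by_type_and_status = sum_by(inventory, ("resource_type", "status"))
for key, count in sorted(by_type_and_status.items()):
    print(key, count)
```

Dropping the namespace label merges the two `team-a` and `team-b` series for successful DynamoGraphDeployments into a single total.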

Grafana Dashboard

A pre-built Grafana dashboard is available for visualizing operator metrics.

Dashboard Sections

Reconciliation Metrics (3 panels)
- Reconciliation rate by resource type and result
- P95 reconciliation duration
- Reconciliation errors by type
Webhook Metrics (3 panels)
- Webhook request rate by operation
- P95 webhook duration
- Webhook denials by reason
Resource Inventory (2 panels)
- Resource inventory timeline by state and namespace (filterable by resource type)
- Current resource count by state (filterable by resource type)
Operational Health (2 panels)
- Reconciliation success rate gauges
- Webhook admission success rate gauges
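The success rate gauges in the Operational Health section can be understood as a ratio of counters. The sketch below shows one plausible way to compute such a ratio from per-result counts. The dashboard's actual queries are not shown here, so the formula is an assumption, including the choice to count requeue results as non-successes. The counts are invented.

```python
def success_rate(counts_by_result, success_label="success"):
    """Return the successful fraction of all results, or None if there are none.

    Assumption: every result other than `success_label` counts against
    the rate. The dashboard may define this differently.
    """
    total = sum(counts_by_result.values())
    if total == 0:
        return None  # avoid dividing by zero when nothing has happened yet
    return counts_by_result.get(success_label, 0) / total


# Hypothetical increases over a time window, keyed by the `result` label.
reconcile_rate = success_rate({"success": 90, "error": 5, "requeue": 5})
webhook_rate = success_rate({"allowed": 8, "denied": 2}, success_label="allowed")
print(reconcile_rate, webhook_rate)
```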

Deploying the Dashboard

$ kubectl apply -f deploy/observability/k8s/grafana-operator-dashboard-configmap.yaml

The dashboard appears in Grafana automatically, provided the Grafana dashboard sidecar is configured (kube-prometheus-stack includes it).

Finding the Dashboard

Port-forward to Grafana (if needed):

$ kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

Log in to Grafana at http://localhost:3000
Navigate to Dashboards → Search for “Dynamo Operator”

Dashboard Filters

The dashboard includes two filter variables:

Namespace: View metrics across all namespaces or filter by specific ones (multi-select)
Resource Type: Filter all panels by resource type or select “All” to see aggregated metrics across all CRDs (single select)

When “All” is selected for Resource Type, every panel shows data for all five managed CRDs, distinguished by the resource_type label.

Accessing Metrics Directly

For instructions on accessing Prometheus and Grafana, see the Kubernetes Metrics Guide.

Once you have access to Prometheus, you can query operator metrics directly:

# Port-forward to Prometheus
$ kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

# Visit http://localhost:9090 and try queries like:
# - dynamo_operator_reconcile_total
# - dynamo_operator_webhook_requests_total
# - dynamo_operator_resources_total
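The same queries can also be run programmatically against the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library and assumes the port-forward above is active on http://localhost:9090. The helper names are hypothetical, and the network call is left commented out so the block has no side effects when run.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Convert an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query_prometheus(base_url, promql, timeout=10):
    """Run an instant query and return parsed (labels, value) pairs."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


# Example call (requires the port-forward to be running):
# for labels, value in query_prometheus("http://localhost:9090",
#                                       "dynamo_operator_resources_total"):
#     print(labels.get("resource_type"), labels.get("status"), value)
```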

Troubleshooting

Metrics Not Appearing in Prometheus

Check ServiceMonitor exists:

$ kubectl get servicemonitor -n dynamo-system | grep operator

Check ServiceMonitor is discovered by Prometheus:
- Go to Prometheus UI → Status → Targets
- Look for serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator
- Should show state: UP

Check Prometheus selector configuration:

$ kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector

Ensure serviceMonitorSelectorNilUsesHelmValues: false was set during kube-prometheus-stack installation.

Dashboard Not Appearing in Grafana

Check ConfigMap is created:

$ kubectl get configmap -n monitoring grafana-operator-dashboard

Check ConfigMap has the label:

$ kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'

Should return "1"

Check Grafana dashboard sidecar configuration:

$ kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar

The sidecar should be configured to watch for the grafana_dashboard: "1" label.

Restart Grafana pod to force dashboard refresh:

$ kubectl rollout restart deployment/prometheus-grafana -n monitoring

Related Documentation

Kubernetes Metrics Guide - Application metrics for frontends and workers
Dynamo Operator Guide - Operator architecture and deployment modes
Operator Webhooks - Webhook validation details
