Dynamo Observability
Getting Started Quickly
This example gets the observability stack running quickly on a single machine.
Prerequisites
Install these on your machine:
Starting the Observability Stack
Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.
From the Dynamo root directory:
For detailed setup instructions and configuration, see Prometheus + Grafana Setup.
Observability Documentation
Variables marked with † are shared across multiple observability systems.
Developer Guides
Kubernetes
For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.
Operator Metrics: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the Operator Metrics Guide.
Topology
This provides:
- Prometheus on http://localhost:9090 - metrics collection and querying
- Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)
- Tempo on http://localhost:3200 - distributed tracing backend
- DCGM Exporter on http://localhost:9401/metrics - GPU metrics
- NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
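Once the stack is up, a quick way to confirm that each service is listening is to probe the endpoints listed above. The following is a minimal sketch, not part of Dynamo: the names ENDPOINTS, is_reachable, and check_all are hypothetical helpers, and the URLs are assumed to match the default port mappings described in this section.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Endpoints copied from the list above; adjust if your compose files
# map different host ports (assumption: defaults are unchanged).
ENDPOINTS = {
    "Prometheus": "http://localhost:9090",
    "Grafana": "http://localhost:3000",
    "Tempo": "http://localhost:3200",
    "DCGM Exporter": "http://localhost:9401/metrics",
    "NATS Exporter": "http://localhost:7777/metrics",
}


def is_reachable(url, timeout=2.0, opener=urlopen):
    """Return True if anything answers HTTP at `url`."""
    try:
        with opener(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered with an error status, so the port is live.
        return True
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def check_all(endpoints=ENDPOINTS, opener=urlopen):
    """Map each service name to whether its endpoint responded."""
    return {name: is_reachable(url, opener=opener)
            for name, url in endpoints.items()}
```

Calling check_all() returns a dictionary keyed by service name. A False entry only means that nothing answered on that port; it does not say why, so the container logs remain the place to look next.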
Service Relationship Diagram
The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on shared clusters such as those managed by SLURM.
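To see whether another process already occupies a port on a given host, you can attempt a TCP connection to it. This is a rough sketch under the assumption that the other exporter listens on the loopback interface; port_in_use is a hypothetical helper, not a Dynamo utility.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0
```

For example, port_in_use(9400) returning True would suggest that a dcgm-exporter (or some other service) is already bound to the default port, which is the situation the 9401 mapping is meant to sidestep.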
Configuration Files
The following configuration files are located in the deploy/observability/ directory:
- docker-compose.yml: Defines NATS and etcd services
- docker-observability.yml: Defines Prometheus, Grafana, Tempo, and exporters
- prometheus.yml: Contains Prometheus scraping configuration
- grafana-datasources.yml: Contains Grafana datasource configuration
- grafana_dashboards/dashboard-providers.yml: Contains Grafana dashboard provider configuration
- grafana_dashboards/dynamo.json: Contains a general Dynamo dashboard covering both software and hardware metrics
- grafana_dashboards/dcgm-metrics.json: Contains Grafana dashboard configuration for DCGM GPU metrics
- grafana_dashboards/kvbm.json: Contains Grafana dashboard configuration for KVBM metrics
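If the stack fails to start, one thing worth ruling out is a missing configuration file. The sketch below checks for the files listed above relative to a checkout root; EXPECTED_FILES and missing_config_files are hypothetical names introduced for illustration, and the layout is assumed to match this section.

```python
from pathlib import Path

# File names taken from the list above, relative to deploy/observability/.
EXPECTED_FILES = [
    "docker-compose.yml",
    "docker-observability.yml",
    "prometheus.yml",
    "grafana-datasources.yml",
    "grafana_dashboards/dashboard-providers.yml",
    "grafana_dashboards/dynamo.json",
    "grafana_dashboards/dcgm-metrics.json",
    "grafana_dashboards/kvbm.json",
]


def missing_config_files(root):
    """Return the expected files that are absent under root/deploy/observability."""
    base = Path(root) / "deploy" / "observability"
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

Running missing_config_files(".") from the Dynamo root directory returns an empty list when every file is present, and otherwise the relative paths that could not be found.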