Dynamo Observability

Monitor Dynamo deployments with metrics, logging, and tracing

Getting Started Quickly

This is an example to get started quickly on a single machine.

Prerequisites

Install these on your machine:

Starting the Observability Stack

Dynamo provides a Docker Compose-based observability stack: Prometheus for metrics collection, Grafana for visualization, Tempo for tracing, and exporters for GPU (DCGM) and NATS metrics.

From the Dynamo root directory:

# Start infrastructure (NATS, etcd)
docker compose -f deploy/docker-compose.yml up -d

# Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter)
docker compose -f deploy/docker-observability.yml up -d

For detailed setup instructions and configuration, see Prometheus + Grafana Setup.

Observability Documentation

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics | Available metrics reference | DYN_SYSTEM_PORT† |
| Operator Metrics (Kubernetes) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
| Health Checks | Component health monitoring and readiness probes | DYN_SYSTEM_PORT†, DYN_SYSTEM_STARTING_HEALTH_STATUS, DYN_SYSTEM_HEALTH_PATH, DYN_SYSTEM_LIVE_PATH, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS |
| Tracing | Distributed tracing with OpenTelemetry and Tempo | DYN_LOGGING_JSONL†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_SERVICE_NAME† |
| Logging | Structured logging configuration | DYN_LOGGING_JSONL†, DYN_LOG, DYN_LOG_USE_LOCAL_TZ, DYN_LOGGING_CONFIG_PATH, OTEL_SERVICE_NAME†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT† |

Variables marked with † are shared across multiple observability systems.
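
As a rough illustration of how the variables in the table combine, the sketch below sets the ones needed to get JSONL logs and exported traces from a single Dynamo process. Every value is an assumption made for this example: the port number, the `true` spelling of the boolean flags, the OTLP endpoint (the conventional OTLP gRPC port 4317 on localhost), and the service name `dynamo-frontend` are all hypothetical. Check the individual guides for the accepted values and defaults.

```shell
# Sketch only: all values are illustrative assumptions, not documented defaults.
export DYN_SYSTEM_PORT=8081        # assumed port for the system endpoints (metrics, health)
export DYN_LOGGING_JSONL=true      # structured JSONL logs; listed under both Tracing and Logging
export OTEL_EXPORT_ENABLED=true    # turn on trace export
# Assumed OTLP endpoint of the local Tempo instance (4317 is the conventional OTLP gRPC port)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=dynamo-frontend   # hypothetical name under which traces and logs are grouped
```

Exporting the variables in the shell that launches the process keeps the configuration in one place; the same names can be passed through a container's environment instead.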

Developer Guides

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics Developer Guide | Creating custom metrics in Rust and Python | DYN_SYSTEM_PORT |

Kubernetes

For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.

Operator Metrics: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the Operator Metrics Guide.


Topology

Once started, the observability stack exposes the following services:

  • Prometheus on http://localhost:9090 - metrics collection and querying
  • Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)
  • Tempo on http://localhost:3200 - distributed tracing backend
  • DCGM Exporter on http://localhost:9401/metrics - GPU metrics
  • NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
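
One way to confirm that the services listed above are reachable is to probe each one over HTTP. The following is a minimal sketch, not part of Dynamo: the hosts and ports come from the list, while the readiness paths (`/-/ready` for Prometheus, `/api/health` for Grafana, `/ready` for Tempo) are assumptions about the stock images, and the function names `probe` and `check_stack` are hypothetical.

```shell
# Sketch: probe each service from the list above and report whether it answers.
# Readiness paths are assumptions about the stock images; ports come from the list.

# name|url pairs, one per line
STACK_ENDPOINTS='prometheus|http://localhost:9090/-/ready
grafana|http://localhost:3000/api/health
tempo|http://localhost:3200/ready
dcgm-exporter|http://localhost:9401/metrics
nats-exporter|http://localhost:7777/metrics'

# probe URL: succeed if the URL answers with a non-error HTTP status within 3 s
probe() {
  curl --silent --fail --output /dev/null --max-time 3 "$1"
}

# check_stack: print one "<name> up|DOWN <url>" line per service;
# return 1 if any service is down, 0 otherwise
check_stack() {
  local failed=0 name url
  while IFS='|' read -r name url; do
    if probe "$url"; then
      printf '%-14s up    %s\n' "$name" "$url"
    else
      printf '%-14s DOWN  %s\n' "$name" "$url"
      failed=1
    fi
  done <<EOF
$STACK_ENDPOINTS
EOF
  return "$failed"
}
```

After sourcing the snippet, running `check_stack` prints one status line per service and returns a non-zero exit code if any of them did not respond, which makes it usable as a gate in a setup script.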

Service Relationship Diagram

The dcgm-exporter service in the Docker Compose network listens on port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, a situation that is common on shared clusters such as those managed by SLURM.
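
To check that the exporter on port 9401 is the one serving GPU metrics, it can help to list the metric names it returns. The helper below is a sketch under one assumption: that DCGM metric names start with `DCGM_` (for example `DCGM_FI_DEV_GPU_UTIL`). The function name `dcgm_metric_names` is hypothetical.

```shell
# Sketch: list the distinct DCGM metric names found in Prometheus text-format
# input read from stdin. Assumes DCGM metric names start with "DCGM_".
dcgm_metric_names() {
  # Drop "# HELP" / "# TYPE" comment lines, keep DCGM samples, cut each line
  # at the first "{" or space so only the metric name remains, then de-duplicate.
  grep -v '^#' | grep '^DCGM_' | sed 's/[{ ].*$//' | sort -u
}
```

With the stack running, `curl -s http://localhost:9401/metrics | dcgm_metric_names` prints the names exposed by this instance. An empty result suggests the request reached a different service or that no GPUs are visible to the exporter.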

Configuration Files

The following configuration files are located in the deploy/observability/ directory: