SGLang Prometheus Metrics

📚 Official Documentation: SGLang Production Metrics

This document describes how SGLang Prometheus metrics are exposed in Dynamo.

Overview

When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo’s /metrics endpoint (default port 8081). This lets you read both SGLang engine metrics (prefixed with sglang:) and Dynamo runtime metrics (prefixed with dynamo_) from a single endpoint on the worker backend.
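Because both metric families share one endpoint, a client usually has to separate them by prefix. The following is a minimal sketch of that split, assuming the exposition text has already been fetched. The function name group_metric_names, the bucket labels, and the dynamo_example_requests_total metric are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Prefixes come from the paragraph above; the bucket labels are illustrative.
PREFIXES = {"sglang:": "sglang", "dynamo_": "dynamo"}


def group_metric_names(exposition_text: str) -> dict[str, set[str]]:
    """Group sample names found in Prometheus exposition text by prefix."""
    groups: dict[str, set[str]] = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        bucket = next(
            (label for prefix, label in PREFIXES.items() if name.startswith(prefix)),
            "other",
        )
        groups[bucket].add(name)
    return dict(groups)


if __name__ == "__main__":
    # dynamo_example_requests_total is a made-up name, not a real Dynamo metric.
    sample = (
        'sglang:cache_hit_rate{model_name="m"} 0.0075\n'
        "dynamo_example_requests_total 3\n"
    )
    for bucket, names in sorted(group_metric_names(sample).items()):
        print(bucket, sorted(names))
```

Anything that matches neither prefix lands in an "other" bucket, which makes unexpected metric families easy to spot.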

For the complete and authoritative list of all SGLang metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in docs/observability/metrics.md.

Metric Reference

The official documentation includes:

Complete metric definitions with HELP and TYPE descriptions
Example metric output in Prometheus exposition format
Counter, Gauge, and Histogram metrics
Metric labels (e.g., model_name, engine_type, tp_rank, pp_rank)
Setup guide for Prometheus + Grafana monitoring
Troubleshooting tips and configuration examples

Metric Categories

SGLang provides metrics in the following categories (all prefixed with sglang:):

Throughput metrics
Resource usage
Latency metrics
Disaggregation metrics (when enabled)

Note: Specific metrics are subject to change between SGLang versions. Always refer to the official documentation or inspect the /metrics endpoint for your SGLang version.

Enabling Metrics in Dynamo

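SGLang metrics are exposed automatically when the SGLang worker is launched through Dynamo with metrics enabled: set DYN_SYSTEM_PORT to start the system metrics server and pass --enable-metrics to the worker, as in the launch step under Inspecting Metrics.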

Inspecting Metrics

To see the actual metrics available in your SGLang version:

1. Launch SGLang with Metrics Enabled

# Set the system metrics port (this automatically enables the metrics server)
$ export DYN_SYSTEM_PORT=8081

# Start the SGLang worker with metrics enabled
$ python -m dynamo.sglang --model <model_name> --enable-metrics

# Wait for the engine to initialize

Metrics will be available at: http://localhost:8081/metrics
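Since the engine needs time to initialize, a script that scrapes the endpoint right after launch may see no sglang: lines yet. One way to handle this is to poll until the first SGLang sample appears. This is a sketch using only the Python standard library; the names fetch_metrics and wait_for_sglang_metrics, the attempt count, and the delay are assumptions chosen for illustration and are not part of Dynamo or SGLang.

```python
import time
import urllib.request
from typing import Callable

METRICS_URL = "http://localhost:8081/metrics"  # URL from the step above


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Return the raw exposition text served at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def wait_for_sglang_metrics(
    fetch: Callable[[], str] = fetch_metrics,
    attempts: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Poll until a sample line starting with 'sglang:' appears."""
    for attempt in range(attempts):
        try:
            text = fetch()
        except OSError:
            # Covers connection refused, timeouts, and urllib errors:
            # the metrics server is not reachable yet.
            text = ""
        if any(line.startswith("sglang:") for line in text.splitlines()):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    ready = wait_for_sglang_metrics()
    print("SGLang metrics available" if ready else "Timed out waiting for SGLang metrics")
```

The fetch argument is injectable so the polling logic can be exercised without a running worker.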

2. Fetch Metrics via curl

$ curl http://localhost:8081/metrics | grep "^sglang:"

3. Example Output

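Note: The specific metrics shown below are examples and may vary depending on your SGLang version. Always inspect your actual /metrics endpoint for the current list. The # HELP and # TYPE lines are part of the raw endpoint output; the anchored ^sglang: pattern in step 2 matches sample lines only, so use an unanchored pattern such as grep "sglang:" to see them.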

# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
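Output in this format is straightforward to consume programmatically. The sketch below parses sample lines into (name, labels, value) tuples with regular expressions. It is a simplified reader written for this example, not a full implementation of the Prometheus exposition format: it does not unescape label values, and parse_samples is a hypothetical helper name.

```python
import re

# <name>{<labels>} <value>, where the label block is optional.
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
# key="value" pairs; escaped characters inside the value are kept as-is.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse sample lines, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXAMPLE = """\
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
"""

if __name__ == "__main__":
    for name, labels, value in parse_samples(EXAMPLE):
        print(name, labels.get("model_name"), value)
```

For production scraping, a maintained parser or Prometheus itself is the safer choice; this sketch only shows the structure of each line.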

Implementation Details

SGLang uses multiprocess metrics collection via prometheus_client.multiprocess.MultiProcessCollector
Metrics are filtered by the sglang: prefix before being exposed
The integration uses Dynamo’s register_engine_metrics_callback() function
Metrics appear after SGLang engine initialization completes
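To illustrate what filtering by the sglang: prefix means for the exposed output, the sketch below applies the same idea to raw exposition text, keeping a metric's # HELP and # TYPE lines together with its samples. This is not Dynamo's implementation: it works on text rather than on collector output, it does not use MultiProcessCollector or register_engine_metrics_callback(), and filter_by_prefix is a hypothetical name.

```python
def filter_by_prefix(exposition_text: str, prefix: str = "sglang:") -> str:
    """Keep only the lines that belong to metrics whose name starts with `prefix`."""
    kept = []
    for line in exposition_text.splitlines():
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            # Comment lines look like "# HELP <name> <doc>" or "# TYPE <name> <type>".
            parts = line.split(" ", 3)
            name = parts[2] if len(parts) > 2 else ""
        elif line.startswith("#") or not line.strip():
            continue  # other comments and blank lines are dropped
        else:
            name = line  # a sample line starts with the metric name
        if name.startswith(prefix):
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    raw = (
        "# HELP sglang:cache_hit_rate The cache hit rate\n"
        "# TYPE sglang:cache_hit_rate gauge\n"
        'sglang:cache_hit_rate{model_name="m"} 0.0075\n'
        "# HELP python_gc_objects_collected_total Objects collected during gc\n"
        "# TYPE python_gc_objects_collected_total counter\n"
        'python_gc_objects_collected_total{generation="0"} 10.0\n'
    )
    print(filter_by_prefix(raw), end="")
```

Filtering this way drops process-level metrics that a Python metrics client typically adds, so only the engine's own sglang: series remain.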