Additional ResourcesBackend DetailsvLLM

vLLM Prometheus Metrics

View as Markdown

📚 Official Documentation: vLLM Metrics Design

This document describes how vLLM Prometheus metrics are exposed in Dynamo.

Overview

When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo’s /metrics endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with vllm:) and Dynamo runtime metrics (prefixed with dynamo_*) from a single worker backend endpoint.

For the complete and authoritative list of all vLLM metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in docs/observability/metrics.md.

Metric Reference

The official documentation includes:

  • Complete metric definitions with detailed explanations
  • Counter, Gauge, and Histogram metrics
  • Metric labels (e.g., model_name, finished_reason, scheduling_event)
  • Design rationale and implementation details
  • Information about v1 metrics migration
  • Future work and deprecated metrics

Metric Categories

vLLM provides metrics in the following categories (all prefixed with vllm:):

  • Request metrics
  • Performance metrics
  • Resource usage
  • Scheduler metrics
  • Disaggregation metrics (when enabled)

Note: Specific metrics are subject to change between vLLM versions. Always refer to the official documentation or inspect the /metrics endpoint for your vLLM version.

Enabling Metrics in Dynamo

vLLM metrics are automatically exposed when running vLLM through Dynamo with metrics enabled.

Inspecting Metrics

To see the actual metrics available in your vLLM version:

1. Launch vLLM with Metrics Enabled

$# Set system metrics port (automatically enables metrics server)
$export DYN_SYSTEM_PORT=8081
$
$# Start vLLM worker (metrics enabled by default via --disable-log-stats=false)
$python -m dynamo.vllm --model <model_name>
$
$# Wait for engine to initialize

Metrics will be available at: http://localhost:8081/metrics

2. Fetch Metrics via curl

$curl http://localhost:8081/metrics | grep "^vllm:"

3. Example Output

Note: The specific metrics shown below are examples and may vary depending on your vLLM version. Always inspect your actual /metrics endpoint for the current list.

# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38

Implementation Details

  • vLLM v1 uses multiprocess metrics collection via prometheus_client.multiprocess
  • PROMETHEUS_MULTIPROC_DIR: (optional). By default, Dynamo automatically manages this environment variable, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when /metrics is scraped. Users only need to set this explicitly where complete control over the metrics directory is required.
  • Dynamo uses MultiProcessCollector to aggregate metrics from all worker processes
  • Metrics are filtered by the vllm: and lmcache: prefixes before being exposed (when LMCache is enabled)
  • The integration uses Dynamo’s register_engine_metrics_callback() function with the global REGISTRY
  • Metrics appear after vLLM engine initialization completes
  • vLLM v1 metrics are different from v0 - see the official documentation for migration details

See Also

vLLM Metrics

Dynamo Metrics

  • Dynamo Metrics Guide: See docs/observability/metrics.md for complete documentation on Dynamo runtime metrics
  • Dynamo Runtime Metrics: Metrics prefixed with dynamo_* for runtime, components, endpoints, and namespaces
    • Implementation: lib/runtime/src/metrics.rs (Rust runtime metrics)
    • Metric names: lib/runtime/src/metrics/prometheus_names.rs (metric name constants)
    • Available at the same /metrics endpoint alongside vLLM metrics
  • Integration Code: components/src/dynamo/common/utils/prometheus.py - Prometheus utilities and callback registration