# TensorRT-LLM Prometheus Metrics
This document describes how TensorRT-LLM Prometheus metrics are exposed in Dynamo, as well as where to find non-Prometheus metrics.
## Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
Additional performance metrics are available via non-Prometheus APIs; see the `RequestPerfMetrics` section below.
As of this writing, the bundled TensorRT-LLM version 1.1.0rc5 exposes five basic Prometheus metrics. Note that the `trtllm:` prefix is added by Dynamo.
Dynamo runtime metrics are documented in `docs/observability/metrics.md`.
## Metric Reference
TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see `tensorrt_llm/metrics/collector.py`), which includes:

- Counter and Histogram metrics
- Metric labels (e.g., `model_name`, `engine_type`, `finished_reason`); note that TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention
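If you scrape both label conventions into one Prometheus instance and want a uniform `model` label, one option is a metric relabel rule. This is an optional sketch of a scrape config; the job name and target are placeholders:

```yaml
scrape_configs:
  - job_name: dynamo-trtllm-worker        # placeholder job name
    static_configs:
      - targets: ["localhost:8081"]       # Dynamo worker /metrics endpoint
    metric_relabel_configs:
      # Copy TensorRT-LLM's model_name label into Dynamo's conventional model label.
      - source_labels: [model_name]
        target_label: model
```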
### Current Prometheus Metrics (TensorRT-LLM 1.1.0rc5)
The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm:` prefix added by Dynamo):
- `trtllm:request_success_total` (Counter) — Count of successfully processed requests by finish reason
  - Labels: `model_name`, `engine_type`, `finished_reason`
- `trtllm:e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm:time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm:time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm:request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
  - Labels: `model_name`, `engine_type`
These metric names and availability are subject to change with TensorRT-LLM version updates.
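As a hedged example of consuming the histograms above, a PromQL query for 90th-percentile TTFT per model might look like the following (this assumes Prometheus is scraping the endpoint and uses the `model_name` label as exposed; colons are legal in Prometheus metric names):

```promql
histogram_quantile(
  0.9,
  sum by (le, model_name) (rate(trtllm:time_to_first_token_seconds_bucket[5m]))
)
```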
### Metric Categories
TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm:`):
- Request metrics (latency, throughput)
- Performance metrics (TTFT, TPOT, queue time)
Note: Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
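Because the set of metrics can drift between versions, it can help to list the `trtllm:`-prefixed samples programmatically. Below is a minimal stdlib-only sketch; the sample payload is illustrative, not real engine output:

```python
import re

# Illustrative scrape payload -- real output depends on your TensorRT-LLM version.
SAMPLE = """\
# HELP trtllm:request_success_total Count of successfully processed requests
# TYPE trtllm:request_success_total counter
trtllm:request_success_total{model_name="my-model",engine_type="pytorch",finished_reason="stop"} 42
dynamo_component_requests_total{dynamo_component="worker"} 7
"""

# Matches `name{label="value",...} value`; colons are legal in Prometheus metric names.
SAMPLE_RE = re.compile(r'^(trtllm:[A-Za-z0-9_:]+)\{([^}]*)\}\s+(\S+)$')

def trtllm_samples(text):
    """Yield (metric_name, labels_dict, value) for trtllm-prefixed samples."""
    for line in text.splitlines():
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = {}
            for pair in raw_labels.split(","):
                key, _, val = pair.partition("=")
                labels[key] = val.strip('"')
            yield name, labels, float(value)

for name, labels, value in trtllm_samples(SAMPLE):
    print(f"{name} model={labels.get('model_name')} value={value}")
```

In practice you would feed the function the body of an HTTP GET against the worker's `/metrics` endpoint instead of the hardcoded string.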
## Enabling Metrics in Dynamo
TensorRT-LLM Prometheus metrics are automatically exposed when running TensorRT-LLM through Dynamo with the `--publish-events-and-metrics` flag.
### Required Configuration

#### Backend Requirement

- `backend`: Must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`)
  - TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend
## Inspecting Metrics
To see the actual metrics available in your TensorRT-LLM version:
1. Launch TensorRT-LLM with Metrics Enabled
Metrics will be available at: `http://localhost:8081/metrics`
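The exact worker invocation depends on your deployment. As a rough sketch, assuming the worker module under `components/src/dynamo/trtllm` is run directly (the module path and the placeholder flags are assumptions; only `--publish-events-and-metrics` is confirmed above):

```console
$ python -m dynamo.trtllm <your usual worker flags> --publish-events-and-metrics
```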
2. Fetch Metrics via curl
3. Example Output
Note: The specific metrics shown below are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.
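A hypothetical session (metric values, label values, and bucket boundaries are illustrative only):

```console
$ curl -s http://localhost:8081/metrics | grep "^trtllm:"
trtllm:request_success_total{model_name="<model>",engine_type="pytorch",finished_reason="stop"} 42.0
trtllm:e2e_request_latency_seconds_count{model_name="<model>",engine_type="pytorch"} 42.0
trtllm:e2e_request_latency_seconds_sum{model_name="<model>",engine_type="pytorch"} 15.7
trtllm:time_to_first_token_seconds_bucket{model_name="<model>",engine_type="pytorch",le="0.1"} 30.0
```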
## Implementation Details

- Prometheus Integration: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see `collector.py`)
- Dynamo Integration: Uses the `register_engine_metrics_callback()` function with `add_prefix="trtllm:"`
- Engine Configuration: `return_perf_metrics` is set to `True` when `--publish-events-and-metrics` is enabled
- Initialization: Metrics appear after TensorRT-LLM engine initialization completes
- Metadata: `MetricsCollector` is initialized with model metadata (model name, engine type)
## TensorRT-LLM Specific: Non-Prometheus Performance Metrics
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not exposed to Prometheus.
Available via code references:

- `RequestPerfMetrics` Structure: `tensorrt_llm/executor/result.py` — KV cache, timing, speculative decoding metrics
- Engine Statistics: `engine.llm.get_stats_async()` — system-wide aggregate statistics
- KV Cache Events: `engine.llm.get_kv_cache_events_async()` — real-time cache operations
Example `RequestPerfMetrics` JSON structure:
Note: These structures are valid as of this writing but are subject to change with TensorRT-LLM version updates.
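The authoritative definition lives in `tensorrt_llm/executor/result.py`. The sketch below only illustrates the general shape suggested by the categories above (timing, KV cache, speculative decoding); every field name here is an assumption to verify against your version:

```json
{
  "timing_metrics": {
    "arrival_time": 0.0,
    "first_token_time": 0.0,
    "last_token_time": 0.0
  },
  "kv_cache_metrics": {
    "num_reused_blocks": 0,
    "num_missed_blocks": 0
  },
  "speculative_decoding": {
    "total_draft_tokens": 0,
    "total_accepted_draft_tokens": 0
  }
}
```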
## See Also
### TensorRT-LLM Metrics
- See the “TensorRT-LLM Specific: Non-Prometheus Performance Metrics” section above for detailed performance data and source code references
### Dynamo Metrics
- Dynamo Metrics Guide: See `docs/observability/metrics.md` for complete documentation on Dynamo runtime metrics
- Dynamo Runtime Metrics: Metrics prefixed with `dynamo_*` for runtime, components, endpoints, and namespaces
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
- Integration Code: `components/src/dynamo/common/utils/prometheus.py` — Prometheus utilities and callback registration