TensorRT-LLM Prometheus Metrics
Overview
When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_`) from a single worker backend endpoint.
Additional performance metrics are available via non-Prometheus APIs (see Non-Prometheus Performance Metrics below).
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes 5 basic Prometheus metrics. Note that the trtllm_ prefix is added by Dynamo.
For Dynamo runtime metrics, see the Dynamo Metrics Guide.
For visualization setup instructions, see the Prometheus and Grafana Setup Guide.
Environment Variables
Getting Started Quickly
This is a single-machine example.
Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.
Launch Dynamo Components
Launch a frontend and TensorRT-LLM backend to test metrics:
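The exact commands depend on your deployment; the sketch below assumes the `python -m dynamo.frontend` and `python -m dynamo.trtllm` entry points, a Qwen/Qwen3-0.6B model, and `DYN_SYSTEM_*` environment variables for the worker's metrics port. Treat these as placeholders and check the flags available in your release:

```bash
# Hedged example only: entry points, flags, model, and environment variables
# below are assumptions; consult `python -m dynamo.trtllm --help` for your release.

# Start the OpenAI-compatible frontend (default HTTP port 8000 assumed).
python -m dynamo.frontend &

# Start the TensorRT-LLM worker with engine metrics publishing enabled.
# DYN_SYSTEM_ENABLED/DYN_SYSTEM_PORT (assumed variable names) expose the
# worker's system endpoint, including /metrics, on port 8081.
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.trtllm \
  --model-path Qwen/Qwen3-0.6B \
  --publish-events-and-metrics &
```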
Note: The backend must be set to "pytorch" for metrics collection (enforced in components/src/dynamo/trtllm/main.py). TensorRT-LLM’s MetricsCollector integration has only been tested/validated with the PyTorch backend.
Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
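For example (the model name and ports below are assumptions: the frontend's OpenAI-compatible API is assumed to listen on port 8000, and the worker's metrics endpoint on port 8081 as noted in the Overview):

```bash
# Send a test completion request through the frontend (model name is a placeholder).
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'

# Verify that TensorRT-LLM engine metrics appear on the worker's /metrics endpoint.
curl -s localhost:8081/metrics | grep "^trtllm_"
```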
Exposed Metrics
TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All TensorRT-LLM engine metrics use the trtllm_ prefix and include labels (e.g., model_name, engine_type, finished_reason) to identify the source.
Note: TensorRT-LLM uses model_name instead of Dynamo’s standard model label convention.
Example Prometheus Exposition Format text:
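The sketch below is illustrative only; the HELP text, bucket boundaries, label values, and sample values are placeholders, and the histogram output is truncated:

```text
# HELP trtllm_request_success_total Number of successfully finished requests
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{engine_type="pytorch",finished_reason="stop",model_name="Qwen/Qwen3-0.6B"} 3.0
# HELP trtllm_time_to_first_token_seconds Time to first token in seconds
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_bucket{engine_type="pytorch",le="0.1",model_name="Qwen/Qwen3-0.6B"} 2.0
trtllm_time_to_first_token_seconds_bucket{engine_type="pytorch",le="+Inf",model_name="Qwen/Qwen3-0.6B"} 3.0
trtllm_time_to_first_token_seconds_sum{engine_type="pytorch",model_name="Qwen/Qwen3-0.6B"} 0.42
trtllm_time_to_first_token_seconds_count{engine_type="pytorch",model_name="Qwen/Qwen3-0.6B"} 3.0
```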
Note: The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual /metrics endpoint for the current list.
Metric Categories
TensorRT-LLM provides metrics in the following categories (all prefixed with trtllm_):
- Request metrics - Request success tracking and latency measurements
- Performance metrics - Time to first token (TTFT), time per output token (TPOT), and queue time
Note: Metrics may change between TensorRT-LLM versions. Always inspect the /metrics endpoint for your version.
Available Metrics
The following metrics are exposed via Dynamo’s /metrics endpoint (with the trtllm_ prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:
- `trtllm_request_success_total` (Counter) — Count of successfully processed requests by finish reason
  - Labels: `model_name`, `engine_type`, `finished_reason`
- `trtllm_e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds)
  - Labels: `model_name`, `engine_type`
- `trtllm_request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds)
  - Labels: `model_name`, `engine_type`
These metric names and availability are subject to change with TensorRT-LLM version updates.
TensorRT-LLM provides Prometheus metrics through the MetricsCollector class (see tensorrt_llm/metrics/collector.py).
Non-Prometheus Performance Metrics
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.
Available via Code References
- RequestPerfMetrics Structure: `tensorrt_llm/executor/result.py` - KV cache, timing, and speculative decoding metrics
- Engine Statistics: `engine.llm.get_stats_async()` - System-wide aggregate statistics
- KV Cache Events: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations
Example RequestPerfMetrics JSON Structure
Note: These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.
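As a rough illustration of the shape of this data (all field names and values below are illustrative, not a definitive schema; consult `tensorrt_llm/executor/result.py` in your installed version for the actual fields):

```json
{
  "timing_metrics": {
    "arrival_time": 1727000000.12,
    "first_scheduled_time": 1727000000.15,
    "first_token_time": 1727000000.31,
    "last_token_time": 1727000001.02
  },
  "kv_cache_metrics": {
    "num_total_allocated_blocks": 8,
    "num_new_allocated_blocks": 6,
    "num_reused_blocks": 2,
    "num_missed_blocks": 6
  },
  "speculative_decoding": {
    "acceptance_rate": 0.0,
    "total_accepted_draft_tokens": 0,
    "total_draft_tokens": 0
  }
}
```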
Implementation Details
- Prometheus Integration: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see collector.py)
- Dynamo Integration: Uses the `register_engine_metrics_callback()` function with `add_prefix="trtllm_"` (a generic illustration of this pattern follows this list)
- Engine Configuration: `return_perf_metrics` is set to `True` when `--publish-events-and-metrics` is enabled
- Initialization: Metrics appear after TensorRT-LLM engine initialization completes
- Metadata: `MetricsCollector` is initialized with model metadata (model name, engine type)
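As a generic illustration only (this is not Dynamo's actual implementation, and `register_engine_metrics_callback()` may work differently), the idea of re-exposing an engine-owned Prometheus registry under a name prefix can be sketched with `prometheus_client` like this:

```python
# Generic sketch, not Dynamo's code: re-expose metrics collected in a separate
# engine-owned registry under a name prefix on the default registry that the
# /metrics endpoint serves.
from prometheus_client import REGISTRY, CollectorRegistry
from prometheus_client.metrics_core import Metric


class PrefixedCollector:
    """Yields every metric from `source` with `prefix` prepended to its name."""

    def __init__(self, source: CollectorRegistry, prefix: str):
        self.source = source
        self.prefix = prefix

    def collect(self):
        for metric in self.source.collect():
            out = Metric(self.prefix + metric.name, metric.documentation, metric.type)
            for s in metric.samples:
                out.add_sample(self.prefix + s.name, s.labels, s.value, s.timestamp, s.exemplar)
            yield out


# `engine_registry` is a hypothetical registry that the engine's MetricsCollector
# would write into; registering the wrapper makes its metrics appear with a
# "trtllm_" prefix wherever the default REGISTRY is scraped.
engine_registry = CollectorRegistry()
REGISTRY.register(PrefixedCollector(engine_registry, "trtllm_"))
```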
Related Documentation
TensorRT-LLM Metrics
- See the Non-Prometheus Performance Metrics section above for detailed performance data and source code references
- TensorRT-LLM Metrics Collector - Source code reference
Dynamo Metrics
- Dynamo Metrics Guide - Complete documentation on Dynamo runtime metrics
- Prometheus and Grafana Setup - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration