Additional ResourcesTensorRT-LLM Details

Logits Processing

View as Markdown

For general TensorRT-LLM features and configuration, see the Reference Guide.


Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

How it works

  • Interface: Implement dynamo.logits_processing.BaseLogitsProcessor which defines __call__(input_ids, logits) and modifies logits in-place.
  • TRT-LLM adapter: Use dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...) to convert Dynamo processors into TRT-LLM-compatible processors and assign them to SamplingParams.logits_processor.
  • Examples: See example processors in lib/bindings/python/src/dynamo/logits_processing/examples/ (temperature, hello_world).

Quick test: HelloWorld processor

You can enable a test-only processor that forces the model to respond with “Hello world!”. This is useful to verify the wiring without modifying your model or engine code.

$cd $DYNAMO_HOME/examples/backends/trtllm
$export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
$./launch/agg.sh
  • When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
  • Expected chat response contains “Hello world”.

Bring your own processor

Implement a processor by conforming to BaseLogitsProcessor and modify logits in-place. For example, temperature scaling:

1from typing import Sequence
2import torch
3from dynamo.logits_processing import BaseLogitsProcessor
4
5class TemperatureProcessor(BaseLogitsProcessor):
6 def __init__(self, temperature: float = 1.0):
7 if temperature <= 0:
8 raise ValueError("Temperature must be positive")
9 self.temperature = temperature
10
11 def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
12 if self.temperature == 1.0:
13 return
14 logits.div_(self.temperature)

Wire it into TRT-LLM by adapting and attaching to SamplingParams:

1from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
2from dynamo.logits_processing.examples import TemperatureProcessor
3
4processors = [TemperatureProcessor(temperature=0.7)]
5sampling_params.logits_processor = create_trtllm_adapters(processors)

Current limitations

  • Per-request processing only (batch size must be 1); beam width > 1 is not supported.
  • Processors must modify logits in-place and not return a new tensor.
  • If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).