TensorRT-LLM Multimodal

This document is a guide to multimodal inference using the TensorRT-LLM backend in Dynamo.

You can provide multimodal inputs in the following ways:

  • By sending image URLs
  • By providing paths to pre-computed embedding files

Note: A single request should contain either image URLs or embedding file paths, not both.

Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
| --- | --- | --- | --- | --- |
| Image | HTTP/HTTPS URL | Yes | Yes | Full support for all image models |
| Image | Pre-computed Embeddings (.safetensors) | Yes | Yes | Direct embedding files |
| Video | HTTP/HTTPS URL | No | No | Not implemented |
| Audio | HTTP/HTTPS URL | No | No | Not implemented |

Supported URL Formats

| Format | Example | Description |
| --- | --- | --- |
| HTTP/HTTPS | http://example.com/image.jpg | Remote media files |
| Pre-computed Embeddings | /path/to/embedding.safetensors | Local embedding files (.safetensors only) |
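
The two formats in the table can be told apart on the client side before a request is sent. The following is a minimal sketch of that idea, not Dynamo's own routing logic; the function name `classify_media_reference` and its return labels are hypothetical.

```python
from urllib.parse import urlparse


def classify_media_reference(ref: str) -> str:
    """Return 'url' for HTTP/HTTPS links and 'embedding' for .safetensors paths.

    Hypothetical helper for illustration only; the labels are not Dynamo names.
    """
    scheme = urlparse(ref).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    # A plain filesystem path has no scheme; only .safetensors files are accepted.
    if scheme == "" and ref.lower().endswith(".safetensors"):
        return "embedding"
    raise ValueError(f"Unsupported media reference: {ref!r}")
```

A check like this could help a client avoid mixing image URLs and embedding paths within one request.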

Deployment Patterns

TRT-LLM supports aggregated, traditional disaggregated (EP/D), and full E/P/D patterns. See Multimodal Model Serving for detailed explanations.

| Pattern | Supported | Launch Script | Notes |
| --- | --- | --- | --- |
| Aggregated | Yes | agg.sh | Easiest setup, single worker |
| EP/D (Traditional Disaggregated) | Yes | disagg_multimodal.sh | Prefill handles encoding, 2 workers |
| E/P/D (Full - Image URLs) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with MultimodalEncoder, 3 workers |
| E/P/D (Full - Pre-computed Embeddings) | Yes | epd_multimodal_image_and_embeddings.sh | Standalone encoder with NIXL transfer, 3 workers |
| E/P/D (Large Models) | Yes | epd_disagg.sh | For Llama-4 Scout/Maverick, multi-node |

Component Flags

| Component | Flag | Purpose |
| --- | --- | --- |
| Worker | --modality multimodal | Complete pipeline (aggregated) |
| Prefill Worker | --disaggregation-mode prefill | Image processing + Prefill (multimodal tokenization happens here) |
| Decode Worker | --disaggregation-mode decode | Decode only |
| Encode Worker | --disaggregation-mode encode | Image encoding (E/P/D flow) |
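
To make the relationship between deployment patterns and component flags concrete, here is a small sketch that maps each pattern to the workers it needs and the flag listed for each worker. The dictionary and function names are hypothetical, and only the flags named in the table are included; a real launch script passes further arguments (model path, engine config, and so on).

```python
# Flags copied from the Component Flags table; everything else is illustrative.
WORKER_FLAGS = {
    "aggregated": ["--modality", "multimodal"],
    "prefill": ["--disaggregation-mode", "prefill"],
    "decode": ["--disaggregation-mode", "decode"],
    "encode": ["--disaggregation-mode", "encode"],
}

# Workers per pattern, following the worker counts in the Deployment Patterns table.
PATTERN_WORKERS = {
    "aggregated": ["aggregated"],           # single worker
    "ep/d": ["prefill", "decode"],          # prefill also handles encoding
    "e/p/d": ["encode", "prefill", "decode"],
}


def flags_for_pattern(pattern: str) -> dict:
    """Return {worker_role: flag_list} for a deployment pattern."""
    return {role: WORKER_FLAGS[role] for role in PATTERN_WORKERS[pattern.lower()]}
```

For example, `flags_for_pattern("E/P/D")` yields three entries, one per worker in the full flow.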

Aggregated Serving

Quick steps to launch Llama-4 Maverick BF16 in aggregated mode:

cd $DYNAMO_HOME

export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml
export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct"
./examples/backends/trtllm/launch/agg.sh

Client:

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the image"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        }
      ]
    }
  ],
  "stream": false,
  "max_tokens": 160
}'
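
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload; the helper names `build_chat_payload` and `send_chat_request` are hypothetical, and the default base URL assumes the frontend is listening on localhost:8000 as in the curl example.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_ref: str, max_tokens: int = 160) -> dict:
    """Build a chat completion body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_ref}},
                ],
            }
        ],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send_chat_request(build_chat_payload(...))` requires a running deployment; `build_chat_payload` on its own can be used to inspect the request body offline.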

Disaggregated Serving

Example using Qwen/Qwen2-VL-7B-Instruct:

cd $DYNAMO_HOME

export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct"
export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml"
export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml"
export MODALITY="multimodal"

./examples/backends/trtllm/launch/disagg.sh

Client:

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the image"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        }
      ]
    }
  ],
  "stream": false,
  "max_tokens": 160
}'

For a large model like meta-llama/Llama-4-Maverick-17B-128E-Instruct, a multi-node setup is required for disaggregated serving (see Multi-node Deployment below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node’s GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs.

Full E/P/D Flow (Image URLs)

For high-performance multimodal inference, Dynamo supports a standalone encoder with an Encode-Prefill-Decode (E/P/D) flow using TRT-LLM’s MultimodalEncoder. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability.

Supported Input Formats

| Format | Example | Description |
| --- | --- | --- |
| HTTP/HTTPS URL | https://example.com/image.jpg | Remote image files |
| Base64 Data URL | data:image/jpeg;base64,... | Inline base64-encoded images |
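
A base64 data URL in the form shown in the table can be built from raw image bytes with the standard library. This is a minimal sketch; the helper name `to_data_url` is hypothetical, and the MIME type should match the actual image format.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in the image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string goes in the same "url" field that would otherwise hold an HTTP/HTTPS link.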

How It Works

In the full E/P/D flow:

  1. Encode Worker: Runs TRT-LLM’s MultimodalEncoder.generate() to process image URLs through the vision encoder and projector
  2. Prefill Worker: Receives disaggregated_params containing multimodal embedding handles, processes context and generates KV cache
  3. Decode Worker: Performs streaming token generation using the KV cache

The encode worker uses TRT-LLM’s MultimodalEncoder class (which inherits from BaseLLM) and requires only the model path and batch size. No KV cache configuration is needed because it runs only the vision encoder and projector.
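
The three stages above can be pictured as a hand-off of a small parameter object between workers. The following toy simulation is purely illustrative: it does not call TRT-LLM, and the `DisaggregatedParams` dataclass, its fields, and the three worker functions are hypothetical stand-ins for the real components.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class DisaggregatedParams:
    """Hypothetical stand-in for the params passed between workers."""
    multimodal_embedding_handles: List[str] = field(default_factory=list)
    kv_cache_ref: Optional[str] = None


def encode_worker(image_urls: List[str]) -> DisaggregatedParams:
    # Stand-in for the vision encoder + projector: one opaque handle per image.
    handles = [f"embedding-handle-{i}" for i, _ in enumerate(image_urls)]
    return DisaggregatedParams(multimodal_embedding_handles=handles)


def prefill_worker(prompt: str, params: DisaggregatedParams) -> DisaggregatedParams:
    # Stand-in for context processing: record a reference to the KV cache.
    n_images = len(params.multimodal_embedding_handles)
    params.kv_cache_ref = f"kv-cache(words={len(prompt.split())}, images={n_images})"
    return params


def decode_worker(params: DisaggregatedParams, max_tokens: int) -> Iterator[str]:
    # Stand-in for streaming generation: decode needs the KV cache from prefill.
    if params.kv_cache_ref is None:
        raise RuntimeError("decode requires a KV cache produced by prefill")
    for i in range(max_tokens):
        yield f"<token-{i}>"
```

The point of the sketch is the ordering: decode cannot start without the prefill output, and prefill consumes handles rather than raw images.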

How to Launch

cd $DYNAMO_HOME

# Launch 3-worker E/P/D flow with image URL support
./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh

Example Request

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llava-v1.6-mistral-7b-hf",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
          }
        }
      ]
    }
  ],
  "max_tokens": 160
}'

E/P/D Architecture (Image URLs)

Key Differences from EP/D (Traditional Disaggregated)

| Aspect | EP/D (Traditional) | E/P/D (Full) |
| --- | --- | --- |
| Encoding | Prefill worker handles image encoding | Dedicated encode worker |
| Prefill Load | Higher (encoding + prefill) | Lower (prefill only) |
| Use Case | Simpler setup | Better scalability for vision-heavy workloads |
| Launch Script | disagg_multimodal.sh | epd_multimodal_image_and_embeddings.sh |

Pre-computed Embeddings with E/P/D Flow

For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an Encode-Prefill-Decode (E/P/D) flow using NIXL (RDMA) for zero-copy tensor transfer.

Supported File Types

Security Note: .pt, .pth, and .bin files are rejected because they use Python pickle deserialization, which can execute arbitrary code. Only .safetensors format is accepted.

Embedding File Formats

Embedding files must use the .safetensors format. The first tensor key in the file is used as the embedding tensor.

Saving embeddings:

from safetensors.torch import save_file
import torch

embedding_tensor = torch.rand(1, 576, 4096)  # [batch, seq_len, hidden_dim]
save_file({"embedding": embedding_tensor}, "embedding.safetensors")
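
Because the first tensor key is the one used, it can be useful to inspect a file's keys, dtypes, and shapes before sending it. The sketch below reads only the header of a .safetensors file with the standard library, relying on the published file layout (an 8-byte little-endian header length, followed by a JSON header, followed by raw tensor bytes). The function names are hypothetical, and the key order shown is the order in the file's header, which may or may not match the order Dynamo sees when loading.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the tensor entries from a .safetensors header without loading data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian unsigned 64-bit
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def describe_tensors(path: str) -> list:
    """Return (name, dtype, shape) for each tensor, in header order."""
    return [
        (name, info["dtype"], info["shape"])
        for name, info in read_safetensors_header(path).items()
    ]
```

For a file written by the snippet above, this would be expected to list a single tensor named "embedding" with shape [1, 576, 4096].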

How to Launch

cd $DYNAMO_HOME/examples/backends/trtllm

# Launch 3-worker E/P/D flow with NIXL
./launch/epd_disagg.sh

Note: This script is designed for an 8-node H200 setup with the Llama-4-Scout-17B-16E-Instruct model and assumes you have a model-specific .safetensors embedding file ready.

Configuration

# Encode endpoint for Prefill → Encode communication
export ENCODE_ENDPOINT="dyn://dynamo.encode.generate"

# Security: Allowed directory for embedding files (default: /tmp)
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"

# Security: Max file size to prevent DoS attacks (default: 50MB)
export MAX_FILE_SIZE_MB=50
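
The two security settings, together with the .safetensors-only rule, amount to three checks on any embedding path. The following sketch shows one way such checks could be written; it is not Dynamo's implementation, and the function name `validate_embedding_path` is hypothetical. The environment variable names and defaults are taken from the configuration above.

```python
import os
from pathlib import Path
from typing import Optional


def validate_embedding_path(
    path: str,
    allowed_dir: Optional[str] = None,
    max_size_mb: Optional[int] = None,
) -> Path:
    """Return the resolved path if it passes extension, directory, and size checks."""
    if allowed_dir is None:
        allowed_dir = os.environ.get("ALLOWED_LOCAL_MEDIA_PATH", "/tmp")
    if max_size_mb is None:
        max_size_mb = int(os.environ.get("MAX_FILE_SIZE_MB", "50"))

    # Resolve symlinks and ".." so the directory check cannot be bypassed.
    resolved = Path(path).resolve()
    allowed_root = Path(allowed_dir).resolve()

    if resolved.suffix.lower() != ".safetensors":
        raise ValueError(f"Only .safetensors files are accepted, got {resolved.suffix!r}")
    if allowed_root not in resolved.parents:
        raise PermissionError(f"{resolved} is outside the allowed directory {allowed_root}")
    if resolved.stat().st_size > max_size_mb * 1024 * 1024:
        raise ValueError(f"{resolved} exceeds the {max_size_mb} MB size limit")
    return resolved
```

Resolving the path before comparing it with the allowed directory is the important design choice here, since a raw string prefix check could be defeated by `..` segments or symlinks.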

Example Request with Pre-computed Embeddings

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image"},
        {"type": "image_url", "image_url": {"url": "/path/to/embedding.safetensors"}}
      ]
    }
  ],
  "max_tokens": 160
}'

E/P/D Architecture

The E/P/D flow implements a 3-worker architecture:

  • Encode Worker: Loads pre-computed embeddings, transfers via NIXL
  • Prefill Worker: Receives embeddings, handles context processing and KV-cache generation
  • Decode Worker: Performs streaming token generation

Embedding Cache

Dynamo's embedding cache support for TRT-LLM depends on the deployment setting:

| Setting | Implementation | Launch Script | Status |
| --- | --- | --- | --- |
| Disaggregated Encoder | Dynamo-managed cache in the PD worker layer on top of TRT-LLM engine | disagg_e_pd.sh + --multimodal-embedding-cache-capacity-gb | Supported |
| Aggregated | N/A | N/A | Not yet supported |

The cache uses MultimodalEmbeddingCacheManager to maintain an LRU cache of encoder embeddings on CPU. When the same image is seen again, the cached embedding is reused instead of re-encoding.

Disaggregated Encoder (Embedding Cache in Prefill Worker)

In the disaggregated setting, the Prefill Worker (P) owns a CPU-side LRU embedding cache (EmbeddingCacheManager). On each request, P checks the cache first. On a hit, the Encode Worker is skipped entirely. On a miss, P routes to the Encode Worker (E), receives embeddings via NIXL, saves them to the cache, and then feeds the embeddings along with the request into the TRT-LLM instance for prefill.

The disagg_e_pd.sh script launches a separate encode worker and a PD worker. Extra arguments are forwarded to the PD worker. Enable embedding cache by passing --multimodal-embedding-cache-capacity-gb:

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
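
The hit/miss behavior of a capacity-bounded LRU cache can be sketched in a few lines. This is an illustration of the general technique only, not the cache manager Dynamo ships: the class `LRUEmbeddingCache`, the helper `get_or_encode`, and the byte-based accounting are assumptions made for the example.

```python
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class LRUEmbeddingCache:
    """Capacity-bounded LRU cache; the least recently used entry is evicted first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, Tuple[Any, int]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key: str, embedding: Any, size_bytes: int) -> bool:
        if size_bytes > self.capacity_bytes:
            return False  # can never fit, so skip caching
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self._entries[key] = (embedding, size_bytes)
        self.used_bytes += size_bytes
        return True


def get_or_encode(
    cache: LRUEmbeddingCache, image_key: str, encode_fn: Callable[[str], bytes]
) -> Tuple[bytes, bool]:
    """Return (embedding, was_hit); call encode_fn only on a cache miss."""
    cached = cache.get(image_key)
    if cached is not None:
        return cached, True
    embedding = encode_fn(image_key)
    cache.put(image_key, embedding, len(embedding))
    return embedding, False
```

In this sketch `encode_fn` plays the role of the round trip to the Encode Worker, so a hit avoids that call entirely.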

NIXL Usage

| Use Case | Script | NIXL Used? | Data Transfer |
| --- | --- | --- | --- |
| Aggregated | agg.sh | No | All in one worker |
| EP/D (Traditional Disaggregated) | disagg_multimodal.sh | Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E/P/D (Image URLs) | epd_multimodal_image_and_embeddings.sh | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) |
| E/P/D (Pre-computed Embeddings) | epd_multimodal_image_and_embeddings.sh | Yes | Encoder → Prefill (embeddings via NIXL RDMA) |
| E/P/D (Large Models) | epd_disagg.sh | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) |

Note: NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.

ModelInput Types and Registration

TRT-LLM workers register with Dynamo using:

| ModelInput Type | Preprocessing | Use Case |
| --- | --- | --- |
| ModelInput.Tokens | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers |

# TRT-LLM Worker - Register with Tokens
await register_model(
    ModelInput.Tokens,  # Rust does minimal preprocessing
    model_type,         # ModelType.Chat or ModelType.Empty
    generate_endpoint,
    model_name,
    ...
)

Inter-Component Communication

| Transfer Stage | Message | NIXL Transfer |
| --- | --- | --- |
| Frontend → Prefill | Request with image URL or .safetensors embedding path | No |
| Prefill → Encode (Image URL) | Request with image URL | No |
| Encode → Prefill (Image URL) | ep_disaggregated_params with multimodal_embedding_handles, processed prompt, and token IDs | No |
| Prefill → Encode (Embedding Path) | Request with .safetensors embedding file path | No |
| Encode → Prefill (Embedding Path) | NIXL readable metadata + shape/dtype + auxiliary data | Yes (Embeddings tensor via RDMA) |
| Prefill → Decode | disaggregated_params with _epd_metadata (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) |

Known Limitations

  • No video support - No video encoder implementation
  • No audio support - No audio encoder implementation
  • Multimodal preprocessing/tokenization happens in Python - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker
  • Multi-node H100 limitation - Loading meta-llama/Llama-4-Maverick-17B-128E-Instruct with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (num_attention_heads: 40 not divisible by tp_size: 16)
  • llava-v1.6-mistral-7b-hf model crash - Known compatibility issue between the TRTLLM backend and TensorRT LLM version 1.2.0rc6.post1. To use the LLaVA model, download it locally from HF with revision='52320fb52229'.
  • Embeddings file crash - Known compatibility issue between the TRTLLM backend and TensorRT LLM version 1.2.0rc6.post1. Embedding file parsing crashes in attach_multimodal_embeddings(). To be fixed in the next TRTLLM upgrade.
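
The head-count limitation above is simple arithmetic: tensor parallelism splits attention heads evenly across ranks, so the TP size must divide the head count. The sketch below checks this for the 40-head case mentioned in the list; the function name is hypothetical, and head divisibility is only one of several constraints on a valid configuration.

```python
def valid_tp_sizes(num_attention_heads: int, candidates: list) -> list:
    """Return the candidate TP sizes that evenly divide the attention head count."""
    return [tp for tp in candidates if num_attention_heads % tp == 0]


# 40 heads: TP=16 leaves a remainder of 8, so it is rejected.
print(valid_tp_sizes(40, [1, 2, 4, 8, 16]))  # [1, 2, 4, 8]
```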

Supported Models

Dynamo supports the multimodal models listed under TensorRT-LLM supported models.

Common examples:

  • Llama 4 Vision models (Maverick, Scout) - Recommended for large-scale deployments
  • LLaVA models (e.g., llava-hf/llava-v1.6-mistral-7b-hf) - Default model for E/P/D examples
  • Qwen2-VL models - Supported in traditional disaggregated mode
  • Other vision-language models with TRT-LLM support

Key Files

| File | Description |
| --- | --- |
| components/src/dynamo/trtllm/main.py | Worker initialization and setup |
| components/src/dynamo/trtllm/engine.py | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) |
| components/src/dynamo/trtllm/constants.py | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) |
| components/src/dynamo/trtllm/encode_helper.py | Encode worker request processing (embedding-path and full EPD flows) |
| components/src/dynamo/trtllm/multimodal_processor.py | Multimodal request processing |
| components/src/dynamo/trtllm/request_handlers/handlers.py | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) |
| components/src/dynamo/trtllm/request_handlers/handler_base.py | Base handler with disaggregated params encoding/decoding |
| components/src/dynamo/trtllm/utils/disagg_utils.py | DisaggregatedParamsCodec for network transfer |
| components/src/dynamo/trtllm/utils/trtllm_utils.py | Command-line argument parsing |