
Prompt Embeddings


Dynamo supports prompt embeddings (also known as prompt embeds) as an alternative input method to plain-text prompts. Applications send pre-computed embeddings for inference instead of raw text, which offers greater flexibility in prompt engineering and improves privacy and data security: sensitive text can be transformed into embeddings before it ever reaches the inference server, reducing the risk of exposing confidential information in the serving pipeline.

How It Works

| Path | What Happens |
| --- | --- |
| Text prompt | Tokenize → Embedding Layer → Transformer |
| Prompt embeds | Validate → Bypass Embedding → Transformer |

Architecture

| Layer | Normal Flow | Prompt Embeds |
| --- | --- | --- |
| Frontend (Rust) | 🔴 Tokenize text → `token_ids`, compute ISL (input sequence length) | 🟢 Validate base64 + size, skip tokenization |
| Router (NATS) | Forward `token_ids` in `PreprocessedRequest` | Forward `prompt_embeds` string |
| Worker (Python) | `TokensPrompt(token_ids)` | Decode base64 → `EmbedsPrompt(tensor)` |
| vLLM Engine | 🔴 Embedding Layer → Transformer | 🟢 Bypass Embedding → Transformer |
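
For illustration, here is a minimal sketch of the worker-side decode step from the table. The helper name and the `weights_only=True` safeguard are our assumptions, not Dynamo's actual implementation:

```python
import base64
import io

import torch

def decode_prompt_embeds(embeds_b64: str) -> torch.Tensor:
    """Reverse the client-side encoding: base64 string → bytes → tensor."""
    raw = base64.b64decode(embeds_b64, validate=True)
    # weights_only=True restricts torch.load to tensor data, guarding
    # against arbitrary pickle execution on untrusted payloads
    tensor = torch.load(io.BytesIO(raw), weights_only=True)
    if not isinstance(tensor, torch.Tensor):
        raise ValueError("prompt_embeds must decode to a torch.Tensor")
    return tensor  # the worker hands this to vLLM as an EmbedsPrompt
```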

Quick Start

Send pre-computed prompt embeddings directly to vLLM, bypassing tokenization.

1. Enable Feature

```bash
python -m dynamo.vllm --model <model-name> --enable-prompt-embeds
```

Required: the `--enable-prompt-embeds` flag must be set, or requests that use `prompt_embeds` will fail.

2. Send Request

```python
import torch
import base64
import io
from openai import OpenAI

# Prepare embeddings (sequence_length, hidden_dim)
embeddings = torch.randn(10, 4096, dtype=torch.float32)

# Encode as a base64 string
buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()

# Send
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="",  # Can be empty or present; prompt_embeds takes precedence
    max_tokens=100,
    extra_body={"prompt_embeds": embeddings_base64},
)
```
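
The response is a standard OpenAI completion object, so the generated text is read as usual:

```python
print(response.choices[0].text)
```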

Configuration

Docker Compose

```yaml
vllm-worker:
  command:
    - python
    - -m
    - dynamo.vllm
    - --model
    - meta-llama/Meta-Llama-3.1-8B-Instruct
    - --enable-prompt-embeds  # Add this
```

Kubernetes

```yaml
extraPodSpec:
  mainContainer:
    args:
      - "--model"
      - "meta-llama/Meta-Llama-3.1-8B-Instruct"
      - "--enable-prompt-embeds"  # Add this
```

NATS Configuration

NATS needs a 15 MB payload limit (already configured in default deployments):

```yaml
# Docker Compose - deploy/docker-compose.yml
nats-server:
  command: ["-js", "--trace", "-m", "8222", "--max_payload", "15728640"]

# Kubernetes - deploy/cloud/helm/platform/values.yaml
nats:
  config:
    merge:
      max_payload: 15728640
```
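
As a rough sanity check on that number (approximate; the JSON envelope adds some overhead beyond the raw payload):

```python
decoded_limit = 10 * 1024 * 1024       # 10 MB decoded tensor limit
encoded_size = decoded_limit * 4 // 3  # base64 inflates payloads by ~4/3

assert encoded_size < 15728640  # ~13.3 MB fits under the 15 MB max_payload
```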

API Reference

Request

```json
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "prompt": "",
  "prompt_embeds": "<base64-encoded-pytorch-tensor>",
  "max_tokens": 100
}
```

Requirements:

  • Format: PyTorch tensor serialized with `torch.save()` and base64-encoded
  • Size: 100 bytes to 10 MB (decoded)
  • Shape: `(seq_len, hidden_dim)` or `(batch, seq_len, hidden_dim)`
  • Dtype: `torch.float32` (recommended)
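
A small client-side helper that enforces these requirements before sending (a sketch; `encode_prompt_embeds` is an illustrative name, not a Dynamo API):

```python
import base64
import io

import torch

MIN_BYTES = 100                # documented minimum decoded size
MAX_BYTES = 10 * 1024 * 1024   # documented 10 MB decoded limit

def encode_prompt_embeds(embeddings: torch.Tensor) -> str:
    """Serialize a (seq_len, hidden_dim) tensor into a base64 payload."""
    buffer = io.BytesIO()
    torch.save(embeddings.to(torch.float32), buffer)  # recommended dtype
    raw = buffer.getvalue()
    if not MIN_BYTES <= len(raw) <= MAX_BYTES:
        raise ValueError(
            f"serialized payload is {len(raw)} bytes; "
            "must be between 100 bytes and 10 MB"
        )
    return base64.b64encode(raw).decode("utf-8")
```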

Field Precedence:

  • Both `prompt` and `prompt_embeds` can be provided in the same request
  • When both are present, `prompt_embeds` takes precedence and `prompt` is ignored (see the example after this list)
  • The `prompt` field can be empty (`""`) when using `prompt_embeds`
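
For example, reusing `client` and `embeddings_base64` from the Quick Start, the text prompt below is silently ignored:

```python
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="This text is ignored",  # prompt_embeds wins when both are set
    max_tokens=100,
    extra_body={"prompt_embeds": embeddings_base64},
)
```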

Response

Responses use the standard OpenAI format, with accurate usage accounting:

```json
{
  "usage": {
    "prompt_tokens": 10,       // Extracted from embedding shape
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
```
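
For a `(seq_len, hidden_dim)` tensor, `prompt_tokens` should equal `seq_len`. Reusing the Quick Start objects:

```python
# embeddings was torch.randn(10, 4096), so the prompt counts as 10 tokens
assert response.usage.prompt_tokens == embeddings.shape[0]
```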

Errors

| Error | Fix |
| --- | --- |
| `ValueError: You must set --enable-prompt-embeds` | Add `--enable-prompt-embeds` to the worker command |
| `prompt_embeds must be valid base64` | Use `.decode('utf-8')` after `base64.b64encode()` |
| `decoded data must be at least 100 bytes` | Increase sequence length |
| `exceeds maximum size of 10MB` | Reduce sequence length |
| `must be a torch.Tensor` | Use `torch.save()`, not NumPy |
| `size of tensor must match` | Use the correct hidden dimension for the model |

Examples

Streaming

```python
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="",
    max_tokens=100,
    stream=True,
    extra_body={"prompt_embeds": embeddings_base64},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="", flush=True)
```

Load from File

```python
import base64
import io
import torch

# Load a previously saved tensor, then re-serialize it for the request
embeddings = torch.load("embeddings.pt")

buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()

# Use in request...
```
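
To produce real embeddings worth saving in the first place, one option is to run text through the model's own input embedding layer. A sketch assuming the Hugging Face `transformers` library (not part of Dynamo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Token IDs → (seq_len, hidden_dim) tensor from the input embedding layer
token_ids = tokenizer("Hello, world!", return_tensors="pt").input_ids
with torch.no_grad():
    embeddings = model.get_input_embeddings()(token_ids).squeeze(0)

torch.save(embeddings.to(torch.float32), "embeddings.pt")
```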

Limitations

  • ❌ Requires the `--enable-prompt-embeds` flag (disabled by default)
  • ❌ PyTorch format only (NumPy is not supported)
  • ❌ 10 MB decoded size limit
  • ❌ Cannot be mixed with multimodal data (images/video)

Testing

Comprehensive test coverage ensures reliability:

  • Unit Tests: 31 tests (11 Rust + 20 Python)
    • Validation, decoding, format handling, error cases, usage statistics
  • Integration Tests: 21 end-to-end tests
    • Core functionality, performance, formats, concurrency, usage statistics

Run integration tests:

```bash
# Start worker with flag
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enable-prompt-embeds

# Run tests
pytest tests/integration/test_prompt_embeds_integration.py -v
```
