# Prompt Embeddings
Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative to plain-text prompts: applications send pre-computed embedding tensors for inference instead of text. Beyond the flexibility this gives prompt engineering, it improves privacy and data security, since sensitive user data can be transformed into embeddings before it ever reaches the inference server, reducing the risk of exposing confidential information in the serving pipeline.
## How It Works

### Architecture

Clients serialize an embedding tensor with `torch.save()`, base64-encode the bytes, and send the result in the `prompt_embeds` field of an otherwise standard OpenAI-style request. The frontend forwards the payload to a vLLM worker over NATS; the worker decodes the tensor and feeds it to the model in place of the usual token-embedding lookup.
## Quick Start
Send pre-computed prompt embeddings directly to vLLM, bypassing tokenization.
### 1. Enable Feature
**Required:** The `--enable-prompt-embeds` flag must be set or requests will fail.
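A launch sketch, assuming the vLLM backend is started via `python -m dynamo.vllm` and forwards engine flags to vLLM (the model name is a placeholder):

```bash
# Assumption: dynamo.vllm passes --enable-prompt-embeds through to the
# vLLM engine; swap in your own model.
python -m dynamo.vllm --model Qwen/Qwen2.5-0.5B-Instruct --enable-prompt-embeds
```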
### 2. Send Request
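A minimal client sketch, assuming the frontend listens on `localhost:8000` and exposes the OpenAI-compatible `/v1/completions` route (the URL and model name are placeholders); the encoding steps follow the requirements under API Reference below:

```python
import base64
import io

import requests
import torch

# Illustrative embeddings: a (seq_len, hidden_dim) float32 tensor. In
# practice these come from a model's embedding layer (see API Reference).
embeds = torch.randn(16, 4096, dtype=torch.float32)

# Serialize with torch.save() and base64-encode, per the wire format.
buf = io.BytesIO()
torch.save(embeds, buf)
payload = base64.b64encode(buf.getvalue()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",      # placeholder
        "prompt": "",             # may be empty when prompt_embeds is set
        "prompt_embeds": payload,
        "max_tokens": 32,
    },
)
print(resp.json())
```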
## Configuration
### Docker Compose
### Kubernetes
### NATS Configuration

NATS must accept payloads of at least 15 MB; default deployments already configure this. The headroom follows from the request limits below: base64 encoding inflates the 10 MB decoded maximum to roughly 13.3 MB on the wire, plus request overhead.
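For reference, a sketch of that limit in a NATS server configuration file (the exact file and how it is mounted depend on your deployment):

```
# nats-server.conf (assumption: limit is set via max_payload)
max_payload: 15728640  # 15 MB
```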
## API Reference
### Request
**Requirements:**

- **Format:** PyTorch tensor serialized with `torch.save()` and base64-encoded (see the encoding sketch after this list)
- **Size:** 100 bytes to 10 MB (decoded)
- **Shape:** `(seq_len, hidden_dim)` or `(batch, seq_len, hidden_dim)`
- **Dtype:** `torch.float32` (recommended)

**Field Precedence:**

- Both `prompt` and `prompt_embeds` can be provided in the same request
- When both are present, `prompt_embeds` takes precedence and `prompt` is ignored
- The `prompt` field can be empty (`""`) when using `prompt_embeds`
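As an illustration of the format above, this sketch derives embeddings from a Hugging Face model's input-embedding table and encodes them; the model name is a placeholder, and any model whose hidden size matches the served model would do:

```python
import base64
import io

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    # (1, seq_len, hidden_dim) -> (seq_len, hidden_dim)
    embeds = model.get_input_embeddings()(ids).squeeze(0).to(torch.float32)

buf = io.BytesIO()
torch.save(embeds, buf)                              # required wire format
encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
```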
### Response

Responses use the standard OpenAI format, with accurate `usage` statistics (prompt token counts reflect the embedded sequence rather than tokenized text).
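Continuing the Quick Start sketch, the fields of interest (the printed usage values are illustrative):

```python
data = resp.json()  # `resp` from the Quick Start sketch
print(data["choices"][0]["text"])
print(data["usage"])  # e.g. {"prompt_tokens": 16, "completion_tokens": 32, "total_tokens": 48}
```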
### Errors

Requests that violate the constraints above (flag not enabled, malformed base64, size out of range, unsupported format) are rejected with an error response.
## Examples
### Streaming
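A streaming sketch under the same assumptions as the Quick Start example (endpoint, port, and model name are placeholders), assuming OpenAI-style server-sent events:

```python
import base64
import io
import json

import requests
import torch

# Same encoding as the Quick Start sketch.
embeds = torch.randn(16, 4096, dtype=torch.float32)
buf = io.BytesIO()
torch.save(embeds, buf)
payload = base64.b64encode(buf.getvalue()).decode("utf-8")

with requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "my-model", "prompt_embeds": payload,
          "max_tokens": 32, "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # OpenAI-style SSE lines look like b"data: {...}" / b"data: [DONE]".
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)
print()
```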
### Load from File
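Because the wire format is exactly the output of `torch.save()`, a tensor file saved earlier can be base64-encoded as-is; the file name here is illustrative:

```python
import base64

import torch

# Optional sanity check against the Request requirements above.
embeds = torch.load("embeds.pt", map_location="cpu")
assert embeds.dtype == torch.float32 and embeds.dim() in (2, 3)

# The file already contains torch.save() output, so its raw bytes are
# exactly what the prompt_embeds field expects after base64 encoding.
with open("embeds.pt", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
```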
## Limitations
- ❌ Requires `--enable-prompt-embeds` flag (disabled by default)
- ❌ PyTorch format only (NumPy not supported)
- ❌ 10 MB decoded size limit
- ❌ Cannot mix with multimodal data (images/video)
## Testing

Comprehensive test coverage ensures reliability:

- **Unit Tests:** 31 tests (11 Rust + 20 Python)
  - Validation, decoding, format handling, error cases, usage statistics
- **Integration Tests:** 21 end-to-end tests
  - Core functionality, performance, formats, concurrency, usage statistics
Run integration tests:
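A typical invocation (the test path and marker are assumptions; adjust to the repository layout):

```bash
pytest tests/ -m integration -v
```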