LoRA Adapters
Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
Backend Support
See the Feature Matrix for full compatibility details.
Overview
Dynamo's LoRA implementation provides:
- Dynamic loading: Load and unload LoRA adapters at runtime without restarting workers
- Multiple sources: Load from local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`)
- Automatic caching: Downloaded adapters are cached locally to avoid repeated downloads
- Discovery integration: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- KV-aware routing: Route requests to workers with the appropriate LoRA loaded
- Kubernetes native: Declarative LoRA management via the `DynamoModel` CRD
Architecture
The LoRA system consists of:
- Rust Core (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- Python Manager (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- Worker Handlers (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration
Quick Start
Prerequisites
- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model
Local Development
1. Start Dynamo with LoRA support:
2. Load a LoRA adapter:
3. Run inference with the LoRA:
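As a sketch, assuming a vLLM-backed worker, the three steps above might look like the following. The launch command, its flags, and the load endpoint path are assumptions; the `/v1/loras` listing endpoint on port 8081 is mentioned in Troubleshooting below, and selecting a LoRA by passing its name in the `model` field follows from the Troubleshooting notes as well:

```shell
# 1. Start Dynamo with LoRA support (command and flags are assumptions;
#    check your installation's --help output)
python -m dynamo.vllm --model meta-llama/Llama-3.1-8B-Instruct --enable-lora &

# 2. Load a LoRA adapter from the local filesystem
#    (POST path inferred from the /v1/loras listing endpoint)
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-lora", "source": "file:///path/to/adapter"}'

# 3. Run inference, selecting the LoRA by name in the model field
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-lora", "messages": [{"role": "user", "content": "Hello"}]}'
```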
S3-Compatible Storage
For production deployments, store LoRA adapters in S3-compatible storage:
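A hedged example: the credential variables below are the standard AWS SDK ones, `AWS_ENDPOINT_URL` is the standard SDK override for S3-compatible endpoints such as MinIO, and the `s3://` source scheme comes from the Overview; the load endpoint path and bucket layout are assumptions:

```shell
# Credentials the worker uses to reach S3-compatible storage
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
export AWS_ENDPOINT_URL=https://minio.internal:9000   # only for non-AWS endpoints

# Load an adapter directly from S3 (endpoint path is an assumption)
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-lora", "source": "s3://my-bucket/loras/my-lora"}'
```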
Configuration
Environment Variables
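As a sketch of the kind of variables involved: the first is vLLM's real switch for runtime LoRA updates; the cache-directory variable name is a hypothetical placeholder for whatever your deployment configures:

```shell
# Allow the vLLM engine to load/unload LoRA adapters at runtime (vLLM env var)
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
# Where downloaded adapters are cached locally (variable name is hypothetical)
export DYN_LORA_CACHE_DIR=/tmp/dynamo-lora-cache
```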
vLLM Arguments
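The flags below are standard vLLM engine arguments for LoRA serving; how Dynamo forwards them to the engine, and the launch command itself, are assumptions:

```shell
# --enable-lora, --max-loras, and --max-lora-rank are standard vLLM engine
# arguments; the dynamo.vllm entrypoint shown here is an assumption
python -m dynamo.vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64
```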
Backend API Reference
Load LoRA
Load a LoRA adapter from a source URI.
Request:
Response:
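A hypothetical invocation sketch: the management port and `/v1/loras` path are inferred from the Troubleshooting section, `lora_name` is named elsewhere on this page, and the remaining field names are assumptions:

```shell
# Load an adapter by POSTing its name and source URI (field names beyond
# lora_name are assumptions)
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-lora", "source": "hf://org/my-lora-adapter"}'
```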
List LoRAs
List all loaded LoRA adapters.
Response:
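The listing call itself appears verbatim in the Troubleshooting section; the response shape is not reproduced here:

```shell
# List all LoRA adapters currently loaded on this worker
curl http://localhost:8081/v1/loras
```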
Unload LoRA
Unload a LoRA adapter from the worker.
Response:
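A hypothetical sketch, assuming a REST-style DELETE on the same `/v1/loras` path used for listing; the exact method and path layout are assumptions:

```shell
# Unload the adapter named "my-lora" from this worker (method/path assumed)
curl -X DELETE http://localhost:8081/v1/loras/my-lora
```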
Kubernetes Deployment
For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.
DynamoModel CRD
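A hypothetical manifest sketch: the `DynamoModel` kind and the `baseModelName` field are named on this page, while the `apiVersion` and the remaining spec fields are assumptions:

```yaml
apiVersion: nvidia.com/v1alpha1          # assumption
kind: DynamoModel
metadata:
  name: my-lora
spec:
  baseModelName: meta-llama/Llama-3.1-8B-Instruct   # pods serving this base model
  source: s3://my-bucket/loras/my-lora              # field name is an assumption
```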
How It Works
When you create a DynamoModel, the operator:
- Discovers endpoints: Finds all pods running your `baseModelName`
- Creates service: Automatically creates a Kubernetes Service
- Loads LoRA: Calls the LoRA load API on each endpoint
- Updates status: Reports which endpoints are ready
Verify Deployment
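A sketch of the checks, assuming the CRD is addressable by its kind name (the resource name and status layout are assumptions):

```shell
# List DynamoModel resources and inspect one; per the How It Works section,
# the status reports which endpoints are ready
kubectl get dynamomodels
kubectl describe dynamomodel my-lora
```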
For complete Kubernetes deployment details, see:
Examples
Troubleshooting
LoRA Fails to Load
Check S3 connectivity:
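For example, with the AWS CLI and the same credentials the worker uses (bucket and prefix are illustrative):

```shell
# Confirm the adapter objects are reachable from this environment
aws s3 ls s3://my-bucket/loras/ --endpoint-url "$AWS_ENDPOINT_URL"
```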
Check cache directory:
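The path below is a hypothetical example; substitute whatever cache location your deployment configures:

```shell
# Verify the adapter was downloaded and the worker can read it
ls -la /tmp/dynamo-lora-cache
```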
Check worker logs:
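For a Kubernetes deployment this might look like the following (the label selector is an assumption):

```shell
# Filter worker logs for LoRA load/unload events and errors
kubectl logs -l app=vllm-worker | grep -i lora
```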
Model Not Found After Loading
- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)
Inference Returns Base Model Response
- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA
See Also
- Feature Matrix - Backend compatibility overview
- vLLM Backend - vLLM-specific configuration
- Dynamo Operator - Kubernetes operator overview
- KV-Aware Routing - LoRA-aware request routing