LoRA Adapters

Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo

LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.

Backend Support

| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | βœ… | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |

See the Feature Matrix for full compatibility details.

Overview

Dynamo’s LoRA implementation provides:

  • Dynamic loading: Load and unload LoRA adapters at runtime without restarting workers
  • Multiple sources: Load from local filesystem (file://), S3-compatible storage (s3://), or Hugging Face Hub (hf://)
  • Automatic caching: Downloaded adapters are cached locally to avoid repeated downloads
  • Discovery integration: Loaded LoRAs are automatically registered and discoverable via /v1/models
  • KV-aware routing: Route requests to workers with the appropriate LoRA loaded
  • Kubernetes native: Declarative LoRA management via the DynamoModel CRD

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frontend   │──▢│    Router    │──▢│   Workers    β”‚
β”‚  /v1/models  β”‚   β”‚  LoRA-aware  β”‚   β”‚ LoRA-loaded  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                β”‚       LoRA Manager      β”‚
                                β”‚  Downloader  β”‚  Cache   β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β–Ό                   β–Ό                   β–Ό
                      file://              s3://               hf://
                      Local               S3/MinIO            (custom)

The LoRA system consists of:

  • Rust Core (lib/llm/src/lora/): High-performance downloading, caching, and validation
  • Python Manager (components/src/dynamo/common/lora/): Extensible wrapper with custom source support
  • Worker Handlers (components/src/dynamo/vllm/handlers.py): Load/unload API and inference integration

Quick Start

Prerequisites

  • Dynamo installed with vLLM support
  • For S3 sources: AWS credentials configured
  • A LoRA adapter compatible with your base model

Local Development

1. Start Dynamo with LoRA support:

# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
  python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
  --connector none \
  --enable-lora \
  --max-lora-rank 64

2. Load a LoRA adapter:

curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": {
      "uri": "file:///path/to/my-lora"
    }
  }'

3. Run inference with the LoRA:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

S3-Compatible Storage

For production deployments, store LoRA adapters in S3-compatible storage:

# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": {
      "uri": "s3://my-loras/customer-support-v1"
    }
  }'

Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DYN_LORA_ENABLED | Enable LoRA adapter support | false |
| DYN_LORA_PATH | Local cache directory for downloaded LoRAs | ~/.cache/dynamo_loras |
| AWS_ACCESS_KEY_ID | S3 access key (for s3:// URIs) | - |
| AWS_SECRET_ACCESS_KEY | S3 secret key (for s3:// URIs) | - |
| AWS_ENDPOINT | Custom S3 endpoint (for MinIO, etc.) | - |
| AWS_REGION | AWS region | us-east-1 |
| AWS_ALLOW_HTTP | Allow HTTP (non-TLS) connections | false |

vLLM Arguments

| Argument | Description |
|----------|-------------|
| --enable-lora | Enable LoRA adapter support in vLLM |
| --max-lora-rank | Maximum LoRA rank (must be >= your LoRA's rank) |
| --max-loras | Maximum number of LoRAs to load simultaneously |
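  
Because `--max-lora-rank` must cover the rank of every adapter you load, it can help to check an adapter's rank before serving it. A small sketch, assuming a standard PEFT-style `adapter_config.json` with an `r` field (that file layout is a common convention, not something Dynamo mandates):

```python
import json
from pathlib import Path

def adapter_rank(adapter_dir: str) -> int:
    """Read the LoRA rank (the `r` field) from adapter_config.json."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    return config["r"]

def fits_max_lora_rank(adapter_dir: str, max_lora_rank: int) -> bool:
    """True if the adapter can be served under the given --max-lora-rank."""
    return adapter_rank(adapter_dir) <= max_lora_rank
```

For example, an adapter trained with rank 16 fits under `--max-lora-rank 64` but not under `--max-lora-rank 8`.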

Backend API Reference

Load LoRA

Load a LoRA adapter from a source URI.

POST /v1/loras

Request:

{
  "lora_name": "string",
  "source": {
    "uri": "string"
  }
}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' loaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}

List LoRAs

List all loaded LoRA adapters.

GET /v1/loras

Response:

{
  "status": "success",
  "loras": {
    "my-lora": 1207343256,
    "another-lora": 987654321
  },
  "count": 2
}

Unload LoRA

Unload a LoRA adapter from the worker.

DELETE /v1/loras/{lora_name}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' unloaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}

Kubernetes Deployment

For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.

DynamoModel CRD

apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: dynamo-system
spec:
  modelName: customer-support-adapter-v1
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in DGD
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1

How It Works

When you create a DynamoModel, the operator:

  1. Discovers endpoints: Finds all pods running your baseModelName
  2. Creates service: Automatically creates a Kubernetes Service
  3. Loads LoRA: Calls the LoRA load API on each endpoint
  4. Updates status: Reports which endpoints are ready
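  
The status the operator reports can be thought of as a simple aggregation over endpoints. A hypothetical sketch of that bookkeeping (not the operator's actual code; the function and field names are mine):

```python
def lora_status(endpoints: dict[str, bool]) -> dict:
    """Summarize per-endpoint load results into TOTAL/READY counts,
    mirroring the columns shown by `kubectl get dynamomodel`."""
    ready = sum(1 for loaded in endpoints.values() if loaded)
    return {"total": len(endpoints), "ready": ready,
            "all_ready": ready == len(endpoints)}

# Two worker pods, both of which accepted the LoRA load call:
print(lora_status({"pod-a": True, "pod-b": True}))
```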

Verify Deployment

# Check LoRA status
kubectl get dynamomodel customer-support-lora

# Expected output:
# NAME                    TOTAL   READY   AGE
# customer-support-lora   2       2       30s

For complete Kubernetes deployment details, see the Kubernetes deployment documentation.

Examples

| Example | Description |
|---------|-------------|
| Local LoRA with MinIO | Local development with S3-compatible storage |
| Kubernetes LoRA Deployment | Production deployment with DynamoModel CRD |

Troubleshooting

LoRA Fails to Load

Check S3 connectivity:

# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive

Check cache directory:

ls -la ~/.cache/dynamo_loras/

Check worker logs:

# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora

Model Not Found After Loading

  • Verify the LoRA name matches exactly (case-sensitive)
  • Check if the LoRA is listed: curl http://localhost:8081/v1/loras
  • Ensure discovery registration succeeded (check worker logs)

Inference Returns Base Model Response

  • Verify the model field in your request matches the lora_name
  • Check that the LoRA is loaded on the worker handling your request
  • For disaggregated serving, ensure both prefill and decode workers have the LoRA

See Also