LoRA Adapters

Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo

LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.

Backend Support

| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | βœ… | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |

See the Feature Matrix for full compatibility details.

Overview

Dynamo’s LoRA implementation provides:

  • Dynamic loading: Load and unload LoRA adapters at runtime without restarting workers
  • Multiple sources: Load from local filesystem (file://), S3-compatible storage (s3://), or Hugging Face Hub (hf://)
  • Automatic caching: Downloaded adapters are cached locally to avoid repeated downloads
  • Discovery integration: Loaded LoRAs are automatically registered and discoverable via /v1/models
  • KV-aware routing: Route requests to workers with the appropriate LoRA loaded
  • Kubernetes native: Declarative LoRA management via the DynamoModel CRD

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frontend   │──▢│    Router    │──▢│   Workers    β”‚
β”‚  /v1/models  β”‚   β”‚  LoRA-aware  β”‚   β”‚ LoRA-loaded  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                β”‚       LoRA Manager      β”‚
                                β”‚  Downloader  β”‚  Cache   β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β–Ό                   β–Ό                   β–Ό
                      file://              s3://               hf://
                      Local               S3/MinIO            (custom)

The LoRA system consists of:

  • Rust Core (lib/llm/src/lora/): High-performance downloading, caching, and validation
  • Python Manager (components/src/dynamo/common/lora/): Extensible wrapper with custom source support
  • Worker Handlers (components/src/dynamo/vllm/handlers.py): Load/unload API and inference integration

Quick Start

Prerequisites

  • Dynamo installed with vLLM support
  • For S3 sources: AWS credentials configured
  • A LoRA adapter compatible with your base model

Local Development

1. Start Dynamo with LoRA support:

# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
  python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
  --connector none \
  --enable-lora \
  --max-lora-rank 64

2. Load a LoRA adapter:

curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": {
      "uri": "file:///path/to/my-lora"
    }
  }'

3. Run inference with the LoRA:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

S3-Compatible Storage

For production deployments, store LoRA adapters in S3-compatible storage:

# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": {
      "uri": "s3://my-loras/customer-support-v1"
    }
  }'

Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DYN_LORA_ENABLED | Enable LoRA adapter support | false |
| DYN_LORA_PATH | Local cache directory for downloaded LoRAs | ~/.cache/dynamo_loras |
| AWS_ACCESS_KEY_ID | S3 access key (for s3:// URIs) | - |
| AWS_SECRET_ACCESS_KEY | S3 secret key (for s3:// URIs) | - |
| AWS_ENDPOINT | Custom S3 endpoint (for MinIO, etc.) | - |
| AWS_REGION | AWS region | us-east-1 |
| AWS_ALLOW_HTTP | Allow HTTP (non-TLS) connections | false |

vLLM Arguments

| Argument | Description |
|----------|-------------|
| --enable-lora | Enable LoRA adapter support in vLLM |
| --max-lora-rank | Maximum LoRA rank (must be >= your LoRA's rank) |
| --max-loras | Maximum number of LoRAs to load simultaneously |
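  
Because `--max-lora-rank` must cover the rank of every adapter you load, it can help to check an adapter's rank before serving it. A small sketch, assuming a standard PEFT-style `adapter_config.json` with an `r` field (that file layout is a common convention, not something Dynamo mandates):

```python
import json
from pathlib import Path

def adapter_rank(adapter_dir: str) -> int:
    """Read the LoRA rank (the `r` field) from adapter_config.json."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    return config["r"]

def fits_max_lora_rank(adapter_dir: str, max_lora_rank: int) -> bool:
    """True if the adapter can be served under the given --max-lora-rank."""
    return adapter_rank(adapter_dir) <= max_lora_rank
```

For example, an adapter trained with rank 16 fits under `--max-lora-rank 64` but not under `--max-lora-rank 8`.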

Backend API Reference

Load LoRA

Load a LoRA adapter from a source URI.

POST /v1/loras

Request:

{
  "lora_name": "string",
  "source": {
    "uri": "string"
  }
}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' loaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}

List LoRAs

List all loaded LoRA adapters.

GET /v1/loras

Response:

{
  "status": "success",
  "loras": {
    "my-lora": 1207343256,
    "another-lora": 987654321
  },
  "count": 2
}

Unload LoRA

Unload a LoRA adapter from the worker.

DELETE /v1/loras/{lora_name}

Response:

{
  "status": "success",
  "message": "LoRA adapter 'my-lora' unloaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}

Kubernetes Deployment

For Kubernetes deployments, use the DynamoModel Custom Resource to declaratively manage LoRA adapters.

DynamoModel CRD

apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: dynamo-system
spec:
  modelName: customer-support-adapter-v1
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in DGD
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1

How It Works

When you create a DynamoModel, the operator:

  1. Discovers endpoints: Finds all pods running your baseModelName
  2. Creates service: Automatically creates a Kubernetes Service
  3. Loads LoRA: Calls the LoRA load API on each endpoint
  4. Updates status: Reports which endpoints are ready
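  
The status the operator reports can be thought of as a simple aggregation over endpoints. A hypothetical sketch of that bookkeeping (not the operator's actual code; the function and field names are mine):

```python
def lora_status(endpoints: dict[str, bool]) -> dict:
    """Summarize per-endpoint load results into TOTAL/READY counts,
    mirroring the columns shown by `kubectl get dynamomodel`."""
    ready = sum(1 for loaded in endpoints.values() if loaded)
    return {"total": len(endpoints), "ready": ready,
            "all_ready": ready == len(endpoints)}

# Two worker pods, both of which accepted the LoRA load call:
print(lora_status({"pod-a": True, "pod-b": True}))
```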

Verify Deployment

# Check LoRA status
kubectl get dynamomodel customer-support-lora

# Expected output:
# NAME                    TOTAL   READY   AGE
# customer-support-lora   2       2       30s

For complete Kubernetes deployment details, see the Kubernetes deployment documentation.

Examples

| Example | Description |
|---------|-------------|
| Local LoRA with MinIO | Local development with S3-compatible storage |
| Kubernetes LoRA Deployment | Production deployment with DynamoModel CRD |

Troubleshooting

LoRA Fails to Load

Check S3 connectivity:

# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive

Check cache directory:

ls -la ~/.cache/dynamo_loras/

Check worker logs:

# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora

Model Not Found After Loading

  • Verify the LoRA name matches exactly (case-sensitive)
  • Check if the LoRA is listed: curl http://localhost:8081/v1/loras
  • Ensure discovery registration succeeded (check worker logs)

Inference Returns Base Model Response

  • Verify the model field in your request matches the lora_name
  • Check that the LoRA is loaded on the worker handling your request
  • For disaggregated serving, ensure both prefill and decode workers have the LoRA

See Also