---
subtitle: Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo
---

# LoRA Adapters

LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.

## Backend Support

| Backend | Status | Notes |
|---------|--------|-------|
| vLLM | βœ… | Full support including KV-aware routing |
| SGLang | 🚧 | In progress |
| TensorRT-LLM | ❌ | Not yet supported |

See the [Feature Matrix](/dynamo/v-0-9-0/getting-started/feature-matrix) for full compatibility details.

## Overview

Dynamo's LoRA implementation provides:

- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`)
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD

### Architecture

```text
                                LoRA Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Frontend   │────▢│    Router    │────▢│   Workers    β”‚
β”‚  /v1/models  β”‚     β”‚  LoRA-aware  β”‚     β”‚ LoRA-loaded  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚
                                                  β–Ό
                                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                   β”‚          LoRA Manager           β”‚
                                   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                                   β”‚  β”‚ Downloaderβ”‚  β”‚    Cache    β”‚ β”‚
                                   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β–Ό                        β–Ό                        β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  file://   β”‚             β”‚   s3://    β”‚            β”‚  hf://  β”‚
                  β”‚   Local    β”‚             β”‚  S3/MinIO  β”‚            β”‚(custom) β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The LoRA system consists of:

- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration

## Quick Start

### Prerequisites

- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model

### Local Development
**1. Start Dynamo with LoRA support:**

```bash
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
    --connector none \
    --enable-lora \
    --max-lora-rank 64
```

**2. Load a LoRA adapter:**

```bash
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": {
      "uri": "file:///path/to/my-lora"
    }
  }'
```

**3. Run inference with the LoRA:**

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

### S3-Compatible Storage

For production deployments, store LoRA adapters in S3-compatible storage:

```bash
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": {
      "uri": "s3://my-loras/customer-support-v1"
    }
  }'
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_LORA_ENABLED` | Enable LoRA adapter support | `false` |
| `DYN_LORA_PATH` | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` URIs) | - |
| `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` URIs) | - |
| `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - |
| `AWS_REGION` | AWS region | `us-east-1` |
| `AWS_ALLOW_HTTP` | Allow HTTP (non-TLS) connections | `false` |

### vLLM Arguments

| Argument | Description |
|----------|-------------|
| `--enable-lora` | Enable LoRA adapter support in vLLM |
| `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) |
| `--max-loras` | Maximum number of LoRAs to load simultaneously |

## Backend API Reference

### Load LoRA

Load a LoRA adapter from a source URI.

```text
POST /v1/loras
```

**Request:**

```json
{
  "lora_name": "string",
  "source": {
    "uri": "string"
  }
}
```

**Response:**

```json
{
  "status": "success",
  "message": "LoRA adapter 'my-lora' loaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}
```

### List LoRAs

List all loaded LoRA adapters.

```text
GET /v1/loras
```

**Response:**

```json
{
  "status": "success",
  "loras": {
    "my-lora": 1207343256,
    "another-lora": 987654321
  },
  "count": 2
}
```

### Unload LoRA

Unload a LoRA adapter from the worker.

```text
DELETE /v1/loras/{lora_name}
```

**Response:**

```json
{
  "status": "success",
  "message": "LoRA adapter 'my-lora' unloaded successfully",
  "lora_name": "my-lora",
  "lora_id": 1207343256
}
```

## Kubernetes Deployment

For Kubernetes deployments, use the `DynamoModel` Custom Resource to declaratively manage LoRA adapters.

### DynamoModel CRD

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: dynamo-system
spec:
  modelName: customer-support-adapter-v1
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name in DGD
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```

### How It Works

When you create a `DynamoModel`, the operator:

1. **Discovers endpoints**: Finds all pods running your `baseModelName`
2. **Creates service**: Automatically creates a Kubernetes Service
3. **Loads LoRA**: Calls the LoRA load API on each endpoint
4. **Updates status**: Reports which endpoints are ready

### Verify Deployment

```bash
# Check LoRA status
kubectl get dynamomodel customer-support-lora

# Expected output:
# NAME                    TOTAL   READY   AGE
# customer-support-lora   2       2       30s
```

For complete Kubernetes deployment details, see:

- [Managing Models with DynamoModel](/dynamo/v-0-9-0/kubernetes-deployment/deployment-guide/managing-models-with-dynamo-model)
- [Kubernetes LoRA Deployment Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md)

## Examples

| Example | Description |
|---------|-------------|
| [Local LoRA with MinIO](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | Local development with S3-compatible storage |
| [Kubernetes LoRA Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/lora/README.md) | Production deployment with DynamoModel CRD |

## Troubleshooting

### LoRA Fails to Load

**Check S3 connectivity:**

```bash
# Verify LoRA exists in S3
aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive
```

**Check cache directory:**

```bash
ls -la ~/.cache/dynamo_loras/
```

**Check worker logs:**

```bash
# Look for LoRA-related messages
kubectl logs deployment/my-worker | grep -i lora
```

### Model Not Found After Loading

- Verify the LoRA name matches exactly (case-sensitive)
- Check if the LoRA is listed: `curl http://localhost:8081/v1/loras`
- Ensure discovery registration succeeded (check worker logs)

### Inference Returns Base Model Response

- Verify the `model` field in your request matches the `lora_name`
- Check that the LoRA is loaded on the worker handling your request
- For disaggregated serving, ensure both prefill and decode workers have the LoRA

## See Also

- [Feature Matrix](/dynamo/v-0-9-0/getting-started/feature-matrix) - Backend compatibility overview
- [vLLM Backend](/dynamo/v-0-9-0/components/backends/v-llm) - vLLM-specific configuration
- [Dynamo Operator](/dynamo/v-0-9-0/kubernetes-deployment/deployment-guide/dynamo-operator) - Kubernetes operator overview
- [KV-Aware Routing](/dynamo/v-0-9-0/components/router/router-guide) - LoRA-aware request routing
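## Appendix: Scripting the LoRA Lifecycle

The load and inference steps from the quick start can also be driven from a script. Below is a minimal client sketch using only the Python standard library. The `/v1/loras` and `/v1/chat/completions` paths and the two ports (`8081` for the worker system port, `8000` for the frontend) come from this guide; the helper function names, the example adapter name, and the S3 URI are illustrative assumptions, not part of any Dynamo SDK.

```python
# Minimal LoRA lifecycle client sketch (stdlib only).
# Paths and ports follow this guide; helper names are illustrative.
import json
import urllib.request

MGMT_URL = "http://localhost:8081"      # worker system port (DYN_SYSTEM_PORT)
FRONTEND_URL = "http://localhost:8000"  # OpenAI-compatible frontend


def load_lora_body(lora_name: str, uri: str) -> dict:
    """Request body for POST /v1/loras: load an adapter from a source URI."""
    return {"lora_name": lora_name, "source": {"uri": uri}}


def chat_body(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body for POST /v1/chat/completions, addressing a LoRA by name."""
    return {
        "model": model,  # must match the lora_name used at load time
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_json(url: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def demo() -> None:
    """Run against a live deployment: load an adapter, then infer with it."""
    print(post_json(f"{MGMT_URL}/v1/loras",
                    load_lora_body("my-lora", "s3://my-loras/customer-support-v1")))
    print(post_json(f"{FRONTEND_URL}/v1/chat/completions",
                    chat_body("my-lora", "Hello!")))
```

Call `demo()` only against a running deployment; note that the `model` field of the inference request is the adapter's `lora_name`, which is what routes the request to a worker with that LoRA loaded rather than to the base model.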