Integration with Dynamo
Checkpoint/Restore for Fast Pod Startup
⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.
Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.
Overview
Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
Prerequisites
- Dynamo Platform installed (v0.4.0+)
- ChReK Helm chart installed (separate from platform)
- GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
- RWX PVC storage (PVC is currently the only supported backend)
Quick Start
1. Install ChReK Infrastructure
First, install the ChReK Helm chart in each namespace where you need checkpointing:
This creates:
- A PVC for checkpoint storage (chrek-pvc)
- A DaemonSet for CRIU operations (chrek-agent)
2. Configure Operator Values
Update your Helm values to point to the ChReK infrastructure:
3. Configure Your DGD
Add checkpoint configuration to your service:
4. Deploy
On first deployment:
- A checkpoint job runs to create the checkpoint
- Worker pods perform a cold start, because the checkpoint is not ready yet
- Once the checkpoint is ready, new pods (scale-ups, restarts) restore from it
Storage Backends
PVC (Currently Supported)
Use when you have RWX storage available (e.g., NFS, EFS, Filestore).
Requirements:
- RWX (ReadWriteMany) PVC for multi-node access
- Sufficient storage (checkpoints are ~10-50GB per model)
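As a rough sizing aid, the per-model figure above can be turned into a PVC capacity estimate. The sketch below is illustrative only: the function name, the 20% headroom factor, and the choice of the upper end of the range as the default are assumptions, not values taken from ChReK.

```python
import math


def estimate_pvc_size_gb(num_checkpoints: int,
                         gb_per_checkpoint: float = 50.0,
                         headroom: float = 1.2) -> int:
    """Return a rounded-up PVC size in GB for the given number of checkpoints.

    gb_per_checkpoint defaults to the upper end of the ~10-50 GB range.
    headroom (an assumed 20%) leaves room for checkpoints still being written.
    """
    if num_checkpoints < 0:
        raise ValueError("num_checkpoints must be non-negative")
    return math.ceil(num_checkpoints * gb_per_checkpoint * headroom)


if __name__ == "__main__":
    # Three distinct checkpoints at the assumed worst-case size.
    print(estimate_pvc_size_gb(3))
```

Measure the size of a real checkpoint for your model before settling on a value, since actual sizes depend on the model and its GPU memory footprint.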
S3 / MinIO (Planned - Not Yet Implemented)
⚠️ Note: S3 storage backend is defined in the API but not yet fully implemented.
Object storage support is planned for a future release. The configuration will look like:
OCI Registry (Planned - Not Yet Implemented)
⚠️ Note: OCI registry storage backend is defined in the API but not yet fully implemented.
Container registry storage support is planned for a future release. The configuration will look like:
Checkpoint Modes
Auto Mode (Recommended)
The operator automatically creates a DynamoCheckpoint CR if one doesn’t exist:
Reference Mode
Reference an existing DynamoCheckpoint CR by its 16-character hash using checkpointRef:
This is useful when:
- You want to pre-warm checkpoints before creating DGDs
- You want explicit control over which checkpoint to use
Flow:
- Create a DynamoCheckpoint CR (see DynamoCheckpoint CRD section)
- Wait for it to become Ready
- Reference it in your DGD using checkpointRef with the hash
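The "wait for it to become Ready" step can be automated with a small polling loop. The following is a minimal sketch, not part of ChReK: `wait_until_ready` and its parameters are hypothetical, and `get_phase` is a caller-supplied function that returns the checkpoint's current phase as a string, however you choose to read it from the cluster. Only the `Ready` phase name comes from this document.

```python
import time
from typing import Callable


def wait_until_ready(get_phase: Callable[[], str],
                     timeout_s: float = 600.0,
                     poll_interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_phase() until it returns "Ready" or the timeout expires.

    Returns True if the checkpoint became Ready, False on timeout.
    The sleep parameter is injectable so the loop can be tested quickly.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_phase() == "Ready":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Passing the phase lookup in as a function keeps the loop independent of how the status is fetched, so it works equally well with a Kubernetes client library or a wrapper around a CLI call.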
Checkpoint Identity
Checkpoints are uniquely identified by a 16-character SHA256 hash (64 bits) of configuration that affects runtime state:
Not included in hash (don’t invalidate checkpoint):
- replicas
- nodeSelector, affinity, tolerations
- resources (requests/limits)
- Logging/observability config
Example with all fields:
Checkpoint Naming: The DynamoCheckpoint CR is automatically named using the 16-character identity hash (e.g., e5962d34ba272638).
Checkpoint Sharing: Multiple DGDs with the same identity automatically share the same checkpoint.
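To illustrate why DGDs that differ only in scheduling or scaling settings share a checkpoint, here is a minimal sketch of a hash-based identity. It is an assumption-laden illustration, not the operator's actual algorithm: the canonical JSON serialization, the truncation to the first 16 hex characters, and the sample field names (`model`, `backend`) are all hypothetical. Only the excluded fields and the 16-character SHA256-based format come from this document.

```python
import hashlib
import json

# Fields this document lists as NOT part of the checkpoint identity.
EXCLUDED_FIELDS = {"replicas", "nodeSelector", "affinity", "tolerations", "resources"}


def identity_hash(config: dict) -> str:
    """Return a 16-hex-character (64-bit) SHA256-based hash of the identity fields."""
    identity = {k: v for k, v in config.items() if k not in EXCLUDED_FIELDS}
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # Hypothetical field names and values, for illustration only.
    base = {"model": "example-org/example-model", "backend": "vllm", "replicas": 1}
    scaled = dict(base, replicas=4)  # only an excluded field changes
    other = dict(base, model="example-org/other-model")  # an identity field changes

    print(identity_hash(base) == identity_hash(scaled))
    print(identity_hash(base) == identity_hash(other))
```

In this sketch, changing `replicas` leaves the hash unchanged, while changing the model produces a different hash and therefore a different checkpoint. Hashes computed this way will not match the ones the operator generates; use auto mode or read the name from an existing DynamoCheckpoint CR to obtain real values.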
DynamoCheckpoint CRD
The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
When to create a DynamoCheckpoint directly:
- Pre-warming: Create checkpoints before deploying DGDs so the first pods can restore instead of cold-starting
- Explicit control: Manage checkpoint lifecycle independently from DGDs
Note: With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in auto mode.
Create a checkpoint:
Note: You can compute the hash yourself, or use auto mode to let the operator create it.
Check status:
Phases:
Detailed status:
Reference from DGD:
Once the checkpoint is Ready, you can reference it by hash:
Or use auto mode and the operator will find/create it automatically.
Limitations
⚠️ Important: ChReK has significant limitations that impact production readiness:
Security Considerations
- 🔴 Privileged DaemonSet: The ChReK DaemonSet runs in privileged mode with hostPID, hostIPC, and hostNetwork to perform CRIU operations externally
- Workload pods (checkpoint jobs, restore pods) do not need privileged mode; all CRIU privilege lives in the DaemonSet
- The privileged DaemonSet has elevated host access, which may violate security policies in many production environments
Technical Limitations
- vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- Single-node only: Checkpoints must be created and restored on the same node
- Single-GPU only: Multi-GPU configurations are not yet supported
- Network state: Active TCP connections are closed during restore (handled with the tcp-close CRIU option)
- Storage: Only the PVC backend is currently implemented (S3/OCI planned)
Recommendation
ChReK is experimental/beta and best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls
- ❌ Security-sensitive production workloads without proper risk assessment
Troubleshooting
Checkpoint Not Creating
- Check the checkpoint job:
- Check the DaemonSet:
- Verify storage access:
Restore Failing
- Check pod logs:
- Verify checkpoint file exists:
- Check environment variables:
Cold Start Despite Checkpoint
Pods fall back to cold start if:
- Checkpoint file doesn’t exist yet (still being created)
- Checkpoint file is corrupted
- CRIU restore fails
Check the pod logs for the “Falling back to cold start” message.
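When logs are long, a small filter makes the fallback message easier to spot. This is a minimal sketch: the helper name `find_fallback_lines` is hypothetical, and it assumes the message appears verbatim on a single log line.

```python
FALLBACK_MARKER = "Falling back to cold start"


def find_fallback_lines(log_text: str) -> list[str]:
    """Return the log lines that contain the cold-start fallback message."""
    return [line for line in log_text.splitlines() if FALLBACK_MARKER in line]
```

Feed it the text captured from `kubectl logs <pod-name>`. An empty result means the pod did not report a fallback in the captured output, and the lines around any match usually show which of the causes above triggered it.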
Best Practices
- Use RWX PVCs for multi-node deployments (currently the only supported backend)
- Pre-warm checkpoints before scaling up
- Monitor checkpoint size - large models create large checkpoints
- Clean up old checkpoints to save storage
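For the cleanup practice, one cautious approach is to list candidates first and delete by hand. The sketch below is a hypothetical helper, not a ChReK tool. It assumes the checkpoint PVC is mounted locally and that each checkpoint is a top-level entry under the mount path, and it uses modification time as a stand-in for age. None of these assumptions are confirmed by this document.

```python
import time
from pathlib import Path
from typing import Optional


def stale_checkpoints(root: Path,
                      max_age_days: float,
                      now: Optional[float] = None) -> list[Path]:
    """List top-level entries under root not modified within max_age_days.

    This only reports candidates; it never deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(p for p in root.iterdir() if p.stat().st_mtime < cutoff)
```

Before removing anything it reports, confirm that no DynamoCheckpoint CR or DGD still refers to that checkpoint, since modification time says nothing about whether a checkpoint is still in use.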
Environment Variables
Complete Example
Create a checkpoint and use it in a DGD:
Related Documentation
- ChReK Overview - ChReK architecture and use cases
- ChReK Standalone Usage Guide - Use ChReK without Dynamo Platform
- ChReK Helm Chart README - Chart configuration
- Installation Guide - Platform installation
- API Reference - Complete CRD specifications