Checkpointing
⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.
ChReK (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). ChReK dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.
What is ChReK?
ChReK provides:
- Fast cold starts: Restore GPU-accelerated applications in seconds instead of minutes
- CUDA state preservation: Checkpoint and restore GPU memory and CUDA contexts
- Kubernetes-native: Integrates seamlessly with Kubernetes primitives
- Storage flexibility: PVC-based storage (S3/OCI planned for future releases)
- Namespace isolation: Each namespace gets its own checkpoint infrastructure
Use Cases
1. With NVIDIA Dynamo Platform (Recommended)
Use ChReK as part of the Dynamo platform for automatic checkpoint management:
- Automatic checkpoint creation and lifecycle management
- Seamless integration with DynamoGraphDeployment CRDs
- Built-in autoscaling with fast restore
📖 Read the Dynamo Integration Guide →
2. Standalone (Without Dynamo)
Use ChReK independently in your own Kubernetes applications:
- Manual checkpoint job creation
- Build your own restore-enabled container images
- Full control over checkpoint lifecycle
📖 Read the Standalone Usage Guide →
Architecture
ChReK consists of two main components:
1. ChReK Helm Chart
Deploys the checkpoint/restore infrastructure:
- DaemonSet: Runs on GPU nodes to perform CRIU checkpoint operations
- PVC: Stores checkpoint data (rootfs diffs, CUDA memory state)
- RBAC: Namespace-scoped or cluster-wide permissions
- Seccomp Profile: Security policies for CRIU syscalls
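As an illustration of how these pieces are typically configured, here is a hypothetical `values.yaml` sketch. All keys below are assumptions for illustration only; consult the ChReK Helm Chart README for the actual schema.

```yaml
# Hypothetical values; the real keys are defined by the ChReK chart.
storage:
  pvc:
    size: 100Gi                      # checkpoint data: rootfs diffs + CUDA memory state
    storageClassName: my-rwx-class   # RWX class required for multi-node deployments
rbac:
  clusterWide: false                 # false = namespace-scoped permissions
seccomp:
  enabled: true                      # install the seccomp profile for CRIU syscalls
```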
2. External Restore via DaemonSet
The DaemonSet performs checkpoint/restore externally using nsenter to enter pod namespaces:
- Checkpoint: Freezes the running process and dumps state (CPU + GPU) to storage
- Restore: Enters a placeholder pod's namespaces and restores the checkpointed process via `nsrestore`
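To make the restore flow concrete, here is a sketch of what a placeholder pod could look like. The image name, PVC name, and mount path are assumptions, not the chart's actual API; the key point (echoed in Troubleshooting below) is that the restore pod must use the same image and volume mounts as the checkpoint job.

```yaml
# Illustrative sketch only; names and paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: restore-pod
spec:
  containers:
    - name: app
      image: my-registry/my-app:placeholder   # image built with the placeholder target
      volumeMounts:
        - name: checkpoint-data
          mountPath: /checkpoints             # hypothetical mount path
  volumes:
    - name: checkpoint-data
      persistentVolumeClaim:
        claimName: chrek-checkpoints          # same PVC the checkpoint job wrote to
```

The DaemonSet then uses nsenter to enter this pod's namespaces and restore the checkpointed process into it.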
Quick Start
Install ChReK Infrastructure
Choose Your Integration Path
- Using Dynamo Platform? → Follow the Dynamo Integration Guide
- Using standalone? → Follow the Standalone Usage Guide
Key Features
✅ Currently Supported
- ✅ vLLM backend only (SGLang and TensorRT-LLM planned)
- ✅ Single-node, single-GPU checkpoints
- ✅ PVC storage backend (RWX for multi-node)
- ✅ CUDA checkpoint/restore
- ✅ PyTorch distributed state (with `GLOO_SOCKET_IFNAME=lo`)
- ✅ Namespace-scoped and cluster-wide RBAC
- ✅ Idempotent checkpoint creation
- ✅ Automatic signal-based checkpoint coordination
🚧 Planned Features
- 🚧 SGLang backend support
- 🚧 TensorRT-LLM backend support
- 🚧 S3/MinIO storage backend
- 🚧 OCI registry storage backend
- 🚧 Multi-GPU checkpoints
- 🚧 Multi-node distributed checkpoints
Limitations
⚠️ Important: ChReK has significant limitations that may impact production readiness:
Security Considerations
- 🔴 Privileged DaemonSet: The ChReK DaemonSet runs in privileged mode with `hostPID`, `hostIPC`, and `hostNetwork` to perform CRIU operations. Workload pods do not need privileged mode; all CRIU privilege lives in the DaemonSet.
- Security Impact: The privileged DaemonSet can:
- Access all host devices and processes
- Bypass most security restrictions
- Potentially compromise node security if exploited
Technical Limitations
- vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- Single-node only: Checkpoints must be created and restored on the same node
- Single-GPU only: Multi-GPU configurations not yet supported
- Network state limitations: Active TCP connections are closed during restore (use the `tcp-close` CRIU option)
- Storage: Only PVC storage is currently implemented (S3/OCI planned)
Recommendation
ChReK is best suited for:
- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls

It is not recommended for:
- ❌ Security-sensitive production workloads without proper risk assessment
Documentation
Getting Started
- Dynamo Integration Guide - Using ChReK with Dynamo Platform
- Standalone Usage Guide - Using ChReK independently
- ChReK Helm Chart README - Helm chart configuration
Related Documentation
- CRIU Documentation - Upstream CRIU docs
Prerequisites
- Kubernetes 1.21+
- GPU nodes with NVIDIA runtime (`nvidia` runtime class)
- containerd runtime (for container inspection; CRIU is bundled in ChReK images)
- RWX storage class (for multi-node deployments)
- Security clearance for privileged DaemonSet (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)
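For the RWX storage requirement above, a minimal PVC sketch is shown below. The claim name and storage class are assumptions, not names the chart mandates.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chrek-checkpoints          # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # RWX so checkpoints are reachable from multiple nodes
  resources:
    requests:
      storage: 100Gi               # sizing depends on model and GPU memory footprint
  storageClassName: my-rwx-class   # hypothetical RWX-capable class
```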
Troubleshooting
Common Issues
DaemonSet not starting?
- Check GPU node labels: `kubectl get nodes -l nvidia.com/gpu.present=true`
- Verify NVIDIA runtime is available
Checkpoint fails?
- Check DaemonSet logs: `kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>`
- Ensure the application properly signals readiness
- Verify CRIU is installed in the runtime
Restore fails?
- Ensure the restore pod uses the same image (built with the `placeholder` target) and volume mounts as the checkpoint job
- Verify the DaemonSet is running on the same node as the restore pod
- Check DaemonSet logs for CRIU errors: `kubectl logs -l app.kubernetes.io/name=chrek`
For detailed troubleshooting, see the guides listed under Documentation above.
Contributing
ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!
License
Apache License 2.0