Checkpointing


āš ļø Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

ChReK (Checkpoint/Restore in Kubernetes) is an experimental infrastructure for fast-starting GPU applications using CRIU (Checkpoint/Restore in User-space). ChReK dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.

What is ChReK?

ChReK provides:

  • Fast cold starts: Restore GPU-accelerated applications in seconds instead of minutes
  • CUDA state preservation: Checkpoint and restore GPU memory and CUDA contexts
  • Kubernetes-native: Integrates seamlessly with Kubernetes primitives
  • Storage flexibility: PVC-based storage (S3/OCI planned for future releases)
  • Namespace isolation: Each namespace gets its own checkpoint infrastructure

Use Cases

1. With Dynamo

Use ChReK as part of the Dynamo platform for automatic checkpoint management:

  • Automatic checkpoint creation and lifecycle management
  • Seamless integration with DynamoGraphDeployment CRDs
  • Built-in autoscaling with fast restore

šŸ“– Read the Dynamo Integration Guide →

2. Standalone (Without Dynamo)

Use ChReK independently in your own Kubernetes applications:

  • Manual checkpoint job creation
  • Build your own restore-enabled container images
  • Full control over checkpoint lifecycle

šŸ“– Read the Standalone Usage Guide →
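
For the standalone path, a manually created checkpoint Job might look roughly like the sketch below. This is illustrative only: the image name, PVC claim name, and mount path are assumptions for the example, not ChReK's actual API — see the Standalone Usage Guide for the real manifest shape.

```yaml
# Illustrative checkpoint Job sketch — image, claim name, and mount path
# are assumptions. The ChReK DaemonSet on the same node performs the
# actual CRIU dump once the workload signals readiness.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-model-checkpoint
  namespace: my-team
spec:
  template:
    spec:
      runtimeClassName: nvidia              # NVIDIA runtime (see Prerequisites)
      restartPolicy: Never
      containers:
        - name: workload
          image: my-registry/my-vllm-app:checkpoint   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1             # single-GPU only (see Limitations)
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: chrek-checkpoints    # PVC from the Helm chart (name assumed)
```

Note the workload pod itself does not request privileged mode; all CRIU privilege lives in the DaemonSet (see Limitations).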

Architecture

ChReK consists of two main components:

1. ChReK Helm Chart

Deploys the checkpoint/restore infrastructure:

  • DaemonSet: Runs on GPU nodes to perform CRIU checkpoint operations
  • PVC: Stores checkpoint data (rootfs diffs, CUDA memory state)
  • RBAC: Namespace-scoped or cluster-wide permissions
  • Seccomp Profile: Security policies for CRIU syscalls
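
These components map onto the chart's values. Only `storage.pvc.size` is confirmed by the Quick Start below; the remaining key names in this sketch are assumptions about how the chart might expose storage class and RBAC scope.

```yaml
# values.yaml — illustrative sketch. storage.pvc.size appears in the
# Quick Start; the other keys are assumed names, check the chart's
# values.yaml for the real ones.
storage:
  pvc:
    size: 100Gi
    storageClassName: my-rwx-class   # RWX class needed for multi-node (assumed key)
rbac:
  clusterWide: false                 # namespace-scoped by default (assumed key)
```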

2. External Restore via DaemonSet

The DaemonSet performs checkpoint/restore externally using nsenter to enter pod namespaces:

  • Checkpoint: Freezes the running process and dumps state (CPU + GPU) to storage
  • Restore: Enters a placeholder pod’s namespaces and restores the checkpointed process via nsrestore
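
The restore side can be pictured as a placeholder pod that the DaemonSet enters. As the Troubleshooting section notes, it must use the same image (built with the placeholder target) and volume mounts as the checkpoint job, and run on the node that created the checkpoint. The field values below (node pinning via `nodeName`, image tag, claim name) are assumptions for illustration.

```yaml
# Illustrative placeholder pod sketch — image tag, claim name, and
# nodeName pinning are assumptions. The DaemonSet enters this pod's
# namespaces and restores the checkpointed process into it.
apiVersion: v1
kind: Pod
metadata:
  name: my-model-restore
  namespace: my-team
spec:
  nodeName: gpu-node-1               # same node as the checkpoint (single-node only)
  runtimeClassName: nvidia
  containers:
    - name: placeholder
      image: my-registry/my-vllm-app:placeholder   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: chrek-checkpoints   # assumed PVC name
```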

Quick Start

Install ChReK Infrastructure

$ helm install chrek nvidia/chrek \
> --namespace my-team \
> --create-namespace \
> --set storage.pvc.size=100Gi

Choose Your Integration Path

Follow the Dynamo Integration Guide for managed deployments on the Dynamo platform, or the Standalone Usage Guide to drive ChReK directly — both are linked above under Use Cases.

Key Features

āœ… Currently Supported

  • āœ… vLLM backend only (SGLang and TensorRT-LLM planned)
  • āœ… Single-node, single-GPU checkpoints
  • āœ… PVC storage backend (RWX for multi-node)
  • āœ… CUDA checkpoint/restore
  • āœ… PyTorch distributed state (with GLOO_SOCKET_IFNAME=lo)
  • āœ… Namespace-scoped and cluster-wide RBAC
  • āœ… Idempotent checkpoint creation
  • āœ… Automatic signal-based checkpoint coordination
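
The "signal-based checkpoint coordination" bullet can be pictured with a minimal shell sketch. This is not ChReK's actual protocol: the signal choice (`USR1`) and the marker file are assumptions for illustration — the workload installs a handler, the agent (simulated here with `kill`) signals it, and the workload quiesces before the CRIU dump.

```shell
#!/usr/bin/env bash
# Minimal coordination sketch — signal name and marker file are
# assumptions, not ChReK's real interface.
MARKER=$(mktemp)

on_checkpoint() {
  # In a real workload: flush buffers, pause request handling, etc.,
  # so the process is in a consistent state when CRIU freezes it.
  echo "quiesced" > "$MARKER"
}
trap on_checkpoint USR1

kill -USR1 $$      # stand-in for the DaemonSet's checkpoint signal
cat "$MARKER"      # prints: quiesced
```

The point of the handler is that the dump happens at a moment the application chose, not mid-request.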

🚧 Planned Features

  • 🚧 SGLang backend support
  • 🚧 TensorRT-LLM backend support
  • 🚧 S3/MinIO storage backend
  • 🚧 OCI registry storage backend
  • 🚧 Multi-GPU checkpoints
  • 🚧 Multi-node distributed checkpoints

Limitations

āš ļø Important: ChReK has significant limitations that may impact production readiness:

Security Considerations

  • šŸ”“ Privileged DaemonSet: The ChReK DaemonSet runs in privileged mode with hostPID, hostIPC, and hostNetwork to perform CRIU operations. Workload pods do not need privileged mode — all CRIU privilege lives in the DaemonSet.
  • Security Impact: The privileged DaemonSet can:
    • Access all host devices and processes
    • Bypass most security restrictions
    • Potentially compromise node security if exploited

Technical Limitations

  • vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
  • Single-node only: Checkpoints must be created and restored on the same node
  • Single-GPU only: Multi-GPU configurations are not yet supported
  • Network state limitations: Active TCP connections are closed during restore (via CRIU's tcp-close option)
  • Storage: Only PVC storage is currently implemented (S3/OCI planned)
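
For reference, `tcp-close` is one of the options CRIU can read from its configuration-file mechanism (options are written as on the command line, without leading dashes). Whether ChReK sets it via a config file like this or passes it as a CLI flag is an assumption; this fragment is only a sketch of the option itself.

```
# CRIU configuration fragment (e.g. /etc/criu/default.conf) — illustrative.
# tcp-close makes CRIU close established TCP connections on dump/restore
# instead of failing on them.
tcp-close
```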

Recommendation

ChReK is best suited for:

  • āœ… Development and testing environments
  • āœ… Research and experimentation
  • āœ… Controlled production environments with appropriate security controls
  • āŒ Security-sensitive production workloads without proper risk assessment

Documentation

Getting Started

Prerequisites

  • Kubernetes 1.21+
  • GPU nodes with NVIDIA runtime (nvidia runtime class)
  • containerd runtime (for container inspection; CRIU is bundled in ChReK images)
  • RWX storage class (for multi-node deployments)
  • Security clearance for privileged DaemonSet (the ChReK agent runs privileged with hostPID/hostIPC/hostNetwork)

Troubleshooting

Common Issues

DaemonSet not starting?

  • Check GPU node labels: kubectl get nodes -l nvidia.com/gpu.present=true
  • Verify NVIDIA runtime is available

Checkpoint fails?

  • Check DaemonSet logs: kubectl logs -l app.kubernetes.io/name=chrek -n <namespace>
  • Ensure application properly signals readiness
  • Verify CRIU is installed in the runtime

Restore fails?

  • Ensure restore pod uses the same image (built with placeholder target) and volume mounts as checkpoint job
  • Verify the DaemonSet is running on the same node as the restore pod
  • Check DaemonSet logs for CRIU errors: kubectl logs -l app.kubernetes.io/name=chrek

For detailed troubleshooting, see the Dynamo Integration Guide and Standalone Usage Guide linked above.

Contributing

ChReK is part of the NVIDIA Dynamo project. Contributions are welcome!

License

Apache License 2.0