Standalone Usage

⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. Review the security implications before deploying.

This guide explains how to use ChReK (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.

Overview

When using ChReK standalone, you are responsible for:

  1. Deploying the ChReK Helm chart (DaemonSet + PVC)
  2. Building checkpoint-enabled container images with the CRIU runtime dependencies
  3. Creating checkpoint jobs with the correct environment variables
  4. Creating restore pods that detect and use the checkpoints

The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.


Using ChReK Without the Dynamo Operator

When using ChReK with the Dynamo operator, the operator automatically configures workload pods for checkpoint/restore. Without the operator, you must handle this configuration manually. This section documents what the operator normally injects and how to replicate it.

Container Naming

The ChReK DaemonSet needs to identify which container in your pod is the model-serving workload (as opposed to sidecars like istio-proxy or log collectors). It resolves the target container by name:

  1. If a container is named main, it is selected
  2. Otherwise, the first container in the pod spec is selected

When using the Dynamo operator, the model container is always named main. In standalone mode, you must either name your model container main or ensure it is the first container listed in your pod spec. All YAML examples in this guide use name: main.
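The resolution rule is simple enough to sketch. The snippet below is illustrative only (it is not ChReK's actual code) and shows the selection logic applied to a list of container names:

```python
def resolve_target_container(container_names: list[str]) -> str:
    """Pick the workload container the way the ChReK DaemonSet does:
    a container named 'main' wins; otherwise fall back to the first one."""
    if "main" in container_names:
        return "main"
    return container_names[0]

# A pod with sidecars: 'main' is selected regardless of its position
print(resolve_target_container(["istio-proxy", "main", "log-collector"]))  # -> main

# No 'main' container: the first container listed is selected
print(resolve_target_container(["vllm-worker", "istio-proxy"]))  # -> vllm-worker
```

The fallback to the first container is why ordering matters in standalone mode: a sidecar injected ahead of your model container would silently become the checkpoint target.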

Seccomp Profile

The operator sets a seccomp profile on all checkpoint/restore workload pods to block io_uring syscalls. The chrek DaemonSet deploys the profile file (profiles/block-iouring.json) to each node, but you must reference it in your pod specs:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-iouring.json

Without this profile, io_uring syscalls during restore can cause CRIU failures.

Sleep Infinity Command for Restore Pods

The operator overrides the container command to ["sleep", "infinity"] on restore-target pods. This produces a Running-but-not-Ready placeholder pod that the chrek DaemonSet watcher detects and restores externally via nsenter. Without this override, the container runs its normal entrypoint (cold-starting instead of waiting for restore).

containers:
- name: main
  image: my-app:checkpoint-enabled
  command: ["sleep", "infinity"]

Recreate Deployment Strategy

The operator forces the Recreate strategy when restore labels are present. This prevents the old and new pods from running simultaneously, which would cause failures: two pods would compete for the same GPU and checkpoint data. If you are using a Deployment, set this manually:

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate

PVC Volume Mount Consistency

CRIU requires identical mount layouts between checkpoint and restore. The operator ensures the checkpoint PVC is mounted at the same path in both the checkpoint job and restore pod. When configuring manually, make sure your checkpoint job and restore pod use the exact same mountPath for the checkpoint PVC (e.g., /checkpoints).
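Concretely, both pod specs should share a fragment like the following (volume and claim names match the full examples later in this guide):

```yaml
# Declare this identically in BOTH the checkpoint job and the restore pod.
volumeMounts:
- name: checkpoint-storage
  mountPath: /checkpoints        # must be the exact same path in both specs
volumes:
- name: checkpoint-storage
  persistentVolumeClaim:
    claimName: chrek-pvc
```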

Downward API Volume (Currently Unused)

The operator injects a Downward API volume at /etc/podinfo for post-restore identity discovery (pod name, namespace, UID). This is not currently consumed by any component — you can skip it for now.

Environment Variables

The following environment variables are normally injected by the operator. They are already documented in the Environment Variables Reference below, but note that without the operator you must set them manually:

  • Checkpoint jobs: DYN_READY_FOR_CHECKPOINT_FILE, DYN_CHECKPOINT_LOCATION, DYN_CHECKPOINT_STORAGE_TYPE, DYN_CHECKPOINT_HASH
  • Restore pods: DYN_CHECKPOINT_PATH, DYN_CHECKPOINT_HASH

Prerequisites

  • Kubernetes cluster with:
    • NVIDIA GPUs with checkpoint support
    • Privileged DaemonSet allowed (⚠️ the ChReK DaemonSet runs privileged - see Security Considerations)
    • PVC storage (ReadWriteMany recommended for multi-node)
  • Docker or compatible container runtime for building images
  • Access to the ChReK source code: deploy/chrek/

Security Considerations

⚠️ Important: The ChReK DaemonSet runs in privileged mode to perform CRIU checkpoint/restore operations. Your workload pods (checkpoint jobs, restore pods) do not need privileged mode — all CRIU privilege lives in the DaemonSet, which performs external restore via nsenter.

  • The DaemonSet has privileged: true, hostPID, hostIPC, and hostNetwork
  • This may violate security policies in production environments
  • If the DaemonSet is compromised, it could potentially compromise node security

Recommended for:

  • ✅ Development and testing environments
  • ✅ Research and experimentation
  • ✅ Controlled production environments with appropriate security controls

Not recommended for:

  • ❌ Multi-tenant clusters without proper isolation
  • ❌ Security-sensitive production workloads without risk assessment
  • ❌ Environments with strict security compliance requirements

Technical Limitations

⚠️ Current Restrictions:

  • vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
  • Single-node only: Checkpoints must be created and restored on the same node
  • Single-GPU only: Multi-GPU configurations are not yet supported
  • Network state: Active TCP connections are closed during restore
  • Storage: Only PVC backend currently implemented (S3/OCI planned)

Step 1: Deploy ChReK

Install the Helm Chart

$# Clone the repository
$git clone https://github.com/ai-dynamo/dynamo.git
$cd dynamo
$
$# Install ChReK in your namespace
$helm install chrek ./deploy/helm/charts/chrek \
> --namespace my-app \
> --create-namespace \
> --set storage.pvc.size=100Gi \
> --set storage.pvc.storageClass=your-storage-class

Verify Installation

$# Check the DaemonSet is running
$kubectl get daemonset -n my-app
$# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE
$# chrek-agent 3 3 3 3 3
$
$# Check the PVC is bound
$kubectl get pvc -n my-app
$# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
$# chrek-pvc Bound pvc-xyz 100Gi RWX your-storage-class

Step 2: Build Checkpoint-Enabled Images

ChReK provides a placeholder target in its Dockerfile that layers CRIU runtime dependencies onto your existing container images. The DaemonSet performs restore externally via nsenter, so these dependencies must be present in the image.

$cd deploy/chrek
$
$# Define your images
$export BASE_IMAGE="your-app:latest" # Your existing application image
$export RESTORE_IMAGE="your-app:checkpoint-enabled" # Output checkpoint-enabled image
$
$# Build using the placeholder target
$docker build \
> --target placeholder \
> --build-arg BASE_IMAGE="$BASE_IMAGE" \
> -t "$RESTORE_IMAGE" \
> .
$
$# Push to your registry
$docker push "$RESTORE_IMAGE"

Example with a Dynamo vLLM image:

$cd deploy/chrek
$
$export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
$export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"
$
$docker build \
> --target placeholder \
> --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
> -t "$RESTORE_IMAGE" \
> .

What the Placeholder Target Does

The ChReK Dockerfile’s placeholder stage automatically:

  • ✅ Installs CRIU runtime libraries (required by nsrestore running inside the pod’s namespaces)
  • ✅ Copies the criu binary to /usr/local/sbin/criu
  • ✅ Copies cuda-checkpoint to /usr/local/sbin/cuda-checkpoint (used for CUDA state checkpoint/restore)
  • ✅ Copies nsrestore to /usr/local/bin/nsrestore (invoked by DaemonSet via nsenter)
  • ✅ Creates checkpoint directories (/checkpoints, /var/run/criu, /var/criu-work)
  • ✅ Preserves your original application image contents

The placeholder image does not override the entrypoint or CMD. For restore pods, the operator (or you, in standalone mode) overrides the command to sleep infinity.

💡 Tip: Using the placeholder target is the recommended approach as it’s maintained with the ChReK codebase and ensures compatibility.


Step 3: Create Checkpoint Jobs

A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.

Required Environment Variables

Your checkpoint job MUST set these environment variables:

| Variable | Description | Example |
|---|---|---|
| DYN_READY_FOR_CHECKPOINT_FILE | Path where your app signals it's ready | /tmp/ready-for-checkpoint |
| DYN_CHECKPOINT_HASH | Unique identifier for this checkpoint | abc123def456 |
| DYN_CHECKPOINT_LOCATION | Directory where checkpoint is stored | /checkpoints/abc123def456 |
| DYN_CHECKPOINT_STORAGE_TYPE | Storage backend type | pvc |
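The hash is a 16-character hex identifier (see the Environment Variables Reference), and how you derive it is up to you. One approach, sketched below and not mandated by ChReK, is to hash the inputs that define the checkpoint (image plus model) so the identifier is reproducible across runs:

```shell
# Illustrative only: derive a reproducible 16-char hex hash from the image
# and model that define this checkpoint. Any scheme works, as long as the
# checkpoint job and restore pod agree on the value.
IMAGE="my-app:checkpoint-enabled"
MODEL="meta-llama/Llama-3-8B"
HASH=$(printf '%s|%s' "$IMAGE" "$MODEL" | sha256sum | cut -c1-16)
echo "$HASH"
```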

Required Labels

Add this label to enable DaemonSet checkpoint detection:

labels:
  nvidia.com/chrek-is-checkpoint-source: "true"

Example Checkpoint Job

apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/chrek-is-checkpoint-source: "true"    # Required for DaemonSet detection
        nvidia.com/chrek-checkpoint-hash: "abc123def456" # Must match DYN_CHECKPOINT_HASH
    spec:
      restartPolicy: Never

      # Seccomp profile to block io_uring syscalls (deployed by the chrek DaemonSet)
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: profiles/block-iouring.json

      containers:
      - name: main
        image: my-app:checkpoint-enabled

        # Readiness probe: Pod becomes Ready when model is loaded
        # This is what triggers the DaemonSet to start checkpointing
        readinessProbe:
          exec:
            command: ["cat", "/tmp/ready-for-checkpoint"]
          initialDelaySeconds: 15
          periodSeconds: 2

        # Remove liveness/startup probes for checkpoint jobs
        # Model loading can take several minutes
        livenessProbe: null
        startupProbe: null

        # Checkpoint-related environment variables
        env:
        - name: DYN_READY_FOR_CHECKPOINT_FILE
          value: "/tmp/ready-for-checkpoint"
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"
        - name: DYN_CHECKPOINT_LOCATION
          value: "/checkpoints/abc123def456"
        - name: DYN_CHECKPOINT_STORAGE_TYPE
          value: "pvc"

        # GPU request
        resources:
          limits:
            nvidia.com/gpu: 1

        # Required volume mounts
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints

      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: chrek-pvc

Application Code Requirements

Your application must implement the checkpoint flow. The DaemonSet communicates with your application via Unix signals (not files):

  • SIGUSR1: Checkpoint completed — your process should exit gracefully
  • SIGCONT: Restore completed — your process should wake up and continue
  • SIGKILL: Checkpoint failed — process is terminated immediately (unhandleable)
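This contract can be exercised outside Kubernetes with a small sketch (illustrative, not part of ChReK): handlers are installable for SIGUSR1 and SIGCONT, while the kernel rejects any attempt to trap SIGKILL.

```python
import os
import signal

caught = set()

# SIGUSR1 (checkpoint done) and SIGCONT (restore done) can be trapped.
signal.signal(signal.SIGUSR1, lambda signum, frame: caught.add(signum))
signal.signal(signal.SIGCONT, lambda signum, frame: caught.add(signum))

os.kill(os.getpid(), signal.SIGUSR1)  # simulate the watcher's "checkpoint completed"
os.kill(os.getpid(), signal.SIGCONT)  # simulate the watcher's "restore completed"
assert caught == {signal.SIGUSR1, signal.SIGCONT}

# SIGKILL is different: the kernel refuses to install a handler for it,
# which is why a failed checkpoint simply terminates the process.
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
except OSError:
    print("SIGKILL cannot be caught")
```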

Here’s the pattern used by Dynamo vLLM (see components/src/dynamo/vllm/checkpoint_restore.py):

import asyncio
import os
import signal

async def main():
    ready_file = os.environ.get("DYN_READY_FOR_CHECKPOINT_FILE")
    if not ready_file:
        # Not in checkpoint mode, run normally
        await run_application()
        return

    print("Checkpoint mode detected")

    # 1. Load your model/application
    model = await load_model()

    # 2. Put model to sleep for CRIU-friendly GPU state
    await model.sleep()

    # 3. Install signal handlers BEFORE writing the ready file to avoid a race
    #    where the DaemonSet sends a signal while default disposition (terminate)
    #    is still in effect. No handler needed for checkpoint failure — the
    #    watcher sends SIGKILL which terminates the process immediately.
    checkpoint_done = asyncio.Event()
    restore_done = asyncio.Event()

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGUSR1, checkpoint_done.set)
    loop.add_signal_handler(signal.SIGCONT, restore_done.set)

    # 4. Write ready file — triggers DaemonSet checkpoint via readiness probe
    with open(ready_file, "w") as f:
        f.write("ready")

    print("Ready for checkpoint. Waiting for watcher signal...")

    # Wait for whichever signal comes first (SIGKILL on failure kills us
    # immediately, so only success/restore signals reach this point)
    done, pending = await asyncio.wait(
        [asyncio.create_task(checkpoint_done.wait()),
         asyncio.create_task(restore_done.wait())],
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()

    if restore_done.is_set():
        # SIGCONT: Process was restored from checkpoint
        print("Restore complete, waking model")
        await model.wake_up()
        await run_application()
    else:
        # SIGUSR1: Checkpoint complete, exit
        print("Checkpoint complete, exiting")

Important Notes:

  1. Ready File & Readiness Probe: The checkpoint job must have a readiness probe that checks for the ready file. The ChReK DaemonSet triggers checkpointing when:

    • Pod has nvidia.com/chrek-is-checkpoint-source: "true" label
    • Pod status is Ready (readiness probe passes = ready file exists)
  2. Signal handler ordering: Install signal handlers before writing the ready file. Otherwise there is a race window where the DaemonSet sends a signal while the default disposition (terminate) is still in effect.

  3. Signal-based coordination: The DaemonSet sends SIGUSR1 after checkpoint completes, SIGCONT after restore completes, and SIGKILL if checkpoint fails. Your application must handle SIGUSR1 and SIGCONT (not poll for files). SIGKILL cannot be caught — the kernel terminates the process immediately.

  4. Three exit paths:

    • SIGUSR1 received: Checkpoint complete, exit gracefully
    • SIGCONT received: Process was restored, wake model and continue
    • SIGKILL received: Checkpoint failed, process terminated immediately (no handler needed)

Step 4: Restore from Checkpoints

The DaemonSet performs restore externally — your restore pod just needs to be a placeholder that sleeps until the DaemonSet restores the checkpointed process into it.

Example Restore Pod

apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
  labels:
    nvidia.com/chrek-is-restore-target: "true"       # Required: watcher detects restore pods by this label
    nvidia.com/chrek-checkpoint-hash: "abc123def456" # Required: watcher uses this to locate the checkpoint
spec:
  restartPolicy: Never

  # Seccomp profile to block io_uring syscalls (deployed by the chrek DaemonSet)
  # Without this, io_uring syscalls may cause CRIU restore failures
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/block-iouring.json

  containers:
  - name: main
    image: my-app:checkpoint-enabled

    # Override command to sleep — the chrek DaemonSet performs external restore
    # on Running-but-not-Ready pods. Without this, the container would cold-start.
    command: ["sleep", "infinity"]

    # Set checkpoint environment variables
    env:
    - name: DYN_CHECKPOINT_HASH
      value: "abc123def456"  # Must match checkpoint job
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)

    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1

    # CRIU needs write access for restore.log — do NOT set readOnly
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints

  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc

How Restore Works

  1. Pod starts as placeholder: The sleep infinity command keeps the pod Running but not Ready
  2. DaemonSet detects restore pod: The watcher finds pods with nvidia.com/chrek-is-restore-target=true that are Running but not Ready
  3. External restore via nsenter: The DaemonSet enters the pod’s namespaces and performs CRIU restore, including GPU state
  4. Application continues: Your application resumes exactly where it was checkpointed

Environment Variables Reference

Checkpoint Jobs

| Variable | Required | Description |
|---|---|---|
| DYN_READY_FOR_CHECKPOINT_FILE | Yes | Full path where app signals readiness (e.g., /tmp/ready-for-checkpoint) |
| DYN_CHECKPOINT_HASH | Yes | Unique checkpoint identifier (16-char hex string) |
| DYN_CHECKPOINT_LOCATION | Yes | Directory where checkpoint is stored (e.g., /checkpoints/abc123def456) |
| DYN_CHECKPOINT_STORAGE_TYPE | Yes | Storage backend: pvc, s3, or oci |

Restore Pods

| Variable | Required | Description |
|---|---|---|
| DYN_CHECKPOINT_HASH | Yes | Checkpoint identifier (must match checkpoint job) |
| DYN_CHECKPOINT_PATH | Yes | Base checkpoint directory (hash appended automatically) |
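As a concrete illustration of how the two restore variables combine (assuming plain path concatenation, consistent with the values used in this guide's examples):

```python
import os.path

checkpoint_path = "/checkpoints"   # DYN_CHECKPOINT_PATH (base directory)
checkpoint_hash = "abc123def456"   # DYN_CHECKPOINT_HASH

# The watcher appends the hash to the base path to locate the checkpoint,
# yielding the same directory as DYN_CHECKPOINT_LOCATION in the checkpoint job.
location = os.path.join(checkpoint_path, checkpoint_hash)
print(location)  # /checkpoints/abc123def456
```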

Signals (DaemonSet → Application)

The DaemonSet communicates checkpoint/restore completion via Unix signals, not files:

| Signal | Direction | Meaning |
|---|---|---|
| SIGUSR1 | DaemonSet → checkpoint pod | Checkpoint completed, process should exit |
| SIGCONT | DaemonSet → restored pod | Restore completed, process should wake up |
| SIGKILL | DaemonSet → checkpoint pod | Checkpoint failed — process terminated immediately |

CRIU tuning options are configured via the ChReK Helm chart’s config.checkpoint.criu values, not environment variables. See the Helm Chart Values for available options.
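For orientation, a values override would take roughly the shape below. The option names under config.checkpoint.criu here are placeholders, not real chart keys; consult the chart's values.yaml for the actual options:

```yaml
# values-override.yaml — illustrative structure only. The keys under
# config.checkpoint.criu below are hypothetical; check the chart's
# values.yaml for the real option names.
config:
  checkpoint:
    criu:
      someOption: someValue   # placeholder, not a real key
```

Apply it with helm upgrade chrek ./deploy/helm/charts/chrek -n my-app -f values-override.yaml.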


Checkpoint Flow Explained

1. Checkpoint Creation Flow

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/chrek-is-checkpoint-source=true label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file │
│ /tmp/ready-for-checkpoint │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects: │
│ - Pod is Ready │
│ - Has chrek-is-checkpoint-source label │
│ - Has chrek-checkpoint-hash label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint: │
│ - Freezes container process │
│ - Dumps memory (CPU + GPU) │
│ - Saves to /checkpoints/${HASH}/ │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet sends SIGUSR1 to the application process │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. Application receives SIGUSR1 and exits gracefully │
└─────────────────────────────────────────────────────────────┘

2. Restore Flow

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with restore labels and sleep infinity │
│ (Running but not Ready) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. ChReK DaemonSet detects: │
│ - Pod is Running but not Ready │
│ - Has chrek-is-restore-target label │
│ - Has chrek-checkpoint-hash label │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. DaemonSet performs external restore via nsenter: │
│ - Enters pod's namespaces (mount, net, pid, ipc) │
│ - Runs nsrestore with CRIU inside the pod's context │
│ - Restores memory (CPU + GPU via cuda-checkpoint) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. DaemonSet sends SIGCONT to the restored process │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. Application receives SIGCONT, wakes model, continues │
│ (Model already loaded, GPU memory initialized) │
└─────────────────────────────────────────────────────────────┘

Troubleshooting

Checkpoint Not Created

Symptom: Job runs but no checkpoint appears in /checkpoints/

Checks:

  1. Verify the pod has the label:

    $kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/chrek-is-checkpoint-source}'
  2. Check pod readiness:

    $kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  3. Check ready file was created:

    $kubectl exec <pod-name> -- ls -la /tmp/ready-for-checkpoint
  4. Check DaemonSet logs:

    $kubectl logs -n my-app daemonset/chrek-agent --all-containers

Restore Fails

Symptom: Pod fails to restore from checkpoint

Checks:

  1. Verify checkpoint files exist:

    $kubectl exec <pod-name> -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/
  2. Check DaemonSet logs for restore errors:

    $kubectl logs -n my-app daemonset/chrek-agent --all-containers
  3. Check pod events for restore status annotations:

    $kubectl describe pod <pod-name>
  4. Ensure checkpoint and restore have same:

    • Container image (built with placeholder target)
    • GPU count
    • Volume mounts (same mountPath for checkpoint PVC)

Restore Pod Not Detected

Symptom: Pod runs sleep infinity but DaemonSet never restores it

Checks:

  1. Verify the pod has the required labels:

    $kubectl get pod <pod-name> -o jsonpath='{.metadata.labels}'

    Must have both nvidia.com/chrek-is-restore-target: "true" and nvidia.com/chrek-checkpoint-hash: "<hash>".

  2. Verify the pod is Running but not Ready (this is the trigger):

    $kubectl get pod <pod-name> -o jsonpath='{.status.phase}'
    $kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  3. Verify the DaemonSet is running on the same node:

    $kubectl get pods -n my-app -l app.kubernetes.io/name=chrek -o wide

Getting Help

If you encounter issues:

  1. Check the Troubleshooting section
  2. Review DaemonSet logs: kubectl logs -n <namespace> daemonset/chrek-agent
  3. Open an issue on GitHub