Integration with Dynamo


Checkpoint/Restore for Fast Pod Startup

⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.

Overview

Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.

| Startup Type | Time | What Happens |
|---|---|---|
| Cold Start | ~3 min | Download model, load to GPU, initialize engine |
| Warm Start (checkpoint) | ~30 sec | Restore from checkpoint tar |

Prerequisites

  • Dynamo Platform installed (v0.4.0+)
  • ChReK Helm chart installed (separate from platform)
  • GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
  • RWX PVC storage (PVC is currently the only supported backend)

Quick Start

1. Install ChReK Infrastructure

First, install the ChReK Helm chart in each namespace where you need checkpointing:

```bash
# Install ChReK infrastructure
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```

This creates:

  • A PVC for checkpoint storage (chrek-pvc)
  • A DaemonSet for CRIU operations (chrek-agent)

2. Configure Operator Values

Update your Helm values to point to the ChReK infrastructure:

```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "chrek-pvc"      # Must match ChReK chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/chrek/signals"  # Must match ChReK chart
```

3. Configure Your DGD

Add checkpoint configuration to your service:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    VllmWorker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-3-8B
          resources:
            limits:
              nvidia.com/gpu: "1"

      # Checkpoint configuration
      checkpoint:
        enabled: true
        mode: auto  # Automatically create checkpoint if not found
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
```

4. Deploy

```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```

On first deployment:

  1. A checkpoint job runs to create the checkpoint
  2. Worker pods start with cold start (checkpoint not ready yet)
  3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint

Storage Backends

PVC (Currently Supported)

Use when you have RWX storage available (e.g., NFS, EFS, Filestore).

```yaml
checkpoint:
  storage:
    type: pvc
    pvc:
      pvcName: "chrek-pvc"
      basePath: "/checkpoints"
```

Requirements:

  • RWX (ReadWriteMany) PVC for multi-node access
  • Sufficient storage (checkpoints are ~10-50GB per model)
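The ChReK Helm chart normally provisions this PVC for you; if you need to create one yourself, an RWX claim would look roughly like the following sketch (the `nfs-client` StorageClass name is an assumption — substitute whatever RWX-capable class your cluster provides):

```yaml
# Illustrative RWX PVC; the ChReK chart normally creates chrek-pvc itself.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chrek-pvc
  namespace: my-team
spec:
  accessModes:
    - ReadWriteMany          # Required so checkpoints are readable from any node
  storageClassName: nfs-client  # Placeholder: any RWX-capable class (NFS, EFS, Filestore)
  resources:
    requests:
      storage: 100Gi         # Checkpoints run ~10-50GB per model
```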

S3 / MinIO (Planned - Not Yet Implemented)

⚠️ Note: S3 storage backend is defined in the API but not yet fully implemented.

Object storage support is planned for a future release. The configuration will look like:

```yaml
checkpoint:
  storage:
    type: s3  # Not yet supported
    s3:
      # AWS S3
      uri: "s3://my-bucket/checkpoints"

      # Or MinIO / custom S3
      uri: "s3://minio.example.com/my-bucket/checkpoints"

      # Optional: credentials secret
      credentialsSecretRef: "s3-creds"
```

OCI Registry (Planned - Not Yet Implemented)

⚠️ Note: OCI registry storage backend is defined in the API but not yet fully implemented.

Container registry storage support is planned for a future release. The configuration will look like:

```yaml
checkpoint:
  storage:
    type: oci  # Not yet supported
    oci:
      uri: "oci://myregistry.io/checkpoints"
      credentialsSecretRef: "registry-creds"  # Docker config secret
```

Checkpoint Modes

Auto Mode

In auto mode, the operator automatically creates a DynamoCheckpoint CR if one doesn't exist:

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
```

Reference Mode

Reference an existing DynamoCheckpoint CR by its 16-character hash using checkpointRef:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
```

This is useful when:

  • You want to pre-warm checkpoints before creating DGDs
  • You want explicit control over which checkpoint to use

Flow:

  1. Create a DynamoCheckpoint CR (see DynamoCheckpoint CRD section)
  2. Wait for it to become Ready
  3. Reference it in your DGD using checkpointRef with the hash
```bash
# Check checkpoint status (using the 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system

NAME               MODEL                   BACKEND   PHASE   HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B   vllm      Ready   e5962d34ba272638   5m

# Now create the DGD referencing it
kubectl apply -f my-dgd.yaml
```

Checkpoint Identity

Checkpoints are uniquely identified by a 16-character SHA256 hash (64 bits) of configuration that affects runtime state:

All of the following fields are part of the hash:

| Field | Example |
|---|---|
| model | meta-llama/Llama-3-8B |
| framework | vllm, sglang, trtllm |
| dynamoVersion | 0.9.0, 1.0.0 |
| tensorParallelSize | 1, 2, 4, 8 (default: 1) |
| pipelineParallelSize | 1, 2 (default: 1) |
| dtype | float16, bfloat16, fp8 |
| maxModelLen | 4096, 8192 |
| extraParameters | Custom key-value pairs |

Not included in hash (don’t invalidate checkpoint):

  • replicas
  • nodeSelector, affinity, tolerations
  • resources (requests/limits)
  • Logging/observability config

Example with all fields:

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

Checkpoint Naming: The DynamoCheckpoint CR is automatically named using the 16-character identity hash (e.g., e5962d34ba272638).

Checkpoint Sharing: Multiple DGDs with the same identity automatically share the same checkpoint.
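The operator's exact canonicalization is internal, but conceptually the identity hash is a truncated SHA-256 over the sorted identity fields. A hypothetical sketch (the serialization format is an assumption, so hashes produced here will not match the operator's):

```python
import hashlib
import json

def identity_hash(identity: dict) -> str:
    """Hypothetical sketch: derive a 16-character (64-bit) identity hash.

    The real operator's canonicalization may differ; this only illustrates
    hashing a sorted, serialized view of the identity fields, so that the
    same identity always yields the same name regardless of field order.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

h = identity_hash({
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
})
print(h)  # 16 hex characters
```

Because fields like `replicas` or `resources` never enter the dict, changing them leaves the hash (and thus the shared checkpoint) unchanged.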

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

When to create a DynamoCheckpoint directly:

  • Pre-warming: Create checkpoints before deploying DGDs for instant startup
  • Explicit control: Manage checkpoint lifecycle independently from DGDs

Note: With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in auto mode.

Create a checkpoint:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"

  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```

Note: You can compute the hash yourself, or use auto mode to let the operator create it.

Check status:

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system
# Or use the shortname
kubectl get dckpt -n dynamo-system

NAME               MODEL                    BACKEND   PHASE      HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B    vllm      Ready      e5962d34ba272638   5m
a7b4f89c12de3456   meta-llama/Llama-3-70B   vllm      Creating   a7b4f89c12de3456   2m
```

Phases:

| Phase | Description |
|---|---|
| Pending | CR created, waiting for job to start |
| Creating | Checkpoint job is running |
| Ready | Checkpoint available for use |
| Failed | Checkpoint creation failed |

Detailed status:

```bash
kubectl describe dckpt e5962d34ba272638 -n dynamo-system
```

```
Status:
  Phase:         Ready
  IdentityHash:  e5962d34ba272638
  Location:      /checkpoints/e5962d34ba272638
  StorageType:   pvc
  CreatedAt:     2026-01-29T10:05:00Z
```

Reference from DGD:

Once the checkpoint is Ready, you can reference it by hash:

```yaml
spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # 16-char hash
```

Or use auto mode and the operator will find/create it automatically.

Limitations

⚠️ Important: ChReK has significant limitations that impact production readiness:

Security Considerations

  • 🔴 Privileged DaemonSet: The ChReK DaemonSet runs in privileged mode with hostPID, hostIPC, and hostNetwork to perform CRIU operations externally
  • Workload pods (checkpoint jobs, restore pods) do not need privileged mode — all CRIU privilege lives in the DaemonSet
  • The privileged DaemonSet has elevated host access, which may violate security policies in many production environments

Technical Limitations

  • vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
  • Single-node only: Checkpoints must be created and restored on the same node
  • Single-GPU only: Multi-GPU configurations are not yet supported
  • Network state: Active TCP connections are closed during restore (handled with tcp-close CRIU option)
  • Storage: Only PVC backend currently implemented (S3/OCI planned)

Recommendation

ChReK is experimental/beta and best suited for:

  • ✅ Development and testing environments
  • ✅ Research and experimentation
  • ✅ Controlled production environments with appropriate security controls
  • ❌ Security-sensitive production workloads without proper risk assessment

Troubleshooting

Checkpoint Not Creating

  1. Check the checkpoint job:

     ```bash
     kubectl get jobs -l nvidia.com/chrek-is-checkpoint-source=true -n dynamo-system
     kubectl logs job/checkpoint-<name> -n dynamo-system
     ```

  2. Check the DaemonSet:

     ```bash
     kubectl logs daemonset/chrek-agent -n dynamo-system
     ```

  3. Verify storage access:

     ```bash
     kubectl exec -it <checkpoint-agent-pod> -- ls -la /checkpoints
     ```

Restore Failing

  1. Check pod logs:

     ```bash
     kubectl logs <worker-pod> -n dynamo-system
     ```

  2. Verify the checkpoint file exists:

     ```bash
     # For PVC
     kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/

     # For S3 (once the S3 backend is implemented)
     aws s3 ls s3://my-bucket/checkpoints/
     ```

  3. Check environment variables:

     ```bash
     kubectl exec <worker-pod> -- env | grep DYN_CHECKPOINT
     ```

Cold Start Despite Checkpoint

Pods fall back to cold start if:

  • Checkpoint file doesn’t exist yet (still being created)
  • Checkpoint file is corrupted
  • CRIU restore fails

Check logs for “Falling back to cold start” message.
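The fallback decision itself is simple; roughly, a restore entrypoint behaves like this sketch (function and path handling are hypothetical, not ChReK's actual entrypoint code):

```python
import os

def should_restore(checkpoint_dir: str) -> bool:
    """Hypothetical sketch of the restore-vs-cold-start decision.

    Restore only if the checkpoint directory exists and is non-empty;
    a downstream CRIU failure would likewise trigger the fallback.
    """
    if not os.path.isdir(checkpoint_dir):
        return False   # Checkpoint not created yet
    if not os.listdir(checkpoint_dir):
        return False   # Directory exists but is empty/incomplete
    return True

def start_worker(checkpoint_dir: str) -> str:
    if should_restore(checkpoint_dir):
        return "restore"      # ~30 sec warm start
    print("Falling back to cold start")
    return "cold-start"       # ~3 min cold start
```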

Best Practices

  1. Use RWX PVCs for multi-node deployments (currently the only supported backend)
  2. Pre-warm checkpoints before scaling up
  3. Monitor checkpoint size - large models create large checkpoints
  4. Clean up old checkpoints to save storage
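Cleanup (point 4) can be scripted against the PVC. A sketch that prunes checkpoint directories untouched for more than N days — the flat `/checkpoints/<hash>` layout and the 14-day cutoff are assumptions, and nothing like this ships with ChReK:

```python
import shutil
import time
from pathlib import Path

def prune_checkpoints(base: str, max_age_days: float = 14) -> list[str]:
    """Delete checkpoint directories under `base` whose mtime is older
    than max_age_days. Returns the names of the pruned checkpoints."""
    cutoff = time.time() - max_age_days * 86400
    pruned = []
    for entry in Path(base).iterdir():
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)       # Remove the whole checkpoint dir
            pruned.append(entry.name)
    return pruned
```

Run it from any pod that mounts the checkpoint PVC; make sure no DGD still references a hash before pruning it.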

Environment Variables

| Variable | Description |
|---|---|
| DYN_CHECKPOINT_STORAGE_TYPE | Backend: pvc, s3, oci |
| DYN_CHECKPOINT_LOCATION | Full checkpoint location (checkpoint jobs) |
| DYN_CHECKPOINT_PATH | Base checkpoint directory (restore pods, PVC) |
| DYN_CHECKPOINT_HASH | Identity hash |
| DYN_READY_FOR_CHECKPOINT_FILE | Ready-for-checkpoint file path (checkpoint jobs) |

Complete Example

Create a checkpoint and use it in a DGD:

```yaml
# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args:
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--tensor-parallel-size"
              - "1"
              - "--dtype"
              - "bfloat16"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    VllmWorker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # Reference by hash
```