Integration with Dynamo


Checkpoint/Restore for Fast Pod Startup

⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Reduce cold start times for LLM inference workers from ~3 minutes to ~30 seconds using container checkpointing.

Overview

Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.

| Startup Type | Time | What Happens |
|---|---|---|
| Cold Start | ~3 min | Download model, load to GPU, initialize engine |
| Warm Start (checkpoint) | ~30 sec | Restore from checkpoint tar |

Prerequisites

  • Dynamo Platform installed (v0.4.0+)
  • ChReK Helm chart installed (separate from platform)
  • GPU nodes with containerd runtime (CRIU is bundled in ChReK images)
  • RWX PVC storage (PVC is currently the only supported backend)

Quick Start

1. Install ChReK Infrastructure

First, install the ChReK Helm chart in each namespace where you need checkpointing:

```bash
# Install ChReK infrastructure
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi
```

This creates:

  • A PVC for checkpoint storage (chrek-pvc)
  • A DaemonSet for CRIU operations (chrek-agent)

2. Configure Operator Values

Update your Helm values to point to the ChReK infrastructure:

```yaml
# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "chrek-pvc"      # Must match ChReK chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/chrek/signals"  # Must match ChReK chart
```

3. Configure Your DGD

Add checkpoint configuration to your service:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    VllmWorker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-3-8B
          resources:
            limits:
              nvidia.com/gpu: "1"

      # Checkpoint configuration
      checkpoint:
        enabled: true
        mode: auto  # Automatically create checkpoint if not found
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
```

4. Deploy

```bash
kubectl apply -f my-llm.yaml -n dynamo-system
```

On first deployment:

  1. A checkpoint job runs to create the checkpoint
  2. Worker pods start with cold start (checkpoint not ready yet)
  3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint

Storage Backends

PVC (Currently Supported)

Use when you have RWX storage available (e.g., NFS, EFS, Filestore).

```yaml
checkpoint:
  storage:
    type: pvc
    pvc:
      pvcName: "chrek-pvc"
      basePath: "/checkpoints"
```

Requirements:

  • RWX (ReadWriteMany) PVC for multi-node access
  • Sufficient storage (checkpoints are ~10-50GB per model)
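The ChReK Helm chart normally provisions this PVC for you; if you need to create one yourself, an RWX claim would look roughly like the following sketch (the `nfs-client` StorageClass name is an assumption — substitute whatever RWX-capable class your cluster provides):

```yaml
# Illustrative RWX PVC; the ChReK chart normally creates chrek-pvc itself.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chrek-pvc
  namespace: my-team
spec:
  accessModes:
    - ReadWriteMany          # Required so checkpoints are readable from any node
  storageClassName: nfs-client  # Placeholder: any RWX-capable class (NFS, EFS, Filestore)
  resources:
    requests:
      storage: 100Gi         # Checkpoints run ~10-50GB per model
```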

S3 / MinIO (Planned - Not Yet Implemented)

⚠️ Note: S3 storage backend is defined in the API but not yet fully implemented.

Object storage support is planned for a future release. The configuration will look like:

```yaml
checkpoint:
  storage:
    type: s3  # Not yet supported
    s3:
      # AWS S3
      uri: "s3://my-bucket/checkpoints"

      # Or MinIO / custom S3
      uri: "s3://minio.example.com/my-bucket/checkpoints"

      # Optional: credentials secret
      credentialsSecretRef: "s3-creds"
```

OCI Registry (Planned - Not Yet Implemented)

⚠️ Note: OCI registry storage backend is defined in the API but not yet fully implemented.

Container registry storage support is planned for a future release. The configuration will look like:

```yaml
checkpoint:
  storage:
    type: oci  # Not yet supported
    oci:
      uri: "oci://myregistry.io/checkpoints"
      credentialsSecretRef: "registry-creds"  # Docker config secret
```

Checkpoint Modes

Auto Mode

In auto mode, the operator automatically creates a DynamoCheckpoint CR if one doesn't exist:

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
```

Reference Mode

Reference an existing DynamoCheckpoint CR by its 16-character hash using checkpointRef:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR
```

This is useful when:

  • You want to pre-warm checkpoints before creating DGDs
  • You want explicit control over which checkpoint to use

Flow:

  1. Create a DynamoCheckpoint CR (see DynamoCheckpoint CRD section)
  2. Wait for it to become Ready
  3. Reference it in your DGD using checkpointRef with the hash
```bash
# Check checkpoint status (using the 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system

NAME               MODEL                   BACKEND   PHASE   HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B   vllm      Ready   e5962d34ba272638   5m

# Now create the DGD referencing it
kubectl apply -f my-dgd.yaml
```

Checkpoint Identity

Checkpoints are uniquely identified by a 16-character SHA256 hash (64 bits) of configuration that affects runtime state:

All of the following fields are part of the hash:

| Field | Example |
|---|---|
| model | meta-llama/Llama-3-8B |
| framework | vllm, sglang, trtllm |
| dynamoVersion | 0.9.0, 1.0.0 |
| tensorParallelSize | 1, 2, 4, 8 (default: 1) |
| pipelineParallelSize | 1, 2 (default: 1) |
| dtype | float16, bfloat16, fp8 |
| maxModelLen | 4096, 8192 |
| extraParameters | Custom key-value pairs |

Not included in hash (don’t invalidate checkpoint):

  • replicas
  • nodeSelector, affinity, tolerations
  • resources (requests/limits)
  • Logging/observability config

Example with all fields:

```yaml
checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

Checkpoint Naming: The DynamoCheckpoint CR is automatically named using the 16-character identity hash (e.g., e5962d34ba272638).

Checkpoint Sharing: Multiple DGDs with the same identity automatically share the same checkpoint.
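The operator's exact canonicalization is internal, but conceptually the identity hash is a truncated SHA-256 over the sorted identity fields. A hypothetical sketch (the serialization format is an assumption, so hashes produced here will not match the operator's):

```python
import hashlib
import json

def identity_hash(identity: dict) -> str:
    """Hypothetical sketch: derive a 16-character (64-bit) identity hash.

    The real operator's canonicalization may differ; this only illustrates
    hashing a sorted, serialized view of the identity fields, so that the
    same identity always yields the same name regardless of field order.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

h = identity_hash({
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
})
print(h)  # 16 hex characters
```

Because fields like `replicas` or `resources` never enter the dict, changing them leaves the hash (and thus the shared checkpoint) unchanged.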

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

When to create a DynamoCheckpoint directly:

  • Pre-warming: Create checkpoints before deploying DGDs for instant startup
  • Explicit control: Manage checkpoint lifecycle independently from DGDs

Note: With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in auto mode.

Create a checkpoint:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"

  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
```

Note: You can compute the hash yourself, or use auto mode to let the operator create it.

Check status:

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system
# Or use the shortname
kubectl get dckpt -n dynamo-system

NAME               MODEL                    BACKEND   PHASE      HASH               AGE
e5962d34ba272638   meta-llama/Llama-3-8B    vllm      Ready      e5962d34ba272638   5m
a7b4f89c12de3456   meta-llama/Llama-3-70B   vllm      Creating   a7b4f89c12de3456   2m
```

Phases:

| Phase | Description |
|---|---|
| Pending | CR created, waiting for job to start |
| Creating | Checkpoint job is running |
| Ready | Checkpoint available for use |
| Failed | Checkpoint creation failed |

Detailed status:

```bash
kubectl describe dckpt e5962d34ba272638 -n dynamo-system
```

```
Status:
  Phase:         Ready
  IdentityHash:  e5962d34ba272638
  Location:      /checkpoints/e5962d34ba272638
  StorageType:   pvc
  CreatedAt:     2026-01-29T10:05:00Z
```

Reference from DGD:

Once the checkpoint is Ready, you can reference it by hash:

```yaml
spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # 16-char hash
```

Or use auto mode and the operator will find/create it automatically.

Limitations

⚠️ Important: ChReK has significant limitations that impact production readiness:

Security Considerations

  • 🔴 Privileged DaemonSet: The ChReK DaemonSet runs in privileged mode with hostPID, hostIPC, and hostNetwork to perform CRIU operations externally
  • Workload pods (checkpoint jobs, restore pods) do not need privileged mode — all CRIU privilege lives in the DaemonSet
  • The privileged DaemonSet has elevated host access, which may violate security policies in many production environments

Technical Limitations

  • vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
  • Single-node only: Checkpoints must be created and restored on the same node
  • Single-GPU only: Multi-GPU configurations are not yet supported
  • Network state: Active TCP connections are closed during restore (handled with tcp-close CRIU option)
  • Storage: Only PVC backend currently implemented (S3/OCI planned)

Recommendation

ChReK is experimental/beta and best suited for:

  • ✅ Development and testing environments
  • ✅ Research and experimentation
  • ✅ Controlled production environments with appropriate security controls
  • ❌ Security-sensitive production workloads without proper risk assessment

Troubleshooting

Checkpoint Not Creating

  1. Check the checkpoint job:

     ```bash
     kubectl get jobs -l nvidia.com/chrek-is-checkpoint-source=true -n dynamo-system
     kubectl logs job/checkpoint-<name> -n dynamo-system
     ```

  2. Check the DaemonSet:

     ```bash
     kubectl logs daemonset/chrek-agent -n dynamo-system
     ```

  3. Verify storage access:

     ```bash
     kubectl exec -it <checkpoint-agent-pod> -- ls -la /checkpoints
     ```

Restore Failing

  1. Check pod logs:

     ```bash
     kubectl logs <worker-pod> -n dynamo-system
     ```

  2. Verify the checkpoint file exists:

     ```bash
     # For PVC
     kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/

     # For S3 (once the S3 backend is implemented)
     aws s3 ls s3://my-bucket/checkpoints/
     ```

  3. Check environment variables:

     ```bash
     kubectl exec <worker-pod> -- env | grep DYN_CHECKPOINT
     ```

Cold Start Despite Checkpoint

Pods fall back to cold start if:

  • Checkpoint file doesn’t exist yet (still being created)
  • Checkpoint file is corrupted
  • CRIU restore fails

Check logs for “Falling back to cold start” message.
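The fallback decision itself is simple; roughly, a restore entrypoint behaves like this sketch (function and path handling are hypothetical, not ChReK's actual entrypoint code):

```python
import os

def should_restore(checkpoint_dir: str) -> bool:
    """Hypothetical sketch of the restore-vs-cold-start decision.

    Restore only if the checkpoint directory exists and is non-empty;
    a downstream CRIU failure would likewise trigger the fallback.
    """
    if not os.path.isdir(checkpoint_dir):
        return False   # Checkpoint not created yet
    if not os.listdir(checkpoint_dir):
        return False   # Directory exists but is empty/incomplete
    return True

def start_worker(checkpoint_dir: str) -> str:
    if should_restore(checkpoint_dir):
        return "restore"      # ~30 sec warm start
    print("Falling back to cold start")
    return "cold-start"       # ~3 min cold start
```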

Best Practices

  1. Use RWX PVCs for multi-node deployments (currently the only supported backend)
  2. Pre-warm checkpoints before scaling up
  3. Monitor checkpoint size - large models create large checkpoints
  4. Clean up old checkpoints to save storage
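Cleanup (point 4) can be scripted against the PVC. A sketch that prunes checkpoint directories untouched for more than N days — the flat `/checkpoints/<hash>` layout and the 14-day cutoff are assumptions, and nothing like this ships with ChReK:

```python
import shutil
import time
from pathlib import Path

def prune_checkpoints(base: str, max_age_days: float = 14) -> list[str]:
    """Delete checkpoint directories under `base` whose mtime is older
    than max_age_days. Returns the names of the pruned checkpoints."""
    cutoff = time.time() - max_age_days * 86400
    pruned = []
    for entry in Path(base).iterdir():
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)       # Remove the whole checkpoint dir
            pruned.append(entry.name)
    return pruned
```

Run it from any pod that mounts the checkpoint PVC; make sure no DGD still references a hash before pruning it.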

Environment Variables

| Variable | Description |
|---|---|
| DYN_CHECKPOINT_STORAGE_TYPE | Backend: pvc, s3, oci |
| DYN_CHECKPOINT_LOCATION | Full checkpoint location (checkpoint jobs) |
| DYN_CHECKPOINT_PATH | Base checkpoint directory (restore pods, PVC) |
| DYN_CHECKPOINT_HASH | Identity hash |
| DYN_READY_FOR_CHECKPOINT_FILE | Ready-for-checkpoint file path (checkpoint jobs) |

Complete Example

Create a checkpoint and use it in a DGD:

```yaml
# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args:
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--tensor-parallel-size"
              - "1"
              - "--dtype"
              - "bfloat16"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    VllmWorker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # Reference by hash
```