Snapshotting GPU Workers | NVIDIA Dynamo Documentation

⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may only be functional in some cluster setups. The snapshot-agent DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Dynamo Snapshot is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA’s cuda-checkpoint utility. The usual flow is:

start a worker once and checkpoint its initialized state
store that checkpoint on a namespace-local snapshot volume
restore later workers from that checkpoint instead of cold-starting again

Startup Type	Time	What Happens
Cold Start	~1 min	Download model, load to GPU, initialize engine
Warm Start (restore from checkpoint)	~10 sec	Restore from a ready checkpoint directory

⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

For more background on the snapshot architecture and startup improvements, see NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes.

Prerequisites

x86_64 (amd64) GPU nodes
NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
vLLM or SGLang backend today
Checkpoint storage. ReadWriteMany is the safest default for cross-node or concurrent multi-node access, but podMount mode can also use suitable ReadWriteOnce storage for sequential checkpoint/restore workflows.
CRI-O / OpenShift: set runtime.type=crio on the snapshot chart (and openshift.enabled=true on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

Quick Start via `DynamoCheckpoint` CR

Build a placeholder image
Install the snapshot chart
Create a DynamoCheckpoint and wait for it to become ready
Deploy a DynamoGraphDeployment that restores from the corresponding checkpointRef

1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:

$ export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
$ export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
$ 
$ cd deploy/snapshot
$ 
$ make docker-build-placeholder \
>   PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
$ 
$ make docker-push-placeholder \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

The placeholder image preserves the normal runtime entrypoint/command contract and adds the criu, cuda-checkpoint, and nsrestore tooling needed for checkpoint and restore.

To build either snapshot image against a custom CRIU fork or ref, pass CRIU_REPO and CRIU_REF through make. If they are unset, the Dockerfile defaults are used.

$ make docker-build-agent \
>   IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
>   CRIU_REPO="${YOUR_CRIU_REPO}" \
>   CRIU_REF="branch-or-sha"
$ 
$ make docker-build-placeholder \
>   PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
>   CRIU_REPO="${YOUR_CRIU_REPO}" \
>   CRIU_REF="branch-or-sha"

2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading dynamo-platform, the operator only needs checkpointing enabled:

1 dynamo-operator:
2   checkpoint:
3     enabled: true

If the platform is already installed, verify that the operator config contains the checkpoint block:

$ OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
>   -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
>   -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
$ 
$ kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
>   -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'

Verify that the rendered config includes enabled: true.

3. Install the snapshot chart

For the default namespace-local mode, install the snapshot chart in each workload namespace. The chart creates the PVC and the agent in that namespace:

$ helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
>   --namespace ${NAMESPACE} \
>   --create-namespace \
>   --set storage.pvc.create=true

In the default agentMount mode, the snapshot-agent DaemonSet mounts the checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on multiple nodes may mount the same PVC, so the PVC generally needs ReadWriteMany. The chart defaults to that mode. If your cluster does not have a default storage class, also set storage.pvc.storageClass.

If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and set storage.pvc.name instead.

CRI-O or OpenShift: append for example --set runtime.type=crio and, on OpenShift, --set openshift.enabled=true (see deploy/helm/charts/snapshot/README.md).

For clusters that prefer one privileged snapshot agent instead of one DaemonSet per workload namespace, install the chart once in an infrastructure namespace. In this mode the chart does not create workload PVCs; the Dynamo operator either creates each namespace-local PVC or verifies that it already exists:

$ helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
>   --namespace dynamo-system \
>   --create-namespace \
>   --set storage.accessMode=podMount \
>   --set storage.pvc.create=false \
>   --set rbac.namespaceRestricted=false

To let the operator create the workload PVC in each namespace that uses checkpoint/restore, configure the operator with create: true:

1 dynamo-operator:
2   checkpoint:
3     enabled: true
4     storage:
5       type: pvc
6       pvc:
7         pvcName: snapshot-pvc
8         basePath: /checkpoints
9         create: true
10         size: 1Ti
11         storageClassName: ""
12         accessMode: ReadWriteMany

The chart and operator use separate configuration surfaces here: the snapshot chart PVC name is storage.pvc.name, while the operator config field is checkpoint.storage.pvc.pvcName.

This is a key difference from agentMount: podMount removes the requirement that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node. Only the active checkpoint/restore workload pod mounts the PVC, and the agent reaches it through that pod’s mount namespace. ReadWriteMany remains the safest operator-managed default, especially when multiple checkpoint/restore pods may access the same PVC concurrently or when restore scheduling can span nodes. Suitable ReadWriteOnce storage classes can still be used for sequential podMount checkpoint/restore flows when the backend can attach the volume to the node running the active workload pod.

podMount depends on the target container remaining alive while the agent resolves /host/proc/<pid>/root/<basePath>. If the container exits or restarts during checkpoint/restore setup, if the runtime cannot expose a stable host PID, or if node security settings prevent host proc traversal, the agent fails or skips that attempt and Kubernetes/operator reconciliation must try again after a fresh container is available.

To use an already-present PVC instead, omit create or set it to false. The operator will fail reconciliation with a clear error if the named PVC does not exist in the workload namespace.

Verify that the DaemonSet is ready. After a checkpoint or restore workload is reconciled, verify the workload namespace PVC:

$ kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$ kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$ kubectl get pvc snapshot-pvc -n ${NAMESPACE}

4. Create a standalone `DynamoCheckpoint`

checkpoint.mode is deprecated. Use checkpoint.enabled and omit mode. spec.identity is legacy v1alpha1 compatibility only; use checkpointRef to restore an existing checkpoint.

The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the deprecated legacy spec.identity metadata, a container named main, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only main is checkpointed unless spec.job.targetContainerName selects another container.

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoCheckpoint
3 metadata:
4   name: qwen3-06b-bf16
5 spec:
6   identity:
7     model: Qwen/Qwen3-0.6B
8     backendFramework: vllm
9     tensorParallelSize: 1
10     dtype: bfloat16
11     maxModelLen: 2048
12 
13   job:
14     activeDeadlineSeconds: 3600
15     podTemplateSpec:
16       spec:
17         ...
18         containers:
19           - name: main
20             image: registry.example.com/dynamo/vllm-placeholder:1.0.0
21             ...

GMS + Snapshot support is currently disabled.

For a full working example, see deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml.

Apply it:

$ kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}

5. Wait for the checkpoint to become ready

$ kubectl get dckpt -n ${NAMESPACE} \
>   -o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase
$ 
$ kubectl wait \
>   --for=jsonpath='{.status.phase}'=Ready \
>   dynamocheckpoint/qwen3-06b-bf16 \
>   -n ${NAMESPACE} \
>   --timeout=30m

The useful status fields are:

status.phase: high-level lifecycle (Pending, Creating, Ready, Failed)
status.checkpointID: artifact ID used by the snapshot protocol
status.identityHash: deprecated compatibility alias for status.checkpointID
status.jobName: checkpoint Job name
status.createdAt: timestamp recorded when the checkpoint became ready
status.message: progress or failure detail when available

6. Deploy a `DynamoGraphDeployment` that restores from `checkpointRef`

Once the checkpoint is Ready, restore a worker from it explicitly:

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: vllm-checkpointref-demo
5 spec:
6   services:
7     Frontend:
8       componentType: frontend
9       replicas: 1
10       extraPodSpec:
11         mainContainer:
12           image: registry.example.com/dynamo/vllm-runtime:1.0.0
13 
14     VllmDecodeWorker:
15       componentType: worker
16       replicas: 1
17       checkpoint:
18         enabled: true
19         checkpointRef: qwen3-06b-bf16
20       extraPodSpec:
21         mainContainer:
22           image: registry.example.com/dynamo/vllm-placeholder:1.0.0
23           ...
24         ...

Apply it:

$ kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
$ kubectl get pods -n ${NAMESPACE} -w

The VllmDecodeWorker pod should restore from the ready checkpoint instead of creating a new one.

Choosing a DGD checkpoint flow

Enable checkpointing with checkpoint.enabled: true. The presence of checkpointRef selects the restore flow:

Config	What happens	Use when
`enabled: true` and no `checkpointRef`	The DGD creates and owns one automatic checkpoint per worker generation.	Normal DGD-managed checkpointing.
`enabled: true` and `checkpointRef: <name>`	The DGD restores from the named existing checkpoint and creates none.	Explicit restore from a retained or pre-warmed checkpoint.

In v1beta1, replace the old mode: Auto form with:

1 experimental:
2   checkpoint:
3     enabled: true

In v1alpha1, keep the existing enable flag and omit mode:

1 checkpoint:
2   enabled: true

The old mode field is deprecated. Omit it in new configs.

startupPolicy controls when normal workers start relative to checkpoint readiness. Immediate starts workers cold while the checkpoint job runs; WaitForCheckpoint keeps replicas at zero until the checkpoint is ready.

deletionPolicy applies only to DGD-owned automatic checkpoints. Delete removes the checkpoint CR and artifact with the DGD; Retain keeps them so they can be used later with checkpointRef.

DGD-managed automatic checkpoints

Without checkpointRef, the DGD-managed path is used. For each checkpoint-enabled worker generation, the DGD controller creates a DGD-owned DynamoCheckpoint, and the checkpoint controller starts a checkpoint Job. Automatic checkpoints are not reused across DGDs, even when two manifests are identical.

The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash lets a scale down/up on the same worker generation use the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation.

Treat a restored pod template as a compatibility template for the same workload: once the checkpoint is ready, restore admission replaces the target container’s command and args with the restore placeholder, and the restored process resumes from the checkpointed state rather than newly supplied command-line or environment settings.

With startupPolicy: Immediate, existing Pods are not mutated or restarted just because the checkpoint became ready. Scale or roll the worker to create restored Pods after the checkpoint is ready.

For v1beta1 components, the automatic checkpoint config is:

1 experimental:
2   checkpoint:
3     enabled: true
4     startupPolicy: Immediate # default; optional
5     deletionPolicy: Delete  # default; use Retain to keep CR/artifact after DGD deletion

For v1alpha1 services, the automatic checkpoint config is:

1 checkpoint:
2   enabled: true
3   startupPolicy: Immediate # default; optional
4   deletionPolicy: Delete  # default; use Retain to keep CR/artifact after DGD deletion

Inside a DynamoGraphDeployment, it looks like this:

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: vllm-auto-demo
5 spec:
6   services:
7     Frontend:
8       componentType: frontend
9       replicas: 1
10       extraPodSpec:
11         mainContainer:
12           image: registry.example.com/dynamo/vllm-runtime:1.0.0
13 
14     VllmDecodeWorker:
15       componentType: worker
16       replicas: 1
17       checkpoint:
18         enabled: true
19         startupPolicy: Immediate
20         deletionPolicy: Delete
21       extraPodSpec:
22         mainContainer:
23           image: registry.example.com/dynamo/vllm-placeholder:1.0.0
24           ...
25         ...

The legacy checkpoint.identity field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone DynamoCheckpoint workflows.

Useful inspection commands:

$ kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
>   -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'
$ 
$ kubectl get dckpt -n ${NAMESPACE}

If you use the default Immediate policy and want to create restored pods after the checkpoint becomes ready, scale the worker:

$ kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
>   -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'

Failover Restore

Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see Shadow Engine Failover.

Lower-Level Testing With `snapshotctl`

It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level snapshotctl utility. However, the snapshot helm chart must be installed, with a running snapshot-agent DaemonSet in the namespace with the checkpoint PVC mounted.

snapshotctl is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see deploy/snapshot/cmd/snapshotctl/README.md.

Checkpoint from a worker pod manifest

$ snapshotctl checkpoint \
>   --manifest ./worker-pod.yaml \
>   --container main \
>   --namespace ${NAMESPACE}

The checkpoint manifest must be for a pod and use a placeholder image. --container names the workload container to checkpoint.

If you do not pass --checkpoint-id, snapshotctl generates one and prints it:

status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...

Restore from a worker pod manifest

$ snapshotctl restore \
>   --manifest ./worker-pod.yaml \
>   --namespace ${NAMESPACE} \
>   --checkpoint-id manual-snapshot-... \
>   --containers main

This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.

Restore an existing pod in place

$ snapshotctl restore \
>   --pod existing-restore-target \
>   --namespace ${NAMESPACE} \
>   --checkpoint-id manual-snapshot-... \
>   --containers main

This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.

Checkpoint IDs and Legacy Identity

status.checkpointID is the artifact ID used by the snapshot protocol and the directory name under checkpoint storage. For DGD-managed automatic checkpoints, this ID is scoped to a single DGD/component worker generation. It is not a compatibility claim across DGDs, and identical manifests are not treated as proof that a checkpoint can be reused safely.

The deprecated legacy spec.identity shape is still required on standalone v1alpha1 DynamoCheckpoint objects. When a standalone checkpoint does not already have status.checkpointID or the checkpoint-ID label, the operator computes the legacy 16-character SHA256 hash (64 bits) from these fields:

Legacy field	Required	Affects legacy hash	Example
`model`	✓	✓	`meta-llama/Llama-3-8B`
`backendFramework`	✓	✓	`vllm`
`dynamoVersion`		✓	`0.9.0`, `1.0.0`
`tensorParallelSize`		✓	`1`, `2`, `4`, `8`
`pipelineParallelSize`		✓	`1`, `2`
`dtype`		✓	`float16`, `bfloat16`, `fp8`
`maxModelLen`		✓	`4096`, `8192`
`extraParameters`		✓	custom key-value pairs

Fields that do not change the legacy hash include:

replica count
node placement (nodeSelector, affinity, tolerations)
resource requests/limits
logging or observability configuration

DGD-managed automatic checkpoints ignore legacy identity as a reuse boundary: omit identity; use checkpointRef to restore an existing checkpoint. The DGD controller creates a DGD-scoped checkpoint ID and only synthesizes legacy identity because the v1alpha1 DynamoCheckpoint API still requires it.

Checkpoint job launch behavior is inferred from the checkpoint Job pod template (GPU resources or DRA claims), not from legacy identity fields such as tensorParallelSize or pipelineParallelSize.

`DynamoCheckpoint` CRD

The DynamoCheckpoint (shortname: dckpt) is the operator-managed resource for checkpoint lifecycle.

Use it when you want:

pre-warmed checkpoints before any DynamoGraphDeployment exists
explicit lifecycle control independent from a DGD
a stable human-readable name that services can reference with checkpointRef

The operator requires:

spec.identity
spec.job.podTemplateSpec

spec.job.backoffLimit is deprecated and ignored. Checkpoint Jobs are always single-attempt.

Check status with:

$ kubectl get dckpt -n ${NAMESPACE}
$ kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
$ kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml

The status block looks like:

1 status:
2   phase: Ready
3   checkpointID: 3bff874d069f0ed5
4   identityHash: 3bff874d069f0ed5 # deprecated compatibility alias
5   jobName: checkpoint-job-3bff874d069f0ed5-1
6   createdAt: "2026-01-29T10:05:00Z"
7   message: ""

Limitations

Backend support is limited: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
Worker coverage is narrow: specialized workers such as multimodal, embedding, and diffusion are not supported.
Multi-GPU remains preview: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
GMS restore remains experimental: GMS + Snapshot is currently disabled.
Admission is create-only: with DGD startupPolicy: Immediate, only Pods created after a checkpoint is Ready are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
Restore admission must be installed: DGD restores rely on the snapshot Pod mutating webhook, so upgrade the snapshot chart/webhook configuration along with the operator and CRDs when enabling these features.
Network state is sensitive: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
Privileged DaemonSet required: snapshot-agent must run privileged to execute CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

Troubleshooting

Checkpoint Job finishes but the checkpoint never becomes `Ready`

Snapshot only becomes Ready after snapshot-agent confirms the checkpoint contents. A completed Job is not enough by itself.

$ kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} \
>   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName
$ 
$ JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
$ if [ -n "${JOB_NAME}" ]; then
$   kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
$ fi
$ 
$ kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers

If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.

Restore cannot find or mount checkpoint storage

For the default agentMount install, restore discovers checkpoint storage from the snapshot-agent DaemonSet in the workload namespace. That DaemonSet must be ready and must mount the checkpoint PVC.

$ kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
$ kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
$ kubectl get pvc -n ${NAMESPACE}

For a shared-agent podMount install, the snapshot-agent DaemonSet can run in the infrastructure namespace instead. Verify the shared-agent pods there, then verify that the workload namespace has the checkpoint PVC that the operator created or validated:

$ kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$ kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$ kubectl get pvc snapshot-pvc -n ${NAMESPACE}

In podMount mode the agent reaches the checkpoint through the workload pod’s mount namespace rather than by mounting the PVC itself. Check the workload pod’s checkpoint storage annotations and the snapshot-agent logs to see the actual resolved checkpoint path. snapshotctl uses the chart’s storage resolution path, so for lower-level snapshotctl debugging make sure the snapshot chart configuration matches the access mode you are testing.

`snapshotctl` manifest is rejected or the restore target is wrong

snapshotctl requires a Pod manifest and a target-container list. Multi-container manifests are supported as long as every name passed via --container or --containers exists in the pod spec.

$ snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
$ snapshotctl restore  --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>

If the manifest already carries snapshot target metadata, it must agree with the CLI flag; snapshotctl rejects mismatches instead of silently picking one.

Planned Features

Stable multi-GPU and multinode support
TensorRT-LLM support