Snapshotting GPU Workers

View as Markdown

⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may only be functional in some cluster setups. The snapshot-agent DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Dynamo Snapshot is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA’s cuda-checkpoint utility. The usual flow is:

  1. start a worker once and checkpoint its initialized state
  2. store that checkpoint on a namespace-local snapshot volume
  3. restore later workers from that checkpoint instead of cold-starting again
Startup TypeTimeWhat Happens
Cold Start~1 minDownload model, load to GPU, initialize engine
Warm Start (restore from checkpoint)~10 secRestore from a ready checkpoint directory

⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

For more background on the snapshot architecture and startup improvements, see NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes.

Prerequisites

  • x86_64 (amd64) GPU nodes
  • NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
  • vLLM or SGLang backend today
  • Checkpoint storage. ReadWriteMany is the safest default for cross-node or concurrent multi-node access, but podMount mode can also use suitable ReadWriteOnce storage for sequential checkpoint/restore workflows.
  • CRI-O / OpenShift: set runtime.type=crio on the snapshot chart (and openshift.enabled=true on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

Quick Start via DynamoCheckpoint CR

  1. Build a placeholder image
  2. Install the snapshot chart
  3. Create a DynamoCheckpoint and wait for it to become ready
  4. Deploy a DynamoGraphDeployment that restores from the corresponding checkpointRef

1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:

$export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
$export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
$
$cd deploy/snapshot
$
$make docker-build-placeholder \
> PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
$
$make docker-push-placeholder \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

The placeholder image preserves the normal runtime entrypoint/command contract and adds the criu, cuda-checkpoint, and nsrestore tooling needed for checkpoint and restore.

To build either snapshot image against a custom CRIU fork or ref, pass CRIU_REPO and CRIU_REF through make. If they are unset, the Dockerfile defaults are used.

$make docker-build-agent \
> IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
> CRIU_REPO="${YOUR_CRIU_REPO}" \
> CRIU_REF="branch-or-sha"
$
$make docker-build-placeholder \
> PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
> CRIU_REPO="${YOUR_CRIU_REPO}" \
> CRIU_REF="branch-or-sha"

2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading dynamo-platform, the operator only needs checkpointing enabled:

1dynamo-operator:
2 checkpoint:
3 enabled: true

If the platform is already installed, verify that the operator config contains the checkpoint block:

$OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
> -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
> -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
$
$kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
> -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'

Verify that the rendered config includes enabled: true.

3. Install the snapshot chart

For the default namespace-local mode, install the snapshot chart in each workload namespace. The chart creates the PVC and the agent in that namespace:

$helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
> --namespace ${NAMESPACE} \
> --create-namespace \
> --set storage.pvc.create=true

In the default agentMount mode, the snapshot-agent DaemonSet mounts the checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on multiple nodes may mount the same PVC, so the PVC generally needs ReadWriteMany. The chart defaults to that mode. If your cluster does not have a default storage class, also set storage.pvc.storageClass.

If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and set storage.pvc.name instead.

CRI-O or OpenShift: append for example --set runtime.type=crio and, on OpenShift, --set openshift.enabled=true (see deploy/helm/charts/snapshot/README.md).

For clusters that prefer one privileged snapshot agent instead of one DaemonSet per workload namespace, install the chart once in an infrastructure namespace. In this mode the chart does not create workload PVCs; the Dynamo operator either creates each namespace-local PVC or verifies that it already exists:

$helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
> --namespace dynamo-system \
> --create-namespace \
> --set storage.accessMode=podMount \
> --set storage.pvc.create=false \
> --set rbac.namespaceRestricted=false

To let the operator create the workload PVC in each namespace that uses checkpoint/restore, configure the operator with create: true:

1dynamo-operator:
2 checkpoint:
3 enabled: true
4 storage:
5 type: pvc
6 pvc:
7 pvcName: snapshot-pvc
8 basePath: /checkpoints
9 create: true
10 size: 1Ti
11 storageClassName: ""
12 accessMode: ReadWriteMany

The chart and operator use separate configuration surfaces here: the snapshot chart PVC name is storage.pvc.name, while the operator config field is checkpoint.storage.pvc.pvcName.

This is a key difference from agentMount: podMount removes the requirement that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node. Only the active checkpoint/restore workload pod mounts the PVC, and the agent reaches it through that pod’s mount namespace. ReadWriteMany remains the safest operator-managed default, especially when multiple checkpoint/restore pods may access the same PVC concurrently or when restore scheduling can span nodes. Suitable ReadWriteOnce storage classes can still be used for sequential podMount checkpoint/restore flows when the backend can attach the volume to the node running the active workload pod.

podMount depends on the target container remaining alive while the agent resolves /host/proc/<pid>/root/<basePath>. If the container exits or restarts during checkpoint/restore setup, if the runtime cannot expose a stable host PID, or if node security settings prevent host proc traversal, the agent fails or skips that attempt and Kubernetes/operator reconciliation must try again after a fresh container is available.

To use an already-present PVC instead, omit create or set it to false. The operator will fail reconciliation with a clear error if the named PVC does not exist in the workload namespace.

Verify that the DaemonSet is ready. After a checkpoint or restore workload is reconciled, verify the workload namespace PVC:

$kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc snapshot-pvc -n ${NAMESPACE}

4. Create a standalone DynamoCheckpoint

checkpoint.mode is deprecated. Use checkpoint.enabled and omit mode. spec.identity is legacy v1alpha1 compatibility only; use checkpointRef to restore an existing checkpoint.

The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the deprecated legacy spec.identity metadata, a container named main, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only main is checkpointed unless spec.job.targetContainerName selects another container.

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoCheckpoint
3metadata:
4 name: qwen3-06b-bf16
5spec:
6 identity:
7 model: Qwen/Qwen3-0.6B
8 backendFramework: vllm
9 tensorParallelSize: 1
10 dtype: bfloat16
11 maxModelLen: 2048
12
13 job:
14 activeDeadlineSeconds: 3600
15 podTemplateSpec:
16 spec:
17 ...
18 containers:
19 - name: main
20 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
21 ...

GMS + Snapshot support is currently disabled.

For a full working example, see deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml.

Apply it:

$kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}

5. Wait for the checkpoint to become ready

$kubectl get dckpt -n ${NAMESPACE} \
> -o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase
$
$kubectl wait \
> --for=jsonpath='{.status.phase}'=Ready \
> dynamocheckpoint/qwen3-06b-bf16 \
> -n ${NAMESPACE} \
> --timeout=30m

The useful status fields are:

  • status.phase: high-level lifecycle (Pending, Creating, Ready, Failed)
  • status.checkpointID: artifact ID used by the snapshot protocol
  • status.identityHash: deprecated compatibility alias for status.checkpointID
  • status.jobName: checkpoint Job name
  • status.createdAt: timestamp recorded when the checkpoint became ready
  • status.message: progress or failure detail when available

6. Deploy a DynamoGraphDeployment that restores from checkpointRef

Once the checkpoint is Ready, restore a worker from it explicitly:

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: vllm-checkpointref-demo
5spec:
6 services:
7 Frontend:
8 componentType: frontend
9 replicas: 1
10 extraPodSpec:
11 mainContainer:
12 image: registry.example.com/dynamo/vllm-runtime:1.0.0
13
14 VllmDecodeWorker:
15 componentType: worker
16 replicas: 1
17 checkpoint:
18 enabled: true
19 checkpointRef: qwen3-06b-bf16
20 extraPodSpec:
21 mainContainer:
22 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
23 ...
24 ...

Apply it:

$kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
$kubectl get pods -n ${NAMESPACE} -w

The VllmDecodeWorker pod should restore from the ready checkpoint instead of creating a new one.

Choosing a DGD checkpoint flow

Enable checkpointing with checkpoint.enabled: true. The presence of checkpointRef selects the restore flow:

ConfigWhat happensUse when
enabled: true and no checkpointRefThe DGD creates and owns one automatic checkpoint per worker generation.Normal DGD-managed checkpointing.
enabled: true and checkpointRef: <name>The DGD restores from the named existing checkpoint and creates none.Explicit restore from a retained or pre-warmed checkpoint.

In v1beta1, replace the old mode: Auto form with:

1experimental:
2 checkpoint:
3 enabled: true

In v1alpha1, keep the existing enable flag and omit mode:

1checkpoint:
2 enabled: true

The old mode field is deprecated. Omit it in new configs.

startupPolicy controls when normal workers start relative to checkpoint readiness. Immediate starts workers cold while the checkpoint job runs; WaitForCheckpoint keeps replicas at zero until the checkpoint is ready.

deletionPolicy applies only to DGD-owned automatic checkpoints. Delete removes the checkpoint CR and artifact with the DGD; Retain keeps them so they can be used later with checkpointRef.

DGD-managed automatic checkpoints

Without checkpointRef, the DGD-managed path is used. For each checkpoint-enabled worker generation, the DGD controller creates a DGD-owned DynamoCheckpoint, and the checkpoint controller starts a checkpoint Job. Automatic checkpoints are not reused across DGDs, even when two manifests are identical.

The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash lets a scale down/up on the same worker generation use the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation.

Treat a restored pod template as a compatibility template for the same workload: once the checkpoint is ready, restore admission replaces the target container’s command and args with the restore placeholder, and the restored process resumes from the checkpointed state rather than newly supplied command-line or environment settings.

With startupPolicy: Immediate, existing Pods are not mutated or restarted just because the checkpoint became ready. Scale or roll the worker to create restored Pods after the checkpoint is ready.

For v1beta1 components, the automatic checkpoint config is:

1experimental:
2 checkpoint:
3 enabled: true
4 startupPolicy: Immediate # default; optional
5 deletionPolicy: Delete # default; use Retain to keep CR/artifact after DGD deletion

For v1alpha1 services, the automatic checkpoint config is:

1checkpoint:
2 enabled: true
3 startupPolicy: Immediate # default; optional
4 deletionPolicy: Delete # default; use Retain to keep CR/artifact after DGD deletion

Inside a DynamoGraphDeployment, it looks like this:

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: vllm-auto-demo
5spec:
6 services:
7 Frontend:
8 componentType: frontend
9 replicas: 1
10 extraPodSpec:
11 mainContainer:
12 image: registry.example.com/dynamo/vllm-runtime:1.0.0
13
14 VllmDecodeWorker:
15 componentType: worker
16 replicas: 1
17 checkpoint:
18 enabled: true
19 startupPolicy: Immediate
20 deletionPolicy: Delete
21 extraPodSpec:
22 mainContainer:
23 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
24 ...
25 ...

The legacy checkpoint.identity field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone DynamoCheckpoint workflows.

Useful inspection commands:

$kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
> -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'
$
$kubectl get dckpt -n ${NAMESPACE}

If you use the default Immediate policy and want to create restored pods after the checkpoint becomes ready, scale the worker:

$kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
> -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'

Failover Restore

Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see Shadow Engine Failover.

Lower-Level Testing With snapshotctl

It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level snapshotctl utility. However, the snapshot helm chart must be installed, with a running snapshot-agent DaemonSet in the namespace with the checkpoint PVC mounted.

snapshotctl is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see deploy/snapshot/cmd/snapshotctl/README.md.

Checkpoint from a worker pod manifest

$snapshotctl checkpoint \
> --manifest ./worker-pod.yaml \
> --container main \
> --namespace ${NAMESPACE}

The checkpoint manifest must be for a pod and use a placeholder image. --container names the workload container to checkpoint.

If you do not pass --checkpoint-id, snapshotctl generates one and prints it:

status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...

Restore from a worker pod manifest

$snapshotctl restore \
> --manifest ./worker-pod.yaml \
> --namespace ${NAMESPACE} \
> --checkpoint-id manual-snapshot-... \
> --containers main

This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.

Restore an existing pod in place

$snapshotctl restore \
> --pod existing-restore-target \
> --namespace ${NAMESPACE} \
> --checkpoint-id manual-snapshot-... \
> --containers main

This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.

Checkpoint IDs and Legacy Identity

status.checkpointID is the artifact ID used by the snapshot protocol and the directory name under checkpoint storage. For DGD-managed automatic checkpoints, this ID is scoped to a single DGD/component worker generation. It is not a compatibility claim across DGDs, and identical manifests are not treated as proof that a checkpoint can be reused safely.

The deprecated legacy spec.identity shape is still required on standalone v1alpha1 DynamoCheckpoint objects. When a standalone checkpoint does not already have status.checkpointID or the checkpoint-ID label, the operator computes the legacy 16-character SHA256 hash (64 bits) from these fields:

Legacy fieldRequiredAffects legacy hashExample
modelmeta-llama/Llama-3-8B
backendFrameworkvllm
dynamoVersion0.9.0, 1.0.0
tensorParallelSize1, 2, 4, 8
pipelineParallelSize1, 2
dtypefloat16, bfloat16, fp8
maxModelLen4096, 8192
extraParameterscustom key-value pairs

Fields that do not change the legacy hash include:

  • replica count
  • node placement (nodeSelector, affinity, tolerations)
  • resource requests/limits
  • logging or observability configuration

DGD-managed automatic checkpoints ignore legacy identity as a reuse boundary: omit identity; use checkpointRef to restore an existing checkpoint. The DGD controller creates a DGD-scoped checkpoint ID and only synthesizes legacy identity because the v1alpha1 DynamoCheckpoint API still requires it.

Checkpoint job launch behavior is inferred from the checkpoint Job pod template (GPU resources or DRA claims), not from legacy identity fields such as tensorParallelSize or pipelineParallelSize.

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is the operator-managed resource for checkpoint lifecycle.

Use it when you want:

  • pre-warmed checkpoints before any DynamoGraphDeployment exists
  • explicit lifecycle control independent from a DGD
  • a stable human-readable name that services can reference with checkpointRef

The operator requires:

  • spec.identity
  • spec.job.podTemplateSpec

spec.job.backoffLimit is deprecated and ignored. Checkpoint Jobs are always single-attempt.

Check status with:

$kubectl get dckpt -n ${NAMESPACE}
$kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
$kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml

The status block looks like:

1status:
2 phase: Ready
3 checkpointID: 3bff874d069f0ed5
4 identityHash: 3bff874d069f0ed5 # deprecated compatibility alias
5 jobName: checkpoint-job-3bff874d069f0ed5-1
6 createdAt: "2026-01-29T10:05:00Z"
7 message: ""

Limitations

  • Backend support is limited: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
  • Worker coverage is narrow: specialized workers such as multimodal, embedding, and diffusion are not supported.
  • Multi-GPU remains preview: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
  • GMS restore remains experimental: GMS + Snapshot is currently disabled.
  • Admission is create-only: with DGD startupPolicy: Immediate, only Pods created after a checkpoint is Ready are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
  • Restore admission must be installed: DGD restores rely on the snapshot Pod mutating webhook, so upgrade the snapshot chart/webhook configuration along with the operator and CRDs when enabling these features.
  • Network state is sensitive: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
  • Privileged DaemonSet required: snapshot-agent must run privileged to execute CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

Troubleshooting

Checkpoint Job finishes but the checkpoint never becomes Ready

Snapshot only becomes Ready after snapshot-agent confirms the checkpoint contents. A completed Job is not enough by itself.

$kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} \
> -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName
$
$JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
$if [ -n "${JOB_NAME}" ]; then
$ kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
$fi
$
$kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers

If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.

Restore cannot find or mount checkpoint storage

For the default agentMount install, restore discovers checkpoint storage from the snapshot-agent DaemonSet in the workload namespace. That DaemonSet must be ready and must mount the checkpoint PVC.

$kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
$kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc -n ${NAMESPACE}

For a shared-agent podMount install, the snapshot-agent DaemonSet can run in the infrastructure namespace instead. Verify the shared-agent pods there, then verify that the workload namespace has the checkpoint PVC that the operator created or validated:

$kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc snapshot-pvc -n ${NAMESPACE}

In podMount mode the agent reaches the checkpoint through the workload pod’s mount namespace rather than by mounting the PVC itself. Check the workload pod’s checkpoint storage annotations and the snapshot-agent logs to see the actual resolved checkpoint path. snapshotctl uses the chart’s storage resolution path, so for lower-level snapshotctl debugging make sure the snapshot chart configuration matches the access mode you are testing.

snapshotctl manifest is rejected or the restore target is wrong

snapshotctl requires a Pod manifest and a target-container list. Multi-container manifests are supported as long as every name passed via --container or --containers exists in the pod spec.

$snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
$snapshotctl restore --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>

If the manifest already carries snapshot target metadata, it must agree with the CLI flag; snapshotctl rejects mismatches instead of silently picking one.

Planned Features

  • Stable multi-GPU and multinode support
  • TensorRT-LLM support