> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Snapshotting GPU Workers

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in preview and may only be functional in some cluster setups. The `snapshot-agent` DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA's `cuda-checkpoint` utility. The usual flow is:

1. start a worker once and checkpoint its initialized state
2. store that checkpoint on a namespace-local snapshot volume
3. restore later workers from that checkpoint instead of cold-starting again

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from a ready checkpoint directory |

> ⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

For more background on the snapshot architecture and startup improvements, see
[NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes](https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/).

## Prerequisites

- x86_64 (`amd64`) GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
- vLLM or SGLang backend today
- Checkpoint storage. `ReadWriteMany` is the safest default for cross-node or
  concurrent multi-node access, but `podMount` mode can also use suitable
  `ReadWriteOnce` storage for sequential checkpoint/restore workflows.
- **CRI-O / OpenShift:** set `runtime.type=crio` on the snapshot chart (and `openshift.enabled=true` on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

## Quick Start via `DynamoCheckpoint` CR

1. Build a placeholder image
2. Install the snapshot chart
3. Create a `DynamoCheckpoint` and wait for it to become ready
4. Deploy a `DynamoGraphDeployment` that restores from the corresponding `checkpointRef`

### 1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:

```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0

cd deploy/snapshot

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

make docker-push-placeholder \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```

The placeholder image preserves the normal runtime entrypoint/command contract and adds the `criu`, `cuda-checkpoint`, and `nsrestore` tooling needed for checkpoint and restore.

To build either snapshot image against a custom CRIU fork or ref, pass
`CRIU_REPO` and `CRIU_REF` through `make`. If they are unset, the Dockerfile
defaults are used.

```bash
make docker-build-agent \
  IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
  CRIU_REPO="${YOUR_CRIU_REPO}" \
  CRIU_REF="branch-or-sha"

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
  CRIU_REPO="${YOUR_CRIU_REPO}" \
  CRIU_REF="branch-or-sha"
```

### 2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading `dynamo-platform`, the operator only needs checkpointing enabled:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
```

If the platform is already installed, verify that the operator config contains the checkpoint block:

```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
  -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
  -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')

kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
  -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```

Verify that the rendered config includes `enabled: true`.

### 3. Install the snapshot chart

For the default namespace-local mode, install the snapshot chart in each
workload namespace. The chart creates the PVC and the agent in that namespace:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

In the default `agentMount` mode, the snapshot-agent DaemonSet mounts the
checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on
multiple nodes may mount the same PVC, so the PVC generally needs
`ReadWriteMany`. The chart defaults to that mode. If your cluster does not have
a default storage class, also set `storage.pvc.storageClass`.

If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and set `storage.pvc.name` instead.

CRI-O or OpenShift: append for example `--set runtime.type=crio` and, on OpenShift, `--set openshift.enabled=true` (see `deploy/helm/charts/snapshot/README.md`).

For clusters that prefer one privileged snapshot agent instead of one DaemonSet
per workload namespace, install the chart once in an infrastructure namespace.
In this mode the chart does not create workload PVCs; the Dynamo operator either
creates each namespace-local PVC or verifies that it already exists:

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace dynamo-system \
  --create-namespace \
  --set storage.accessMode=podMount \
  --set storage.pvc.create=false \
  --set rbac.namespaceRestricted=false
```

To let the operator create the workload PVC in each namespace that uses
checkpoint/restore, configure the operator with `create: true`:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
        create: true
        size: 1Ti
        storageClassName: ""
        accessMode: ReadWriteMany
```

The chart and operator use separate configuration surfaces here: the snapshot
chart PVC name is `storage.pvc.name`, while the operator config field is
`checkpoint.storage.pvc.pvcName`.

This is a key difference from `agentMount`: `podMount` removes the requirement
that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node.
Only the active checkpoint/restore workload pod mounts the PVC, and the agent
reaches it through that pod's mount namespace. `ReadWriteMany` remains the
safest operator-managed default, especially when multiple checkpoint/restore
pods may access the same PVC concurrently or when restore scheduling can span
nodes. Suitable `ReadWriteOnce` storage classes can still be used for
sequential `podMount` checkpoint/restore flows when the backend can attach the
volume to the node running the active workload pod.

`podMount` depends on the target container remaining alive while the agent
resolves `/host/proc/<pid>/root/<basePath>`. If the container exits or restarts
during checkpoint/restore setup, if the runtime cannot expose a stable host PID,
or if node security settings prevent host proc traversal, the agent fails or
skips that attempt and Kubernetes/operator reconciliation must try again after a
fresh container is available.

To use an already-present PVC instead, omit `create` or set it to `false`. The
operator will fail reconciliation with a clear error if the named PVC does not
exist in the workload namespace.

Verify that the DaemonSet is ready. After a checkpoint or restore workload is
reconciled, verify the workload namespace PVC:

```bash
kubectl rollout status daemonset/snapshot-agent -n dynamo-system
kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
```

### 4. Create a standalone `DynamoCheckpoint`

The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the legacy `spec.identity` metadata, a container named `main`, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only `main` is checkpointed unless `spec.job.targetContainerName` selects another container.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-bf16
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 2048

  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        ...
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            ...
```

GMS + Snapshot support is currently disabled.

For a full working example, see [deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml).

Apply it:

```bash
kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}
```

### 5. Wait for the checkpoint to become ready

```bash
kubectl get dckpt -n ${NAMESPACE} \
  -o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase

kubectl wait \
  --for=jsonpath='{.status.phase}'=Ready \
  dynamocheckpoint/qwen3-06b-bf16 \
  -n ${NAMESPACE} \
  --timeout=30m
```

The useful status fields are:

- `status.phase`: high-level lifecycle (`Pending`, `Creating`, `Ready`, `Failed`)
- `status.checkpointID`: artifact ID used by the snapshot protocol
- `status.identityHash`: deprecated compatibility alias for `status.checkpointID`
- `status.jobName`: checkpoint Job name
- `status.createdAt`: timestamp recorded when the checkpoint became ready
- `status.message`: progress or failure detail when available

### 6. Deploy a `DynamoGraphDeployment` that restores from `checkpointRef`

Once the checkpoint is `Ready`, restore a worker from it explicitly:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-checkpointref-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        checkpointRef: qwen3-06b-bf16
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
        ...
```

Apply it:

```bash
kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -w
```

The `VllmDecodeWorker` pod should restore from the ready checkpoint instead of creating a new one.

## DGD Auto Flow

`checkpointRef` is the most explicit path. If you set it, the DGD uses that
existing `DynamoCheckpoint` and does not create a new automatic checkpoint for
the component. This is the escape hatch for users who intentionally want to
reuse a pre-existing checkpoint and accept the compatibility risk.

Without `checkpointRef`, `mode: Auto` is the DGD-managed path: for each
checkpoint-enabled worker generation, the DGD controller creates a DGD-scoped
`DynamoCheckpoint` and the checkpoint controller starts a checkpoint Job.
Automatic DGD checkpoints are not reused across DGDs, even when two manifests
are identical.

The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash keeps a scale down/up on the same worker generation using the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation.

By default, `startupPolicy: Immediate` starts workers cold while the checkpoint job runs in the background. Once the checkpoint becomes `Ready`, only newly-created Pods restore from it. Existing Pods are not mutated or restarted just because the checkpoint became ready.

If you want workers to wait for the checkpoint before starting, set `startupPolicy: WaitForCheckpoint`. That policy keeps normal worker replicas at zero until the checkpoint is `Ready`, then starts workers from the checkpoint.

```yaml
checkpoint:
  enabled: true
  mode: Auto
  startupPolicy: Immediate # default; optional
```

Inside a `DynamoGraphDeployment`, it looks like this:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-auto-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        mode: Auto
        startupPolicy: Immediate
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
        ...
```

The legacy `checkpoint.identity` field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone `DynamoCheckpoint` workflows.

Useful inspection commands:

```bash
kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'

kubectl get dckpt -n ${NAMESPACE}
```

If you use the default `Immediate` policy and want to create restored pods after the checkpoint becomes ready, scale the worker:

```bash
kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```

## Failover Restore

Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover).

## Lower-Level Testing With `snapshotctl`

It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level `snapshotctl` utility. However, the snapshot helm chart must be installed, with a running `snapshot-agent` DaemonSet in the namespace with the checkpoint PVC mounted.

`snapshotctl` is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see [deploy/snapshot/cmd/snapshotctl/README.md](../../deploy/snapshot/cmd/snapshotctl/README.md).

### Checkpoint from a worker pod manifest

```bash
snapshotctl checkpoint \
  --manifest ./worker-pod.yaml \
  --container main \
  --namespace ${NAMESPACE}
```

The checkpoint manifest must be for a pod and use a placeholder image. `--container` names the workload container to checkpoint.

If you do not pass `--checkpoint-id`, `snapshotctl` generates one and prints it:

```text
status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...
```

### Restore from a worker pod manifest

```bash
snapshotctl restore \
  --manifest ./worker-pod.yaml \
  --namespace ${NAMESPACE} \
  --checkpoint-id manual-snapshot-... \
  --containers main
```

This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.

### Restore an existing pod in place

```bash
snapshotctl restore \
  --pod existing-restore-target \
  --namespace ${NAMESPACE} \
  --checkpoint-id manual-snapshot-... \
  --containers main
```

This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.

## Checkpoint IDs and Legacy Identity

`status.checkpointID` is the artifact ID used by the snapshot protocol and the
directory name under checkpoint storage. For DGD-managed automatic checkpoints,
this ID is scoped to a single DGD/component worker generation. It is not a
compatibility claim across DGDs, and identical manifests are not treated as
proof that a checkpoint can be reused safely.

The legacy `spec.identity` shape is still required on standalone
`DynamoCheckpoint` objects and remains the fallback for explicit/manual
workflows. When a standalone checkpoint does not already have
`status.checkpointID` or the checkpoint-ID label, the operator computes the
legacy **16-character SHA256 hash** (64 bits) from these fields:

| Legacy field | Required | Affects legacy hash | Example |
|--------------|----------|---------------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` |
| `pipelineParallelSize` | | ✓ | `1`, `2` |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | custom key-value pairs |

Fields that do **not** change the legacy hash include:

- replica count
- node placement (`nodeSelector`, `affinity`, `tolerations`)
- resource requests/limits
- logging or observability configuration

DGD-managed automatic checkpoints ignore this legacy identity as a reuse
boundary. The DGD controller creates its own DGD-scoped checkpoint ID and
synthesizes a legacy identity only because the v1alpha1 `DynamoCheckpoint` API
still requires the field.

## `DynamoCheckpoint` CRD

The `DynamoCheckpoint` (shortname: `dckpt`) is the operator-managed resource for checkpoint lifecycle.

Use it when you want:

- pre-warmed checkpoints before any `DynamoGraphDeployment` exists
- explicit lifecycle control independent from a DGD
- a stable human-readable name that services can reference with `checkpointRef`

The operator requires:

- `spec.identity`
- `spec.job.podTemplateSpec`

`spec.job.backoffLimit` is deprecated and ignored. Checkpoint Jobs are always single-attempt.

Check status with:

```bash
kubectl get dckpt -n ${NAMESPACE}
kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml
```

The `status` block looks like:

```yaml
status:
  phase: Ready
  checkpointID: 3bff874d069f0ed5
  identityHash: 3bff874d069f0ed5 # deprecated compatibility alias
  jobName: checkpoint-job-3bff874d069f0ed5-1
  createdAt: "2026-01-29T10:05:00Z"
  message: ""
```

## Limitations

- **Backend support is limited**: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
- **Worker coverage is narrow**: specialized workers such as multimodal, embedding, and diffusion are not supported.
- **Multi-GPU remains preview**: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
- **GMS restore remains experimental**: GMS + Snapshot is currently disabled.
- **Admission is create-only**: with DGD `startupPolicy: Immediate`, only Pods created after a checkpoint is `Ready` are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
- **Network state is sensitive**: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
- **Privileged DaemonSet required**: `snapshot-agent` must run privileged to execute CRIU and `cuda-checkpoint`. Workload pods do not need to be privileged.

## Troubleshooting

### Checkpoint Job finishes but the checkpoint never becomes `Ready`

Snapshot only becomes `Ready` after `snapshot-agent` confirms the checkpoint contents. A completed Job is not enough by itself.

```bash
kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName

JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
if [ -n "${JOB_NAME}" ]; then
  kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
fi

kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
```

If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.

### Restore cannot find or mount checkpoint storage

For the default `agentMount` install, restore discovers checkpoint storage from
the `snapshot-agent` DaemonSet in the workload namespace. That DaemonSet must be
ready and must mount the checkpoint PVC.

```bash
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc -n ${NAMESPACE}
```

For a shared-agent `podMount` install, the `snapshot-agent` DaemonSet can run in
the infrastructure namespace instead. Verify the shared-agent pods there, then
verify that the workload namespace has the checkpoint PVC that the operator
created or validated:

```bash
kubectl rollout status daemonset/snapshot-agent -n dynamo-system
kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
```

In `podMount` mode the agent reaches the checkpoint through the workload pod's
mount namespace rather than by mounting the PVC itself. Check the workload pod's
checkpoint storage annotations and the `snapshot-agent` logs to see the actual
resolved checkpoint path. `snapshotctl` uses the chart's storage resolution
path, so for lower-level `snapshotctl` debugging make sure the snapshot chart
configuration matches the access mode you are testing.

### `snapshotctl` manifest is rejected or the restore target is wrong

`snapshotctl` requires a `Pod` manifest and a target-container list. Multi-container manifests are supported as long as every name passed via `--container` or `--containers` exists in the pod spec.

```bash
snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
snapshotctl restore  --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>
```

If the manifest already carries snapshot target metadata, it must agree with the CLI flag; `snapshotctl` rejects mismatches instead of silently picking one.

## Planned Features

- Stable multi-GPU and multinode support
- TensorRT-LLM support

## Related Documentation

- [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)
- [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover)
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s)