---
title: Snapshot
---

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in **preview** and may work only in some Kubernetes cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** is an experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore In Userspace) and NVIDIA's cuda-checkpoint utility. Dynamo Snapshot can reduce cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on demand.

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from checkpoint tar |

> ⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)

## Prerequisites

- Dynamo Platform/Operator installed on a Kubernetes cluster with **x86_64 (amd64)** GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- `ReadWriteMany` storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- A cluster security policy that permits running a privileged DaemonSet

## Quick Start

This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.

### 1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling.
If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:

```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0

cd deploy/snapshot

make docker-build-placeholder \
  PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

make docker-push-placeholder \
  PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```

This flow is defined in [deploy/snapshot/Makefile](../../deploy/snapshot/Makefile) and [deploy/snapshot/Dockerfile](../../deploy/snapshot/Dockerfile). The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, `cuda-checkpoint`, and `nsrestore` tooling needed for restore.

### 2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading `dynamo-platform`, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

If the platform is already installed, verify that the operator config contains the checkpoint block:

```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
  -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
  -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')

kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
  -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```

Verify that the rendered config includes `enabled: true` and the same PVC name and base path you plan to use for the snapshot chart.
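
The manual check in step 2 can also be scripted. The following is a minimal sketch, not part of the Dynamo tooling: `check_checkpoint_block` is a hypothetical helper that reads the operator's `config.yaml` on stdin. It assumes the rendered config uses the same key names as the Helm values in step 2 (`enabled`, `pvcName`, `basePath`) and that the values are unquoted; adjust the patterns if your rendered config differs.

```bash
# Hypothetical helper: reads a rendered operator config.yaml on stdin and
# compares the top-level "checkpoint:" block with the expected PVC settings.
# Assumption: the rendered keys match the Helm values (enabled, pvcName, basePath).
check_checkpoint_block() {
  local expected_pvc="$1"
  local expected_path="$2"
  local block

  # Extract from "checkpoint:" through the next top-level key, then trim
  # leading and trailing whitespace so each line can be matched exactly.
  block=$(sed -n '/^checkpoint:/,/^[^[:space:]]/p' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')

  printf '%s\n' "${block}" | grep -Fxq 'enabled: true' \
    || { echo "checkpointing is not enabled" >&2; return 1; }
  printf '%s\n' "${block}" | grep -Fxq "pvcName: ${expected_pvc}" \
    || { echo "pvcName does not match ${expected_pvc}" >&2; return 1; }
  printf '%s\n' "${block}" | grep -Fxq "basePath: ${expected_path}" \
    || { echo "basePath does not match ${expected_path}" >&2; return 1; }

  echo "checkpoint config OK"
}

# Usage (pipe the ConfigMap content from step 2 into the helper):
#   kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
#     -o jsonpath='{.data.config\.yaml}' | check_checkpoint_block snapshot-pvc /checkpoints
```

The helper returns a non-zero status and prints the first mismatch to stderr, so it can gate a deployment script before the snapshot chart is installed.
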

For the full platform/operator configuration surface, see [deploy/helm/charts/platform/README.md](../../deploy/helm/charts/platform/README.md) and [deploy/helm/charts/platform/components/operator/values.yaml](../../deploy/helm/charts/platform/components/operator/values.yaml).

### 3. Install the snapshot chart

```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set storage.pvc.create=true
```

Cross-node restore requires `ReadWriteMany` storage. The chart defaults to that mode. For better restore times, use a fast `ReadWriteMany` StorageClass for the checkpoint PVC.

If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and point `storage.pvc.name` at the existing PVC instead.

Verify that the PVC and DaemonSet are ready:

```bash
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
```

For the full snapshot chart configuration surface, see [deploy/helm/charts/snapshot/README.md](../../deploy/helm/charts/snapshot/README.md) and [deploy/helm/charts/snapshot/values.yaml](../../deploy/helm/charts/snapshot/values.yaml).

### 4. Apply a snapshot-compatible `DynamoGraphDeployment`

This example is adapted from [examples/backends/vllm/deploy/agg.yaml](../../examples/backends/vllm/deploy/agg.yaml). The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-snapshot-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      readinessProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 1
        timeoutSeconds: 4
        failureThreshold: 3
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disable-custom-all-reduce
          env:
            - name: GLOO_SOCKET_IFNAME
              value: lo
            - name: NCCL_SOCKET_IFNAME
              value: lo
            - name: NCCL_DEBUG
              value: ERROR
            - name: TORCH_CPP_LOG_LEVEL
              value: ERROR
            - name: TORCH_DISTRIBUTED_DEBUG
              value: "OFF"
            - name: CUDA_ERROR_LEVEL
              value: "10"
            - name: NCCL_CUMEM_ENABLE
              value: "0"
            - name: NCCL_CUMEM_HOST_ENABLE
              value: "0"
            - name: NCCL_NVLS_ENABLE
              value: "0"
            - name: NCCL_P2P_DISABLE
              value: "0"
            - name: NCCL_SHM_DISABLE
              value: "1"
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: TORCH_NCCL_ENABLE_MONITORING
              value: "0"
```

For SGLang, use `dynamo.sglang`, an SGLang placeholder image, `backendFramework: sglang`, and the matching CLI flags.

Apply the manifest:

```bash
kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}
```

On the first rollout, the worker cold-starts, the operator creates a `DynamoCheckpoint`, and the checkpoint Job writes data into `snapshot-pvc`.

### 5. Wait for the checkpoint to become ready

Capture the checkpoint name from DGD status, then wait for the `DynamoCheckpoint` phase to become `Ready`:

```bash
CHECKPOINT_NAME=$(kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}')

kubectl wait \
  --for=jsonpath='{.status.phase}'=Ready \
  "dynamocheckpoint/${CHECKPOINT_NAME}" \
  -n ${NAMESPACE} \
  --timeout=30m
```

The DGD status also reports the computed checkpoint hash at `.status.checkpoints.VllmDecodeWorker.identityHash`.

### 6. Trigger restore

Once the checkpoint is ready, scale the worker replicas from `1` to `2`:

```bash
kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```

New worker pods for `VllmDecodeWorker` will restore from the ready checkpoint automatically.

## Checkpoint Configuration

### Auto Mode (Recommended)

The operator computes the checkpoint identity hash, looks for an existing `DynamoCheckpoint` with a matching `nvidia.com/snapshot-checkpoint-hash` label, and creates one if it does not find one:

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```

When a service uses checkpointing, DGD status reports the resolved `checkpointName`, `identityHash`, and `ready` fields under `.status.checkpoints.`.

### Manual Management and `checkpointRef`

Use `checkpointRef` when you want a service to restore from a specific `DynamoCheckpoint` CR:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "qwen3-06b-vllm-prewarm"
```

This is useful when:

- You want to **pre-warm checkpoints** before creating DGDs
- You want **explicit control** over which checkpoint to use

`checkpointRef` resolves by `DynamoCheckpoint.metadata.name`, not by `status.identityHash`. A manual checkpoint can use any valid Kubernetes resource name.

If you are managing checkpoint CRs yourself, set `mode: Manual` on the service to prevent the operator from creating a new `DynamoCheckpoint` when identity-based lookup does not find one.

```bash
# Check checkpoint status by CR name
kubectl get dynamocheckpoint qwen3-06b-vllm-prewarm -n ${NAMESPACE}

# Now create DGD referencing it
kubectl apply -f my-dgd.yaml -n ${NAMESPACE}
```

If you want `mode: Auto` DGDs to discover a manually created checkpoint by identity, add the label `nvidia.com/snapshot-checkpoint-hash=` to that `DynamoCheckpoint`. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.

## Checkpoint Identity

Checkpoints are uniquely identified by a **16-character SHA256 hash** (64 bits) of configuration that affects runtime state:

| Field | Required | Affects Hash | Example |
|-------|----------|--------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `sglang`, `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` (default: 1) |
| `pipelineParallelSize` | | ✓ | `1`, `2` (default: 1) |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | Custom key-value pairs |

**Not included in hash** (don't invalidate checkpoint):

- `replicas`
- `nodeSelector`, `affinity`, `tolerations`
- `resources` (requests/limits)
- Logging/observability config

**Example with all fields:**

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "1.0.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

## DynamoCheckpoint CRD

The `DynamoCheckpoint` (shortname: `dckpt`) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

**When to create a DynamoCheckpoint directly:**

- **Pre-warming:** Create checkpoints before deploying DGDs for instant startup
- **Explicit control:** Manage checkpoint lifecycle independently from DGDs

The operator requires `spec.identity` and `spec.job.podTemplateSpec`. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.

**Create a checkpoint:**

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-vllm-prewarm
  labels:
    nvidia.com/snapshot-checkpoint-hash: "e5962d34ba272638"  # Add this if Auto-mode identity lookup should find the CR
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 4096
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    ttlSecondsAfterFinished: 300
    podTemplateSpec:
      spec:
        restartPolicy: Never
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            command:
              - python3
              - -m
              - dynamo.vllm
            args:
              - --model
              - Qwen/Qwen3-0.6B
              - --disable-custom-all-reduce
            env:
              - name: GLOO_SOCKET_IFNAME
                value: lo
              - name: NCCL_SOCKET_IFNAME
                value: lo
            resources:
              limits:
                nvidia.com/gpu: "1"
```

You can name the CR however you want if you plan to use `checkpointRef`. If you want `mode: Auto` identity lookup to find a manual CR, set the `nvidia.com/snapshot-checkpoint-hash` label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.
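
To make the idea of a 16-character identity hash concrete, here is a minimal sketch that truncates a SHA-256 digest to 16 hex characters (64 bits). It is illustrative only: `short_hash` is a hypothetical helper, taking the first 16 characters is an assumption, and the fields, ordering, and serialization the operator actually hashes are not described here, so this will not reproduce the operator's value. For a real label, read `status.identityHash` from the DGD or `DynamoCheckpoint` status instead.

```bash
# Hypothetical helper: print the first 16 hex characters (64 bits) of the
# SHA-256 digest of its argument. Requires sha256sum (GNU coreutils).
short_hash() {
  printf '%s' "$1" | sha256sum | cut -c1-16
}

# Example input only. The operator's real serialization of the identity
# fields is NOT known here, so the result will differ from status.identityHash.
short_hash 'model=Qwen/Qwen3-0.6B;backendFramework=vllm'
```

Any change to the hashed input produces a different 16-character value, which is why changing a field such as `dtype` or `maxModelLen` leads to a new checkpoint rather than reuse of an existing one.
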

**Check status:**

```bash
# List all checkpoints
kubectl get dynamocheckpoint -n ${NAMESPACE}

# Or use shortname
kubectl get dckpt -n ${NAMESPACE}
NAME                     MODEL                   BACKEND   PHASE      HASH               AGE
qwen3-06b-vllm-prewarm   Qwen/Qwen3-0.6B         vllm      Ready      e5962d34ba272638   5m
llama3-8b-vllm-prewarm   meta-llama/Llama-3-8B   vllm      Creating   7ab4f89c12de3456   2m
```

**Phases:**

| Phase | Description |
|-------|-------------|
| `Pending` | CR created, waiting for job to start |
| `Creating` | Checkpoint job is running |
| `Ready` | Checkpoint available for use |
| `Failed` | Checkpoint creation failed |

`Ready` is a value in `status.phase`, not a Kubernetes condition. The `conditions` array tracks job lifecycle events:

| Condition Type | Meaning |
|----------------|---------|
| `JobCreated` | The checkpoint Job has been created |
| `JobCompleted` | The checkpoint Job has completed successfully or failed |

Other useful status fields are:

| Field | Meaning |
|-------|---------|
| `status.jobName` | Name of the checkpoint Job |
| `status.identityHash` | Computed 16-character hash for the checkpoint identity |
| `status.location` | Checkpoint location in the configured storage backend |
| `status.storageType` | Storage backend type (`pvc`, `s3`, or `oci`) |
| `status.createdAt` | Timestamp recorded when the checkpoint becomes ready |
| `status.message` | Failure or progress message when available |

**Detailed status:**

```bash
kubectl describe dckpt qwen3-06b-vllm-prewarm -n ${NAMESPACE}
```

```yaml
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  JobName: checkpoint-qwen3-06b-vllm-prewarm
  Location: /checkpoints/e5962d34ba272638.tar
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z
  Conditions:
    - Type: JobCreated
      Status: "True"
      Reason: JobCreated
    - Type: JobCompleted
      Status: "True"
      Reason: JobSucceeded
```

**Reference from DGD:**

Once the checkpoint is `Ready`, you can reference it by CR name:

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
        checkpointRef: "qwen3-06b-vllm-prewarm"
```

Or use `mode: Auto` with the same identity and snapshot-hash label, and the operator will reuse it automatically.

## Limitations

- **LLM workers only**: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- **Single-GPU only**: Multi-GPU configurations may work in very basic hardware configurations, but are not officially supported yet.
- **Network state**: Active TCP connections cannot be checkpointed.
- **Security**: Dynamo Snapshot runs a **privileged DaemonSet**, which is required to run CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

## Troubleshooting

### Checkpoint Not Ready

1. Check the checkpoint job:
   ```bash
   kubectl get dckpt -n ${NAMESPACE}
   kubectl describe dckpt -n ${NAMESPACE}
   kubectl logs job/$(kubectl get dckpt -n ${NAMESPACE} -o jsonpath='{.status.jobName}') -n ${NAMESPACE}
   ```

2. Check the DaemonSet:
   ```bash
   kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
   ```

3. Verify that platform and chart storage settings match:
   ```bash
   kubectl get dckpt -n ${NAMESPACE} -o yaml
   ```

### Restore Failing

1. Check pod logs:
   ```bash
   kubectl logs -n ${NAMESPACE}
   ```

2. Describe the restore target pod:
   ```bash
   kubectl describe pod -n ${NAMESPACE}
   ```

3. Confirm the referenced checkpoint is still `Ready`:
   ```bash
   kubectl get dckpt -n ${NAMESPACE}
   ```

## Planned Features

- TensorRT-LLM backend support
- S3/MinIO storage backend
- OCI registry storage backend
- Multi-GPU checkpoints

## Related Documentation

- [Dynamo Snapshot Helm Chart README](../../deploy/helm/charts/snapshot/README.md) - Chart configuration
- [Installation Guide](/dynamo/kubernetes-deployment/deployment-guide/detailed-installation-guide) - Platform installation
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s) - Complete CRD specifications