Model Caching with Fluid
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
Key Features
- Data Caching and Acceleration: Cache remote data close to compute workloads for faster access.
- Unified Data Access: Access data from S3, HDFS, NFS, and more through a single interface.
- Kubernetes Native: Integrates with Kubernetes using CRDs for data management.
- Scalability: Supports large-scale data and compute clusters.
Installation
You can install Fluid on any Kubernetes cluster using Helm.
Prerequisites:
- Kubernetes >= 1.18
- kubectl >= 1.18
- Helm >= 3.5
Quick Install:
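A minimal install sketch using the Helm chart published by the Fluid project (pin a chart version for production installs):

```bash
# Add the Fluid Helm repository and install into the fluid-system namespace.
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
helm install fluid fluid/fluid --namespace fluid-system --create-namespace

# Verify that the Fluid controllers are running before creating Datasets.
kubectl get pods -n fluid-system
```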
For advanced configuration, see the Fluid Installation Guide.
Pre-deployment Steps
- Install Fluid (see Installation).
- Create a Dataset and Runtime (see the following example).
- Mount the resulting PVC in your workload (a quick verification sketch follows this list).
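Before wiring the PVC into a workload, it is worth confirming that the Dataset is bound and that the PVC exists. A quick check, using the webufs-model Dataset from the example below:

```bash
# The Dataset should eventually report a Bound phase.
kubectl get dataset webufs-model

# Fluid creates a PVC with the same name as the Dataset.
kubectl get pvc webufs-model
```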
Mounting Data Sources
WebUFS Example
WebUFS allows mounting HTTP/HTTPS sources as filesystems.
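A minimal sketch of a WebUFS-backed Dataset and its AlluxioRuntime; the HTTP URL, cache medium, and quota are placeholders to adapt to your model:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: webufs-model
spec:
  mounts:
    # Hypothetical HTTP(S) directory that serves the model files.
    - mountPoint: https://example.com/models/my-model/
      name: model
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: webufs-model   # must match the Dataset name
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi      # size the cache to hold the full model
        high: "0.95"
        low: "0.7"
```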
After applying these manifests, Fluid creates a PersistentVolumeClaim (PVC) named webufs-model that exposes the mounted files.
S3 Example
Mount an S3 bucket as a Fluid Dataset.
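A minimal sketch with illustrative bucket, endpoint, and Secret names; the endpoint-related options are only needed for S3-compatible stores such as MinIO and can be dropped for AWS S3:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
stringData:
  aws.accessKeyId: <access-key>
  aws.secretKey: <secret-key>
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-model
spec:
  mounts:
    - mountPoint: s3://my-model-bucket/models/   # illustrative bucket and prefix
      name: model
      options:
        alluxio.underfs.s3.endpoint: http://minio.minio.svc.cluster.local:9000   # MinIO only
        alluxio.underfs.s3.disable.dns.buckets: "true"                           # MinIO only
      encryptOptions:
        - name: aws.accessKeyId
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.accessKeyId
        - name: aws.secretKey
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.secretKey
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-model       # must match the Dataset name
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/fluid/cache
        quota: 50Gi
        high: "0.95"
        low: "0.7"
```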
The resulting PVC is named s3-model.
Using HuggingFace Models with Fluid
Limitations:
- HuggingFace models are not exposed as simple filesystems or buckets.
- No native integration exists between Fluid and the HuggingFace Hub API.
Workaround: Download and Upload to S3/MinIO
- Download the model using the HuggingFace CLI or SDK.
- Upload the model files to a supported storage backend (S3, GCS, NFS).
- Mount that backend using Fluid.
Example Pod to Download and Upload:
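A sketch of such a Pod, assuming the huggingface_hub CLI and the AWS CLI, a MinIO/S3 endpoint reachable in-cluster, and Secrets holding the HuggingFace token and S3 credentials (image, Secret names, and endpoint are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hf-model-uploader
spec:
  restartPolicy: Never
  containers:
    - name: uploader
      image: python:3.11-slim            # any image with Python/pip works
      command: ["/bin/sh", "-c"]
      args:
        - |
          pip install --quiet "huggingface_hub[cli]" awscli
          # Back /tmp/model with an emptyDir or scratch PVC for large models.
          huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --local-dir /tmp/model
          # Drop --endpoint-url when uploading to AWS S3 instead of MinIO.
          aws s3 cp /tmp/model \
            s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --recursive --endpoint-url "$S3_ENDPOINT"
      env:
        - name: HF_TOKEN                 # needed for gated models
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        - name: S3_ENDPOINT
          value: http://minio.minio.svc.cluster.local:9000
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.accessKeyId
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.secretKey
```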
You can then use s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B as your Dataset mount.
Usage with Dynamo
Mount the Fluid-generated PVC in your DynamoGraphDeployment:
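A trimmed sketch of the relevant part; the service name, mount path, and the exact pvc field layout depend on your Dynamo version, so treat the field names below as illustrative rather than a verbatim schema:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-deployment
spec:
  services:
    VllmWorker:                 # illustrative service name
      pvc:
        create: false           # reuse the PVC that Fluid generated
        name: s3-model          # Fluid PVC name == Dataset name
        mountPoint: /mnt/model  # worker reads weights from this path
```

Whatever the exact field names, the point is that the worker loads weights from the Fluid-backed mount instead of downloading them from the Hub at startup.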
Full example with llama3.3 70B
Performance
When deploying LLaMA 3.3 70B using Fluid as the caching layer, we observed the best performance by configuring a single-node cache that holds 100% of the model files locally. By ensuring that the vllm worker pod is scheduled on the same node as the Fluid cache, we were able to eliminate network I/O bottlenecks, which resulted in the fastest model startup time and the highest inference efficiency during our tests.
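To reproduce that co-location, the worker pod can declare pod affinity against Fluid's Alluxio cache worker pods. A sketch of the affinity stanza, assuming the labels Fluid typically applies to Alluxio workers (verify with kubectl get pods --show-labels) and placed wherever your DynamoGraphDeployment exposes the worker pod spec:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            role: alluxio-worker   # Fluid's Alluxio cache worker pods
            release: s3-model      # the Dataset/Runtime name
        topologyKey: kubernetes.io/hostname   # co-schedule on the same node
```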
Resources
- The full llama3.3 70B example, including the associated DynamoGraphDeployment with pod affinity to schedule the vllm worker on the same node as the Alluxio cache worker.
Troubleshooting & FAQ
- PVC not created? Check the Fluid controller and AlluxioRuntime pod logs (see the commands sketched after this list).
- Model not found? Ensure the model was uploaded to the correct bucket/path.
- Permission errors? Verify S3/MinIO credentials and bucket policies.
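A few commands that usually narrow these down, assuming the s3-model Dataset from the examples above and the default fluid-system install namespace (controller deployment names can differ slightly across Fluid versions):

```bash
# Dataset and runtime status: phase, cache usage, whether the PVC was bound.
kubectl get dataset s3-model
kubectl describe alluxioruntime s3-model

# Fluid controller logs.
kubectl logs -n fluid-system deploy/dataset-controller
kubectl logs -n fluid-system deploy/alluxioruntime-controller

# Logs from the Alluxio cache itself.
kubectl logs s3-model-master-0 -c alluxio-master
kubectl logs -l release=s3-model,role=alluxio-worker -c alluxio-worker
```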