
Model Caching with Fluid


Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.

Key Features

  • Data Caching and Acceleration: Cache remote data close to compute workloads for faster access.
  • Unified Data Access: Access data from S3, HDFS, NFS, and more through a single interface.
  • Kubernetes Native: Integrates with Kubernetes using CRDs for data management.
  • Scalability: Supports large-scale data and compute clusters.

Installation

You can install Fluid on any Kubernetes cluster using Helm.

Prerequisites:

  • Kubernetes >= 1.18
  • kubectl >= 1.18
  • Helm >= 3.5

Quick Install:

kubectl create ns fluid-system
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
helm install fluid fluid/fluid -n fluid-system

For advanced configuration, see the Fluid Installation Guide.
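To confirm the installation, you can check that the Fluid control-plane pods are running and that the Fluid CRDs are registered (standard kubectl commands; exact pod names vary by chart version):

# Fluid control-plane pods should be Running
kubectl get pods -n fluid-system

# Fluid CRDs (Dataset, AlluxioRuntime, DataLoad, ...) should be listed
kubectl get crd | grep fluid.io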

Pre-deployment Steps

  1. Install Fluid (see Installation).
  2. Create a Dataset and Runtime (see the following examples).
  3. Mount the resulting PVC in your workload (a short command sketch follows this list).
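As a rough sketch, assuming you save the Dataset and Runtime manifests from the next section as dataset.yaml and runtime.yaml (placeholder file names), the flow looks like this:

# Apply the Dataset and its Runtime
kubectl apply -f dataset.yaml -f runtime.yaml

# Fluid creates a PVC with the same name as the Dataset; mount it like any other PVC
kubectl get pvc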

Mounting Data Sources

WebUFS Example

WebUFS allows mounting HTTP/HTTPS sources as filesystems.

# Mount a public HTTP directory as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: webufs-model
spec:
  mounts:
    - mountPoint: https://myhost.org/path_to_my_model # Replace with your HTTP source
      name: webufs-model
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: webufs-model
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"

After applying these resources, Fluid creates a PersistentVolumeClaim (PVC) named webufs-model that exposes the mounted files.
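You can verify that the Dataset is bound and that the PVC exists before mounting it (kubectl against the Fluid CRDs; output columns may differ between Fluid versions):

# The Dataset should eventually report a Bound phase
kubectl get dataset webufs-model

# Fluid provisions a PVC with the same name as the Dataset
kubectl get pvc webufs-model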

S3 Example

Mount an S3 bucket as a Fluid Dataset.

# Mount an S3 bucket as a Fluid Dataset
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-model
spec:
  mounts:
    - mountPoint: s3://<your-bucket> # Replace with your bucket name
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000 # S3 endpoint (e.g., MinIO)
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "<your-secret>"
        aws.accessKeyId: "<your-access-key>"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-model
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 1Gi
        high: "0.95"
        low: "0.7"
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: s3-model-loader
spec:
  dataset:
    name: s3-model
    namespace: <your-namespace> # Replace with your namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1

The resulting PVC is named s3-model.
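The DataLoad warms the cache asynchronously; you can watch its progress and the cache ratio reported on the Dataset with standard kubectl commands against the Fluid CRDs:

# Watch the preload job until it reports Complete
kubectl get dataload s3-model-loader -w

# The Dataset status shows how much of the data is cached
kubectl get dataset s3-model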

Using HuggingFace Models with Fluid

Limitations:

  • HuggingFace models are not exposed as simple filesystems or buckets.
  • No native integration exists between Fluid and the HuggingFace Hub API.

Workaround: Download and Upload to S3/MinIO

  1. Download the model using the HuggingFace CLI or SDK.
  2. Upload the model files to a supported storage backend (S3, GCS, NFS).
  3. Mount that backend using Fluid.

Example Pod to Download and Upload:

apiVersion: v1
kind: Pod
metadata:
  name: download-hf-to-minio
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: python:3.10-slim
      command: ["sh", "-c"]
      args:
        - |
          set -eux
          pip install --no-cache-dir huggingface_hub awscli
          BUCKET_NAME=hf-models
          ENDPOINT_URL=http://minio:9000
          MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
          LOCAL_DIR=/tmp/model
          if ! aws --endpoint-url $ENDPOINT_URL s3 ls "s3://$BUCKET_NAME" > /dev/null 2>&1; then
            aws --endpoint-url $ENDPOINT_URL s3 mb "s3://$BUCKET_NAME"
          fi
          huggingface-cli download $MODEL_NAME --local-dir $LOCAL_DIR --local-dir-use-symlinks False
          aws --endpoint-url $ENDPOINT_URL s3 cp $LOCAL_DIR s3://$BUCKET_NAME/$MODEL_NAME --recursive
      env:
        - name: AWS_ACCESS_KEY_ID
          value: "<your-access-key>"
        - name: AWS_SECRET_ACCESS_KEY
          value: "<your-secret>"
      volumeMounts:
        - name: tmp-volume
          mountPath: /tmp/model
  volumes:
    - name: tmp-volume
      emptyDir: {}
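To run it, apply the manifest (saved here under the placeholder name download-hf-to-minio.yaml), follow the logs, and confirm the files landed in the bucket; the aws invocation below mirrors the one inside the Pod and uses the same endpoint and credential placeholders:

# Run the one-shot downloader Pod and follow its progress
kubectl apply -f download-hf-to-minio.yaml
kubectl logs -f pod/download-hf-to-minio

# List the uploaded model files in MinIO
aws --endpoint-url http://minio:9000 s3 ls s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/ --recursive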

You can then use s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B as your Dataset mount.

Usage with Dynamo

Mount the Fluid-generated PVC in your DynamoGraphDeployment:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: model-caching
spec:
  pvcs:
    - name: s3-model
  envs:
    - name: HF_HOME
      value: /model
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", ...}}'
  services:
    VllmWorker:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
    Processor:
      volumeMounts:
        - name: s3-model
          mountPoint: /model
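Once the deployment is running, you can sanity-check the mount from inside a worker pod; replace <vllmworker-pod> with the name of one of your running VllmWorker pods:

# Confirm that the cached model files are visible at /model
kubectl exec -it <vllmworker-pod> -- ls -lh /model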

Full example with Llama 3.3 70B

Performance

When deploying Llama 3.3 70B with Fluid as the caching layer, we observed the best performance with a single-node cache that holds 100% of the model files locally. Scheduling the vLLM worker pod on the same node as the Fluid cache eliminates network I/O between the cache and the worker, which gave the fastest model startup time and the highest inference efficiency in our tests.

Cache Configuration                        | vLLM Pod Placement | Startup Time
❌ No Cache (Download from HuggingFace)    | N/A                | ~9 minutes
🟡 Multi-Node Cache (100% Model Cached)    | Not on Cache Node  | ~18 minutes
🟡 Multi-Node Cache (100% Model Cached)    | On Cache Node      | ~10 minutes
✅ Single-Node Cache (100% Model Cached)   | On Cache Node      | ~80 seconds

Resources

# dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  mounts:
    - mountPoint: s3://hf-models/meta-llama/Llama-3.3-70B-Instruct
      options:
        alluxio.underfs.s3.endpoint: http://minio:9000
        alluxio.underfs.s3.disable.dns.buckets: "true"
        aws.secretKey: "minioadmin"
        aws.accessKeyId: "minioadmin"
        alluxio.underfs.s3.streaming.upload.enabled: "true"
        alluxio.underfs.s3.multipart.upload.threads: "20"
        alluxio.underfs.s3.socket.timeout: "50s"
        alluxio.underfs.s3.request.timeout: "60s"
---
# runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: llama-3-3-70b-instruct-model
  namespace: my-namespace
spec:
  replicas: 1
  properties:
    alluxio.user.file.readtype.default: CACHE_PROMOTE
    alluxio.user.file.write.type.default: CACHE_THROUGH
    alluxio.user.block.size.bytes.default: 128MB
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 300Gi
        high: "1.0"
        low: "0.7"
---
# DataLoad - Preloads the model into cache
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: llama-3-3-70b-instruct-model-loader
spec:
  dataset:
    name: llama-3-3-70b-instruct-model
    namespace: my-namespace
  loadMetadata: true
  target:
    - path: "/"
      replicas: 1
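Because the 70B checkpoint is large, it is worth waiting until the Dataset reports (close to) 100% cached before starting the deployment, since the performance numbers above assume a fully warmed cache:

# Watch the preload job (in the namespace where the DataLoad was applied)
kubectl get dataload llama-3-3-70b-instruct-model-loader -w

# Check how much of the model is cached
kubectl get dataset llama-3-3-70b-instruct-model -n my-namespace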

The associated DynamoGraphDeployment uses node affinity to schedule the vLLM worker on the same node as the Alluxio cache worker:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-hello-world
spec:
  envs:
    - name: DYN_LOG
      value: "debug"
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common": {"model": "/model", "block-size": 64, "max-model-len": 16384},
        "Frontend": {"served_model_name": "meta-llama/Llama-3.3-70B-Instruct", "endpoint":
        "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"router":
        "round-robin", "router-num-threads": 4, "common-configs": ["model", "block-size",
        "max-model-len"]}, "VllmWorker": {"tensor-parallel-size": 4, "enforce-eager": true, "max-num-batched-tokens":
        16384, "enable-prefix-caching": true, "ServiceArgs": {"workers": 1, "resources":
        {"gpu": "4", "memory": "40Gi"}}, "common-configs": ["model", "block-size", "max-model-len"]},
        "Planner": {"environment": "kubernetes", "no-operation": true}}'
  pvcs:
    - name: llama-3-3-70b-instruct-model
  services:
    Processor:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
    VllmWorker:
      volumeMounts:
        - name: llama-3-3-70b-instruct-model
          mountPoint: /model
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model
                      operator: In
                      values:
                        - "true"
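The node label used in the affinity rule is the one Fluid sets on nodes hosting an Alluxio cache worker for this dataset, so you can check which node the cache lives on and confirm that the VllmWorker pod landed there:

# Node(s) hosting the Alluxio cache worker for this dataset
kubectl get nodes -l fluid.io/s-alluxio-my-namespace-llama-3-3-70b-instruct-model=true

# Confirm the vLLM worker pod was scheduled onto that node
kubectl get pods -n my-namespace -o wide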

Troubleshooting & FAQ

  • PVC not created? Check the Fluid controller and AlluxioRuntime pod logs (see the commands below).
  • Model not found? Ensure the model was uploaded to the correct bucket/path.
  • Permission errors? Verify S3/MinIO credentials and bucket policies.
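A few starting points for debugging, assuming the default Fluid naming where an AlluxioRuntime creates <dataset>-master-0 and <dataset>-worker-* pods in the Dataset's namespace:

# Events on the Dataset often explain why no PVC was provisioned
kubectl describe dataset s3-model

# Fluid control-plane pods and Alluxio runtime logs (pod name assumes default Fluid naming)
kubectl get pods -n fluid-system
kubectl logs s3-model-master-0 --all-containers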
