Model Caching with Fluid
Fluid is an open-source, cloud-native data orchestration and acceleration platform for Kubernetes. It virtualizes and accelerates data access from various sources (object storage, distributed file systems, cloud storage), making it ideal for AI, machine learning, and big data workloads.
Key Features
- Data Caching and Acceleration: Cache remote data close to compute workloads for faster access.
- Unified Data Access: Access data from S3, HDFS, NFS, and more through a single interface.
- Kubernetes Native: Integrates with Kubernetes using CRDs for data management.
- Scalability: Supports large-scale data and compute clusters.
Installation
You can install Fluid on any Kubernetes cluster using Helm.
Prerequisites:
- Kubernetes >= 1.18
- kubectl >= 1.18
- Helm >= 3.5
Quick Install:
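A minimal install sketch using the Helm chart published by the Fluid project (pin a chart version for production installs):

```bash
# Add the Fluid Helm repository and install into the fluid-system namespace.
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
helm install fluid fluid/fluid --namespace fluid-system --create-namespace

# Verify that the Fluid controllers are running before creating Datasets.
kubectl get pods -n fluid-system
```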
For advanced configuration, see the Fluid Installation Guide.
Pre-deployment Steps
- Install Fluid (see Installation).
- Create a Dataset and Runtime (see the following example).
- Mount the resulting PVC in your workload (a quick verification sketch follows this list).
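Before wiring the PVC into a workload, it is worth confirming that the Dataset is bound and that the PVC exists. A quick check, using the webufs-model Dataset from the example below:

```bash
# The Dataset should eventually report a Bound phase.
kubectl get dataset webufs-model

# Fluid creates a PVC with the same name as the Dataset.
kubectl get pvc webufs-model
```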
Mounting Data Sources
WebUFS Example
WebUFS allows mounting HTTP/HTTPS sources as filesystems.
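A minimal sketch of a WebUFS-backed Dataset and its AlluxioRuntime; the HTTP URL, cache medium, and quota are placeholders to adapt to your model:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: webufs-model
spec:
  mounts:
    # Hypothetical HTTP(S) directory that serves the model files.
    - mountPoint: https://example.com/models/my-model/
      name: model
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: webufs-model   # must match the Dataset name
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi      # size the cache to hold the full model
        high: "0.95"
        low: "0.7"
```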
After applying these manifests, Fluid creates a PersistentVolumeClaim (PVC) named webufs-model that exposes the mounted files.
S3 Example
Mount an S3 bucket as a Fluid Dataset.
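A minimal sketch with illustrative bucket, endpoint, and Secret names; the endpoint-related options are only needed for S3-compatible stores such as MinIO and can be dropped for AWS S3:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
stringData:
  aws.accessKeyId: <access-key>
  aws.secretKey: <secret-key>
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: s3-model
spec:
  mounts:
    - mountPoint: s3://my-model-bucket/models/   # illustrative bucket and prefix
      name: model
      options:
        alluxio.underfs.s3.endpoint: http://minio.minio.svc.cluster.local:9000   # MinIO only
        alluxio.underfs.s3.disable.dns.buckets: "true"                           # MinIO only
      encryptOptions:
        - name: aws.accessKeyId
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.accessKeyId
        - name: aws.secretKey
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.secretKey
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: s3-model       # must match the Dataset name
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/fluid/cache
        quota: 50Gi
        high: "0.95"
        low: "0.7"
```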
The resulting PVC is named s3-model.
Using HuggingFace Models with Fluid
Limitations:
- HuggingFace models are not exposed as simple filesystems or buckets.
- No native integration exists between Fluid and the HuggingFace Hub API.
Workaround: Download and Upload to S3/MinIO
- Download the model using the HuggingFace CLI or SDK.
- Upload the model files to a supported storage backend (S3, GCS, NFS).
- Mount that backend using Fluid.
Example Pod to Download and Upload:
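A sketch of such a Pod, assuming the huggingface_hub CLI and the AWS CLI, a MinIO/S3 endpoint reachable in-cluster, and Secrets holding the HuggingFace token and S3 credentials (image, Secret names, and endpoint are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hf-model-uploader
spec:
  restartPolicy: Never
  containers:
    - name: uploader
      image: python:3.11-slim            # any image with Python/pip works
      command: ["/bin/sh", "-c"]
      args:
        - |
          pip install --quiet "huggingface_hub[cli]" awscli
          # Back /tmp/model with an emptyDir or scratch PVC for large models.
          huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --local-dir /tmp/model
          # Drop --endpoint-url when uploading to AWS S3 instead of MinIO.
          aws s3 cp /tmp/model \
            s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
            --recursive --endpoint-url "$S3_ENDPOINT"
      env:
        - name: HF_TOKEN                 # needed for gated models
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        - name: S3_ENDPOINT
          value: http://minio.minio.svc.cluster.local:9000
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.accessKeyId
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: aws.secretKey
```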
You can then use s3://hf-models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B as your Dataset mount.
Usage with Dynamo
Mount the Fluid-generated PVC in your DynamoGraphDeployment:
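A trimmed sketch of the relevant part; the service name, mount path, and the exact pvc field layout depend on your Dynamo version, so treat the field names below as illustrative rather than a verbatim schema:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-deployment
spec:
  services:
    VllmWorker:                 # illustrative service name
      pvc:
        create: false           # reuse the PVC that Fluid generated
        name: s3-model          # Fluid PVC name == Dataset name
        mountPoint: /mnt/model  # worker reads weights from this path
```

Whatever the exact field names, the point is that the worker loads weights from the Fluid-backed mount instead of downloading them from the Hub at startup.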
Full example with llama3.3 70B
Performance
When deploying LLaMA 3.3 70B using Fluid as the caching layer, we observed the best performance by configuring a single-node cache that holds 100% of the model files locally. By ensuring that the vllm worker pod is scheduled on the same node as the Fluid cache, we were able to eliminate network I/O bottlenecks, which resulted in the fastest model startup time and the highest inference efficiency during our tests.
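To reproduce that co-location, the worker pod can declare pod affinity against Fluid's Alluxio cache worker pods. A sketch of the affinity stanza, assuming the labels Fluid typically applies to Alluxio workers (verify with kubectl get pods --show-labels) and placed wherever your DynamoGraphDeployment exposes the worker pod spec:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            role: alluxio-worker   # Fluid's Alluxio cache worker pods
            release: s3-model      # the Dataset/Runtime name
        topologyKey: kubernetes.io/hostname   # co-schedule on the same node
```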
Resources
- The full llama3.3 70B example, including the associated DynamoGraphDeployment with pod affinity to schedule the vllm worker on the same node as the Alluxio cache worker.
Troubleshooting & FAQ
- PVC not created? Check the Fluid controller and AlluxioRuntime pod logs (see the commands sketched after this list).
- Model not found? Ensure the model was uploaded to the correct bucket/path.
- Permission errors? Verify S3/MinIO credentials and bucket policies.
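A few commands that usually narrow these down, assuming the s3-model Dataset from the examples above and the default fluid-system install namespace (controller deployment names can differ slightly across Fluid versions):

```bash
# Dataset and runtime status: phase, cache usage, whether the PVC was bound.
kubectl get dataset s3-model
kubectl describe alluxioruntime s3-model

# Fluid controller logs.
kubectl logs -n fluid-system deploy/dataset-controller
kubectl logs -n fluid-system deploy/alluxioruntime-controller

# Logs from the Alluxio cache itself.
kubectl logs s3-model-master-0 -c alluxio-master
kubectl logs -l release=s3-model,role=alluxio-worker -c alluxio-worker
```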