Tiered Storage Strategy: Longhorn + NFS
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
Decision Drivers
- Performance - fast IOPS for databases and critical workloads
- Capacity - large storage for media, datasets, and archives
- Reliability - data must survive node failures
- Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
- Backup capability - support for off-cluster backups
- GitOps deployment - Helm charts with Flux management
Considered Options
- Longhorn + NFS dual-tier storage
- Rook-Ceph for everything
- OpenEBS with Mayastor
- NFS only
- Longhorn only
Decision Outcome
Chosen option: Option 1 - Longhorn + NFS dual-tier storage
Three storage tiers optimized for different use cases:
- `longhorn` (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- `nfs-fast`: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- `nfs-slow`: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
Positive Consequences
- Right-sized storage for each workload type
- Longhorn provides HA with automatic replication
- NFS provides massive capacity without consuming cluster disk space
- ReadWriteMany (RWX) is straightforward on the NFS tiers
- Cost-effective - use existing NAS investment
Negative Consequences
- Two storage systems to manage
- NFS is slower (hence the `nfs-slow` naming)
- NFS is a single point of failure (no replication)
- Network dependency for both tiers
Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: LONGHORN │
│ (Fast Distributed Block Storage) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ khelben │ │ mystra │ │ selune │ │
│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │
│ │ │ │ │ │ │ │
│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │
│ │ longhorn │ │ longhorn │ │ longhorn │ │
│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Longhorn Manager │ │
│ │ (Schedules replicas) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: NFS-SLOW │
│ (High-Capacity Bulk Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ candlekeep.lab.daviestechlabs.io │ │
│ │ (QNAP NAS) │ │
│ │ │ │
│ │ /kubernetes │ │
│ │ ├── jellyfin-media/ (1TB+ media library) │ │
│ │ ├── nextcloud/ (user files) │ │
│ │ ├── immich/ (photo backups) │ │
│ │ ├── kavita/ (ebooks, comics, manga) │ │
│ │ ├── mlflow-artifacts/ (model artifacts) │ │
│ │ ├── ray-models/ (AI model weights) │ │
│ │ └── gitea-runner/ (build caches) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 3: NFS-FAST │
│ (High-Performance SSD NFS + S3 Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ gravenhollow.lab.daviestechlabs.io │ │
│ │ (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB) │ │
│ │ │ │
│ │ NFS: /mnt/gravenhollow/kubernetes │ │
│ │ ├── ray-model-cache/ (AI model weights - hot) │ │
│ │ ├── mlflow-artifacts/ (ML experiment tracking) │ │
│ │ └── training-data/ (datasets for fine-tuning) │ │
│ │ │ │
│ │ S3 (RustFS): https://gravenhollow.lab.daviestechlabs.io:30292 │ │
│ │ ├── kubeflow-pipelines (pipeline artifacts) │ │
│ │ ├── training-data (large dataset staging) │ │
│ │ └── longhorn-backups (off-cluster backup target) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Ray Model │ │ MLflow │ │ Training │ │
│ │ Cache │ │ Artifact │ │ Data │ │
│ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
Tier 1: Longhorn Configuration
Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|---|---|---|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
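Only the default class is created by the Helm values above; the non-default classes are separate StorageClass manifests. A minimal sketch of `longhorn-strict`, using standard Longhorn provisioner parameters (the `staleReplicaTimeout` and `dataLocality` values here are illustrative, not from the original config):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"        # survive two simultaneous node failures
  staleReplicaTimeout: "30"    # minutes before a failed replica is rebuilt elsewhere
  dataLocality: "best-effort"  # prefer a replica on the consuming node when possible
```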
Tier 2: NFS Configuration
Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- Latency: NAS is over network, higher latency than local NVMe
- IOPS: Spinning disks in NAS can't match SSD performance
- Throughput: Adequate for streaming media, not for databases
- Benefit: Massive capacity without consuming cluster disk space
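Workloads consume the tier by naming the class in `storageClassName`; a minimal PVC sketch (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-library        # illustrative name
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany          # NFS supports RWX, so multiple pods can mount it
  resources:
    requests:
      storage: 2Ti
```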
Tier 3: NFS-Fast Configuration
Helm Values (second csi-driver-nfs installation)
A second HelmRelease (csi-driver-nfs-fast) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.
```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2    # Server-side copy, fallocate, seek hole
    - nconnect=16    # 16 TCP connections across bonded 10GbE
    - rsize=1048576  # 1 MB read block size
    - wsize=1048576  # 1 MB write block size
    - hard           # Retry indefinitely on timeout
    - noatime        # Skip access-time updates
    - nodiratime     # Skip directory access-time updates
    - nocto          # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600    # Cache attributes for 10 min
    - max_connect=16 # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
Performance Tuning Rationale
| Option | Why |
|---|---|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize`/`wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |
Why "nfs-fast"?
Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):
- All-SSD: No spinning disk latency — suitable for random I/O workloads like model loading
- Dual 10GbE: 2× 10 Gbps network links via link aggregation
- 12.2 TB capacity: Enough for model cache, artifacts, and training data
- RustFS S3: S3-compatible object storage endpoint for pipeline artifacts and backups
- Use case: AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe
S3 Endpoint (RustFS)
Gravenhollow also provides S3-compatible object storage via RustFS:
- Endpoint: `https://gravenhollow.lab.daviestechlabs.io:30292`
- Use cases: Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- Credentials: Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`)
Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---|---|---|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
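In practice the guide translates to the storage-class field in each workload's volume spec. For example, a CloudNativePG cluster pinned to the `longhorn` tier might look like this sketch (cluster name, instance count, and size are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db               # illustrative name
spec:
  instances: 2
  storage:
    storageClass: longhorn   # fast, replicated block storage
    size: 20Gi
```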
Volume Usage by Tier
Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|---|---|---|---|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|---|---|---|---|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
Backup Strategy
Longhorn Tier
Local Snapshots
- Frequency: Nightly at 2 AM
- Retention: 7 days
- Purpose: Quick recovery from accidental deletion
Off-Cluster Backups
- Frequency: Weekly on Sundays at 3 AM
- Destination: S3-compatible storage (MinIO/Backblaze)
- Retention: 4 weeks
- Purpose: Disaster recovery
NFS Tier
NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
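With the credential secret in place, the backup target itself is configured through Longhorn's settings; a sketch, assuming a `longhorn-backups` bucket (Longhorn's S3 URL format requires a region token even for non-AWS endpoints):

```yaml
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/   # bucket and region are illustrative
  backupTargetCredentialSecret: longhorn-backup-secret
```

For a non-AWS S3 server, Longhorn also expects an `AWS_ENDPOINTS` key in the credential secret pointing at the endpoint URL.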
Node Exclusions (Longhorn Only)
Raspberry Pi nodes excluded because:
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
GPU nodes included with tolerations:
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
Performance Considerations
Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensure survival of a single node failure
- Trade-off: 2x storage consumption
NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
When to Choose Each Tier
| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|---|---|---|---|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |
Monitoring
Grafana Dashboard: Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
Alerts:
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
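The degraded-volume alert can be expressed against Longhorn's exported metrics; a PrometheusRule sketch (`longhorn_volume_robustness` reports 2 for degraded volumes; the rule name, duration, and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-volume-alerts   # illustrative name
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2   # 2 = degraded (replica count < desired)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```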
Future Enhancements
- ~~NAS high availability - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
- Dedicated storage network - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
- NVMe-oF - Network NVMe for lower latency
- Tiered Longhorn - Hot (NVMe) and warm (SSD) within Longhorn
- ~~S3 tier - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
- Migrate AI/ML PVCs to `nfs-fast` - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
- Longhorn backups to gravenhollow S3 - Use RustFS as off-cluster backup target