# Tiered Storage Strategy: Longhorn + NFS

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity

## Context and Problem Statement

Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:

- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets

The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus external NAS units for bulk storage.

How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?

## Decision Drivers

* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management

## Considered Options

1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**

## Decision Outcome

Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**

Three storage tiers optimized for different use cases:

- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage

### Positive Consequences

* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) comes easily on the NFS tiers
* Cost-effective - reuses the existing NAS investment

### Negative Consequences

* Two storage systems to manage
* The HDD-backed NFS tier is slower (hence the `nfs-slow` name)
* Each NAS is a single point of failure (no replication)
* Network dependency for all tiers

## Architecture

```
┌────────────────────────────────────────────────────────────┐
│                      TIER 1: LONGHORN                      │
│              (Fast Distributed Block Storage)              │
│                                                            │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   khelben   │    │   mystra    │    │   selune    │     │
│  │  (NVIDIA)   │    │    (AMD)    │    │    (AMD)    │     │
│  │             │    │             │    │             │     │
│  │  /var/mnt/  │    │  /var/mnt/  │    │  /var/mnt/  │     │
│  │  longhorn   │    │  longhorn   │    │  longhorn   │     │
│  │   (NVMe)    │    │    (SSD)    │    │    (SSD)    │     │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘     │
│         │                  │                  │            │
│         └──────────────────┼──────────────────┘            │
│                            ▼                               │
│                ┌───────────────────────┐                   │
│                │   Longhorn Manager    │                   │
│                │ (Schedules replicas)  │                   │
│                └───────────┬───────────┘                   │
│                            ▼                               │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│ │ Postgres │   │  Vault   │   │Prometheus│   │ClickHouse│  │
│ │   PVC    │   │   PVC    │   │   PVC    │   │   PVC    │  │
│ └──────────┘   └──────────┘   └──────────┘   └──────────┘  │
└────────────────────────────────────────────────────────────┘
```
```
┌────────────────────────────────────────────────────────────┐
│                      TIER 2: NFS-SLOW                      │
│                (High-Capacity Bulk Storage)                │
│                                                            │
│    ┌──────────────────────────────────────────────────┐    │
│    │         candlekeep.lab.daviestechlabs.io         │    │
│    │                    (QNAP NAS)                    │    │
│    │                                                  │    │
│    │  /kubernetes                                     │    │
│    │  ├── jellyfin-media/   (1TB+ media library)      │    │
│    │  ├── nextcloud/        (user files)              │    │
│    │  ├── immich/           (photo backups)           │    │
│    │  ├── kavita/           (ebooks, comics, manga)   │    │
│    │  ├── mlflow-artifacts/ (model artifacts)         │    │
│    │  ├── ray-models/       (AI model weights)        │    │
│    │  └── gitea-runner/     (build caches)            │    │
│    └──────────────────────────────────────────────────┘    │
│                              │                             │
│                              ▼                             │
│                  ┌───────────────────────┐                 │
│                  │    NFS CSI Driver     │                 │
│                  │   (csi-driver-nfs)    │                 │
│                  └───────────┬───────────┘                 │
│                              ▼                             │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│ │ Jellyfin │   │Nextcloud │   │  Immich  │   │  Kavita  │  │
│ │   PVC    │   │   PVC    │   │   PVC    │   │   PVC    │  │
│ └──────────┘   └──────────┘   └──────────┘   └──────────┘  │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                      TIER 3: NFS-FAST                      │
│          (High-Performance SSD NFS + S3 Storage)           │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │          gravenhollow.lab.daviestechlabs.io          │  │
│  │   (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)   │  │
│  │                                                      │  │
│  │  NFS: /mnt/gravenhollow/kubernetes                   │  │
│  │  ├── ray-model-cache/  (AI model weights - hot)      │  │
│  │  ├── mlflow-artifacts/ (ML experiment tracking)      │  │
│  │  └── training-data/    (datasets for fine-tuning)    │  │
│  │                                                      │  │
│  │  S3 (RustFS):                                        │  │
│  │    https://gravenhollow.lab.daviestechlabs.io:30292  │  │
│  │  ├── kubeflow-pipelines (pipeline artifacts)         │  │
│  │  ├── training-data      (large dataset staging)      │  │
│  │  └── longhorn-backups   (off-cluster backup target)  │  │
│  └──────────────────────────────────────────────────────┘  │
│                              │                             │
│                              ▼                             │
│                  ┌───────────────────────┐                 │
│                  │    NFS CSI Driver     │                 │
│                  │   (csi-driver-nfs)    │                 │
│                  └───────────┬───────────┘                 │
│                              ▼                             │
│ ┌──────────┐   ┌──────────┐   ┌──────────┐                 │
│ │Ray Model │   │  MLflow  │   │ Training │                 │
│ │  Cache   │   │ Artifact │   │   Data   │                 │
│ │   PVC    │   │   PVC    │   │   PVC    │                 │
│ └──────────┘   └──────────┘   └──────────┘                 │
└────────────────────────────────────────────────────────────┘
```

## Tier 1: Longhorn Configuration

### Helm Values

```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```

### Longhorn Storage Classes

| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |

## Tier 2: NFS Configuration

### Helm Values (csi-driver-nfs)

```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:

- **Latency:** The NAS sits across the network, so latency is higher than local NVMe
- **IOPS:** Spinning disks in the NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space

## Tier 3: NFS-Fast Configuration

### Helm Values (second csi-driver-nfs installation)

A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.

```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2     # Server-side copy, fallocate, seek hole
    - nconnect=16     # 16 TCP connections across bonded 10GbE
    - rsize=1048576   # 1 MB read block size
    - wsize=1048576   # 1 MB write block size
    - hard            # Retry indefinitely on timeout
    - noatime         # Skip access-time updates
    - nodiratime      # Skip directory access-time updates
    - nocto           # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600     # Cache attributes for 10 min
    - max_connect=16  # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

### Performance Tuning Rationale

| Option | Why |
|--------|-----|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |

### Why "nfs-fast"?

Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):

- **All-SSD:** No spinning-disk latency — suitable for random I/O workloads like model loading
- **Dual 10GbE:** 2× 10 Gbps network links via link aggregation
- **12.2 TB capacity:** Enough for model cache, artifacts, and training data
- **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups
- **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe

### S3 Endpoint (RustFS)

Gravenhollow also provides S3-compatible object storage via RustFS:

- **Endpoint:** `https://gravenhollow.lab.daviestechlabs.io:30292`
- **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`)

## Storage Tier Selection Guide

| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |

## Volume Usage by Tier

### Longhorn Volumes (Performance Tier)

| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |

### NFS Volumes (Capacity Tier)

| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |

## Backup Strategy

### Longhorn Tier

#### Local Snapshots

- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion

#### Off-Cluster Backups

- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery

### NFS Tier

#### NAS-Level Backups

- Handled by the NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on the NAS RAID configuration for redundancy

### Backup Target Configuration (Longhorn)

```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```

## Node Exclusions (Longhorn Only)

**Raspberry Pi nodes excluded because:**

- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components

**GPU nodes included with tolerations:**

- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there

## Performance Considerations
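The per-tier claims in this section are easy to sanity-check empirically: a throwaway `fio` Job pointed at a PVC from the storage class under test yields comparable IOPS and latency numbers for each tier. A minimal sketch, assuming a pre-created PVC; the Job/PVC names and container image below are illustrative placeholders, not part of this repo:

```yaml
# Hypothetical benchmark Job; rerun with a PVC per storage class.
apiVersion: batch/v1
kind: Job
metadata:
  name: fio-bench
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fio
          image: fio-benchmark:latest  # placeholder - any image that ships fio
          command:
            - fio
            - --name=randrw        # mixed random read/write at 4k blocks
            - --directory=/data
            - --rw=randrw
            - --bs=4k
            - --size=1g
            - --numjobs=4
            - --time_based
            - --runtime=60
            - --group_reporting
          volumeMounts:
            - name: bench
              mountPath: /data
      volumes:
        - name: bench
          persistentVolumeClaim:
            claimName: fio-bench-pvc  # PVC with storageClassName: longhorn / nfs-fast / nfs-slow
```

Running the same Job against a `longhorn`, `nfs-fast`, and `nfs-slow` PVC turns the latency and IOPS comparisons in this section into measured numbers.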
### Longhorn Performance

- `khelben` has NVMe - the fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensure data survives a single node failure
- Trade-off: 2x storage consumption

### NFS Performance

- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead

### When to Choose Each Tier

| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|-------------|----------|----------|----------|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |

(⚡ = good, but a step below the best tier for that requirement.)

## Monitoring

**Grafana Dashboard:** Longhorn dashboard covering:

- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status

**Alerts:**

- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed

## Future Enhancements

1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) tiers within Longhorn
5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
7. **Longhorn backups to gravenhollow S3** - Use RustFS as the off-cluster backup target

## References

* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)