Tiered Storage Strategy: Longhorn + NFS

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity

Context and Problem Statement

Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:

  • Databases need fast, reliable storage with replication
  • Media libraries need large capacity but can tolerate slower access
  • AI/ML workloads need both - fast storage for models, large capacity for datasets

The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.

How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?

Decision Drivers

  • Performance - fast IOPS for databases and critical workloads
  • Capacity - large storage for media, datasets, and archives
  • Reliability - data must survive node failures
  • Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
  • Backup capability - support for off-cluster backups
  • GitOps deployment - Helm charts with Flux management

Considered Options

  1. Longhorn + NFS dual-tier storage
  2. Rook-Ceph for everything
  3. OpenEBS with Mayastor
  4. NFS only
  5. Longhorn only

Decision Outcome

Chosen option: Option 1 - Longhorn + NFS dual-tier storage

Three storage tiers optimized for different use cases:

  • longhorn (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
  • nfs-fast: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
  • nfs-slow: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
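A workload opts into a tier through `storageClassName` on its PersistentVolumeClaim. A minimal sketch — the claim name and size are illustrative, not from the cluster config:

```yaml
# Hypothetical PVC pinning a database volume to the Longhorn tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data        # illustrative name
spec:
  accessModes:
    - ReadWriteOnce          # block storage: single-node attach
  storageClassName: longhorn # omit to fall back to the default class
  resources:
    requests:
      storage: 20Gi
```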

Positive Consequences

  • Right-sized storage for each workload type
  • Longhorn provides HA with automatic replication
  • NFS provides massive capacity without consuming cluster disk space
  • ReadWriteMany (RWX) volumes are straightforward on the NFS tiers
  • Cost-effective - use existing NAS investment

Negative Consequences

  • Two storage systems to manage
  • NFS is slower (hence nfs-slow naming)
  • NFS single point of failure (no replication)
  • Network dependency for both tiers

Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 1: LONGHORN                              │
│                        (Fast Distributed Block Storage)                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                         │
│  │   khelben   │  │   mystra    │  │   selune    │                         │
│  │  (NVIDIA)   │  │   (AMD)     │  │   (AMD)     │                         │
│  │             │  │             │  │             │                         │
│  │ /var/mnt/   │  │ /var/mnt/   │  │ /var/mnt/   │                         │
│  │  longhorn   │  │  longhorn   │  │  longhorn   │                         │
│  │  (NVMe)     │  │  (SSD)      │  │  (SSD)      │                         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                         │
│         │                │                │                                 │
│         └────────────────┼────────────────┘                                 │
│                          ▼                                                  │
│              ┌───────────────────────┐                                      │
│              │   Longhorn Manager    │                                      │
│              │  (Schedules replicas) │                                      │
│              └───────────┬───────────┘                                      │
│                          ▼                                                  │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
│     │ Postgres │  │  Vault   │  │Prometheus│  │ClickHouse│                 │
│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 2: NFS-SLOW                              │
│                        (High-Capacity Bulk Storage)                         │
│                                                                            │
│  ┌────────────────────────────────────────────────────────────────┐        │
│  │                  candlekeep.lab.daviestechlabs.io              │        │
│  │                         (QNAP NAS)                              │        │
│  │                                                                 │        │
│  │   /kubernetes                                                   │        │
│  │   ├── jellyfin-media/     (1TB+ media library)                 │        │
│  │   ├── nextcloud/          (user files)                         │        │
│  │   ├── immich/             (photo backups)                      │        │
│  │   ├── kavita/             (ebooks, comics, manga)              │        │
│  │   ├── mlflow-artifacts/   (model artifacts)                    │        │
│  │   ├── ray-models/         (AI model weights)                   │        │
│  │   └── gitea-runner/       (build caches)                       │        │
│  └────────────────────────────────────────────────────────────────┘        │
│                          │                                                  │
│                          ▼                                                  │
│              ┌───────────────────────┐                                      │
│              │   NFS CSI Driver      │                                      │
│              │  (csi-driver-nfs)     │                                      │
│              └───────────┬───────────┘                                      │
│                          ▼                                                  │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
│     │ Jellyfin │  │Nextcloud │  │  Immich  │  │  Kavita  │                 │
│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 3: NFS-FAST                              │
│                     (High-Performance SSD NFS + S3 Storage)                │
│                                                                            │
│  ┌────────────────────────────────────────────────────────────────┐        │
│  │                gravenhollow.lab.daviestechlabs.io              │        │
│  │          (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)     │        │
│  │                                                                │        │
│  │   NFS: /mnt/gravenhollow/kubernetes                            │        │
│  │   ├── ray-model-cache/    (AI model weights - hot)             │        │
│  │   ├── mlflow-artifacts/   (ML experiment tracking)             │        │
│  │   └── training-data/      (datasets for fine-tuning)           │        │
│  │                                                                │        │
│  │   S3 (RustFS): http://gravenhollow.lab.daviestechlabs.io:30292  │        │
│  │   ├── kubeflow-pipelines   (pipeline artifacts)                │        │
│  │   ├── training-data        (large dataset staging)             │        │
│  │   └── longhorn-backups     (off-cluster backup target)         │        │
│  └────────────────────────────────────────────────────────────────┘        │
│                          │                                                  │
│                          ▼                                                  │
│              ┌───────────────────────┐                                      │
│              │   NFS CSI Driver      │                                      │
│              │  (csi-driver-nfs)     │                                      │
│              └───────────┬───────────┘                                      │
│                          ▼                                                  │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐                               │
│     │Ray Model │  │  MLflow  │  │ Training │                               │
│     │  Cache   │  │ Artifact │  │   Data   │                               │
│     │   PVC    │  │   PVC    │  │   PVC    │                               │
│     └──────────┘  └──────────┘  └──────────┘                               │
└────────────────────────────────────────────────────────────────────────────┘

Tier 1: Longhorn Configuration

Helm Values

```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
  # Snapshot retention
  defaultRecurringJobs:
    - name: nightly-snapshots
      task: snapshot
      cron: "0 2 * * *"
      retain: 7
    - name: weekly-backups
      task: backup
      cron: "0 3 * * 0"
      retain: 4
```

Longhorn Storage Classes

| StorageClass       | Replicas | Use Case                     |
|--------------------|----------|------------------------------|
| longhorn (default) | 2        | General workloads, databases |
| longhorn-single    | 1        | Development/ephemeral        |
| longhorn-strict    | 3        | Critical databases           |
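As a sketch, the non-default classes differ only in their replica count; assuming the standard Longhorn provisioner parameters (the `dataLocality` choice is an assumption, not confirmed by this ADR):

```yaml
# Hypothetical longhorn-strict class: 3 replicas for critical databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"      # one replica per storage node
  staleReplicaTimeout: "30"  # minutes before a stale replica is cleaned up
  dataLocality: best-effort  # keep a replica on the consuming node when possible
```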

Tier 2: NFS Configuration

Helm Values (csi-driver-nfs)

```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16    # Multiple TCP connections for throughput
    - hard           # Retry indefinitely on failure
    - noatime        # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

Why "nfs-slow"?

The naming is intentional - it sets correct expectations:

  • Latency: NAS is over network, higher latency than local NVMe
  • IOPS: Spinning disks in NAS can't match SSD performance
  • Throughput: Adequate for streaming media, not for databases
  • Benefit: Massive capacity without consuming cluster disk space
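Because the backing store is plain NFS, multiple pods can mount the same volume read-write. A hedged example matching the Jellyfin entry in the volume table below (the claim name is illustrative):

```yaml
# Hypothetical RWX claim for a media library on the capacity tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  accessModes:
    - ReadWriteMany          # shared by the server and any maintenance pods
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```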

Tier 3: NFS-Fast Configuration

Helm Values (second csi-driver-nfs installation)

A second HelmRelease (csi-driver-nfs-fast) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.

```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2        # Server-side copy, fallocate, seekhole
    - nconnect=16        # 16 TCP connections across bonded 10GbE
    - rsize=1048576      # 1 MB read block size
    - wsize=1048576      # 1 MB write block size
    - hard               # Retry indefinitely on timeout
    - noatime            # Skip access-time updates
    - nodiratime         # Skip directory access-time updates
    - nocto              # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600        # Cache attributes for 10 min
    - max_connect=16     # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
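The wrapping Flux objects might look like the following sketch; the namespace, interval, and OCI source name are assumptions, not taken from the repository:

```yaml
# Sketch of the second HelmRelease; actual source names/versions may differ
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: csi-driver-nfs-fast
  namespace: kube-system
spec:
  interval: 30m
  chartRef:
    kind: OCIRepository
    name: csi-driver-nfs     # same OCI chart the nfs-slow release uses
  values:
    controller:
      replicas: 0            # driver pods already run from the nfs-slow install
    node:
      enabled: false
    storageClass:
      create: true
      name: nfs-fast
```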

Performance Tuning Rationale

| Option | Why |
|--------|-----|
| nfsvers=4.2 | Enables server-side copy, hole punch, and fallocate; TrueNAS Scale supports NFSv4.2 natively |
| nconnect=16 | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| rsize/wsize=1048576 | 1 MB block sizes maximise throughput per operation; jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| nocto | Skips close-to-open consistency checks; safe because model weights and artifacts are write-once/read-many |
| actimeo=600 | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| nodiratime | Avoids unnecessary directory timestamp writes alongside noatime |

Why "nfs-fast"?

Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):

  • All-SSD: No spinning disk latency — suitable for random I/O workloads like model loading
  • Dual 10GbE: 2× 10 Gbps network links via link aggregation
  • 12.2 TB capacity: Enough for model cache, artifacts, and training data
  • RustFS S3: S3-compatible object storage endpoint for pipeline artifacts and backups
  • Use case: AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe

S3 Endpoint (RustFS)

Gravenhollow also provides S3-compatible object storage via RustFS:

  • Endpoint: http://gravenhollow.lab.daviestechlabs.io:30292
  • Use cases: Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
  • Credentials: Managed via Vault ExternalSecret (kv/data/gravenhollow: access_key, secret_key)
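Following the pattern used for the Longhorn backup credentials below, the RustFS keys could be materialized with an ExternalSecret; this is a sketch — the secret name and Vault property names are assumptions:

```yaml
# Hypothetical ExternalSecret materializing the RustFS credentials from Vault
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: rustfs-s3-credentials   # illustrative name
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: rustfs-s3-credentials
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/gravenhollow
        property: access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/gravenhollow
        property: secret_key
```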

Storage Tier Selection Guide

| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | longhorn | HA with replication, low latency |
| Prometheus/ClickHouse | longhorn | High write IOPS required |
| Vault | longhorn | Security-critical, needs HA |
| AI/ML models (Ray) | nfs-fast | Large model weights, SSD speed |
| MLflow artifacts | nfs-fast | Experiment tracking, frequent reads |
| Training data | nfs-fast | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | nfs-slow | Large files, sequential reads |
| Photos (Immich) | nfs-slow | Bulk storage for photos |
| User files (Nextcloud) | nfs-slow | Capacity over speed |
| Build caches (Gitea runner) | nfs-slow | Ephemeral, large |
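For operators like CloudNativePG, the tier is selected via the operator's own storage stanza rather than a hand-written PVC. A sketch with an illustrative cluster name and size:

```yaml
# Hypothetical CNPG cluster keeping its data on the replicated Longhorn tier
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 2
  storage:
    storageClass: longhorn   # replicated block storage, low latency
    size: 10Gi
```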

Volume Usage by Tier

Longhorn Volumes (Performance Tier)

| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |

NFS Volumes (Capacity Tier)

| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |

Backup Strategy

Longhorn Tier

Local Snapshots

  • Frequency: Nightly at 2 AM
  • Retention: 7 days
  • Purpose: Quick recovery from accidental deletion

Off-Cluster Backups

  • Frequency: Weekly on Sundays at 3 AM
  • Destination: S3-compatible storage (MinIO/Backblaze)
  • Retention: 4 weeks
  • Purpose: Disaster recovery

NFS Tier

NAS-Level Backups

  • Handled by NAS backup solution (snapshots, replication)
  • Not managed by Kubernetes
  • Relies on NAS RAID configuration for redundancy

Backup Target Configuration (Longhorn)

```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
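With the secret in place, Longhorn's settings can point at any S3-compatible bucket. One hedged sketch targeting the RustFS `longhorn-backups` bucket — the region is a placeholder (non-AWS endpoints ignore it), and the endpoint override would go into the secret as an `AWS_ENDPOINTS` key:

```yaml
# Sketch: Longhorn Helm values wiring the backup target to the secret above
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/   # bucket@region/ form; region arbitrary for RustFS
  backupTargetCredentialSecret: longhorn-backup-secret
```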

Node Exclusions (Longhorn Only)

Raspberry Pi nodes excluded because:

  • Limited disk I/O performance
  • SD card wear concerns
  • Memory constraints for Longhorn components

GPU nodes included with tolerations:

  • khelben (NVIDIA) participates in Longhorn storage
  • Taint toleration allows Longhorn to schedule there

Performance Considerations

Longhorn Performance

  • khelben has NVMe - fastest storage node
  • mystra/selune have SATA SSDs - adequate for most workloads
  • 2 replicas across different nodes ensures single node failure survival
  • Trade-off: 2x storage consumption

NFS Performance

  • Optimized with nconnect=16 for parallel connections
  • noatime reduces unnecessary write operations
  • Sequential read workloads perform well (media streaming)
  • Random I/O workloads should use Longhorn instead

When to Choose Each Tier

| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|-------------|----------|----------|----------|
| Low latency | ✓ | ✗ | ✗ |
| High IOPS | ✓ | ✓ | ✗ |
| Large capacity | ✗ | ✓ (12.2 TB) | ✓ |
| ReadWriteMany (RWX) | Limited | ✓ | ✓ |
| S3 compatible | ✗ | ✓ (RustFS) | ✓ (QuObjects) |
| Node failure survival | ✓ | ✓ (NAS) | ✓ (NAS) |
| Kubernetes-native | ✓ | ✗ | ✗ |

Monitoring

Grafana Dashboard: Longhorn dashboard for:

  • Volume health and replica status
  • IOPS and throughput per volume
  • Disk space utilization per node
  • Backup job status

Alerts:

  • Volume degraded (replica count < desired)
  • Disk space low (< 20% free)
  • Backup job failed
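Assuming Longhorn's metrics are scraped by the Prometheus Operator stack, the degraded-volume alert can be expressed as a PrometheusRule. This is a sketch: the rule name and threshold are illustrative, though `longhorn_volume_robustness` (where 2 means degraded) is the gauge Longhorn exports for replica health:

```yaml
# Hypothetical alert rule for degraded Longhorn volumes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-volume-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2   # 2 = degraded
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```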

Future Enhancements

  1. NAS high availability - Second NAS with replication. Done: gravenhollow adds a second NAS.
  2. Dedicated storage network - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
  3. NVMe-oF - Network NVMe for lower latency
  4. Tiered Longhorn - Hot (NVMe) and warm (SSD) within Longhorn
  5. S3 tier - MinIO for object storage workloads. Done: gravenhollow's RustFS provides S3 instead.
  6. Migrate AI/ML PVCs to nfs-fast - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
  7. Longhorn backups to gravenhollow S3 - Use RustFS as off-cluster backup target

References