Tiered Storage Strategy: Longhorn + NFS
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
Decision Drivers
- Performance - fast IOPS for databases and critical workloads
- Capacity - large storage for media, datasets, and archives
- Reliability - data must survive node failures
- Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
- Backup capability - support for off-cluster backups
- GitOps deployment - Helm charts with Flux management
Considered Options
- Longhorn + NFS dual-tier storage
- Rook-Ceph for everything
- OpenEBS with Mayastor
- NFS only
- Longhorn only
Decision Outcome
Chosen option: Option 1 - Longhorn + NFS dual-tier storage
Three storage tiers optimized for different use cases:
- `longhorn` (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- `nfs-fast`: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- `nfs-slow`: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
Positive Consequences
- Right-sized storage for each workload type
- Longhorn provides HA with automatic replication
- NFS provides massive capacity without consuming cluster disk space
- ReadWriteMany (RWX) is straightforward on the NFS tiers
- Cost-effective - use existing NAS investment
Negative Consequences
- Two storage systems to manage
- NFS is slower (hence the `nfs-slow` naming)
- NFS is a single point of failure (no replication)
- Network dependency for both tiers
Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: LONGHORN │
│ (Fast Distributed Block Storage) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ khelben │ │ mystra │ │ selune │ │
│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │
│ │ │ │ │ │ │ │
│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │
│ │ longhorn │ │ longhorn │ │ longhorn │ │
│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Longhorn Manager │ │
│ │ (Schedules replicas) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: NFS-SLOW │
│ (High-Capacity Bulk Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ candlekeep.lab.daviestechlabs.io │ │
│ │ (QNAP NAS) │ │
│ │ │ │
│ │ /kubernetes │ │
│ │ ├── jellyfin-media/ (1TB+ media library) │ │
│ │ ├── nextcloud/ (user files) │ │
│ │ ├── immich/ (photo backups) │ │
│ │ ├── kavita/ (ebooks, comics, manga) │ │
│ │ ├── mlflow-artifacts/ (model artifacts) │ │
│ │ ├── ray-models/ (AI model weights) │ │
│ │ └── gitea-runner/ (build caches) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 3: NFS-FAST │
│ (High-Performance SSD NFS + S3 Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ gravenhollow.lab.daviestechlabs.io │ │
│ │ (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB) │ │
│ │ │ │
│ │ NFS: /mnt/gravenhollow/kubernetes │ │
│ │ ├── ray-model-cache/ (AI model weights - hot) │ │
│ │ ├── mlflow-artifacts/ (ML experiment tracking) │ │
│ │ └── training-data/ (datasets for fine-tuning) │ │
│ │ │ │
│ │ S3 (RustFS): https://gravenhollow.lab.daviestechlabs.io:30292 │ │
│ │ ├── kubeflow-pipelines (pipeline artifacts) │ │
│ │ ├── training-data (large dataset staging) │ │
│ │ └── longhorn-backups (off-cluster backup target) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Ray Model │ │ MLflow │ │ Training │ │
│ │ Cache │ │ Artifact │ │ Data │ │
│ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
Tier 1: Longhorn Configuration
Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|---|---|---|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
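Only the default class is created by the Helm values above; the non-default classes are separate StorageClass manifests. A minimal sketch of `longhorn-strict`, using standard Longhorn provisioner parameters (the `staleReplicaTimeout` and `dataLocality` values here are illustrative, not from the original config):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"        # survive two simultaneous node failures
  staleReplicaTimeout: "30"    # minutes before a failed replica is rebuilt elsewhere
  dataLocality: "best-effort"  # prefer a replica on the consuming node when possible
```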
Tier 2: NFS Configuration
Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- Latency: NAS is over network, higher latency than local NVMe
- IOPS: Spinning disks in NAS can't match SSD performance
- Throughput: Adequate for streaming media, not for databases
- Benefit: Massive capacity without consuming cluster disk space
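Workloads consume the tier by naming the class in `storageClassName`; a minimal PVC sketch (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-library        # illustrative name
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany          # NFS supports RWX, so multiple pods can mount it
  resources:
    requests:
      storage: 2Ti
```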
Tier 3: NFS-Fast Configuration
Helm Values (second csi-driver-nfs installation)
A second HelmRelease (csi-driver-nfs-fast) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.
```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2    # Server-side copy, fallocate, seek hole
    - nconnect=16    # 16 TCP connections across bonded 10GbE
    - rsize=1048576  # 1 MB read block size
    - wsize=1048576  # 1 MB write block size
    - hard           # Retry indefinitely on timeout
    - noatime        # Skip access-time updates
    - nodiratime     # Skip directory access-time updates
    - nocto          # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600    # Cache attributes for 10 min
    - max_connect=16 # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
Performance Tuning Rationale
| Option | Why |
|---|---|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize`/`wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |
Why "nfs-fast"?
Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):
- All-SSD: No spinning disk latency — suitable for random I/O workloads like model loading
- Dual 10GbE: 2× 10 Gbps network links via link aggregation
- 12.2 TB capacity: Enough for model cache, artifacts, and training data
- RustFS S3: S3-compatible object storage endpoint for pipeline artifacts and backups
- Use case: AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe
S3 Endpoint (RustFS)
Gravenhollow also provides S3-compatible object storage via RustFS:
- Endpoint: `https://gravenhollow.lab.daviestechlabs.io:30292`
- Use cases: Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- Credentials: Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`)
Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---|---|---|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
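In practice the guide translates to the storage-class field in each workload's volume spec. For example, a CloudNativePG cluster pinned to the `longhorn` tier might look like this sketch (cluster name, instance count, and size are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db               # illustrative name
spec:
  instances: 2
  storage:
    storageClass: longhorn   # fast, replicated block storage
    size: 20Gi
```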
Volume Usage by Tier
Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|---|---|---|---|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|---|---|---|---|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
Backup Strategy
Longhorn Tier
Local Snapshots
- Frequency: Nightly at 2 AM
- Retention: 7 days
- Purpose: Quick recovery from accidental deletion
Off-Cluster Backups
- Frequency: Weekly on Sundays at 3 AM
- Destination: S3-compatible storage (MinIO/Backblaze)
- Retention: 4 weeks
- Purpose: Disaster recovery
NFS Tier
NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
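With the credential secret in place, the backup target itself is configured through Longhorn's settings; a sketch, assuming a `longhorn-backups` bucket (Longhorn's S3 URL format requires a region token even for non-AWS endpoints):

```yaml
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/   # bucket and region are illustrative
  backupTargetCredentialSecret: longhorn-backup-secret
```

For a non-AWS S3 server, Longhorn also expects an `AWS_ENDPOINTS` key in the credential secret pointing at the endpoint URL.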
Node Exclusions (Longhorn Only)
Raspberry Pi nodes excluded because:
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
GPU nodes included with tolerations:
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
Performance Considerations
Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensure survival of a single node failure
- Trade-off: 2x storage consumption
NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
When to Choose Each Tier
| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|---|---|---|---|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |
Monitoring
Grafana Dashboard: Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
Alerts:
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
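The degraded-volume alert can be expressed against Longhorn's exported metrics; a PrometheusRule sketch (`longhorn_volume_robustness` reports 2 for degraded volumes; the rule name, duration, and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-volume-alerts   # illustrative name
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2   # 2 = degraded (replica count < desired)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```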
Future Enhancements
- ~~NAS high availability - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
- Dedicated storage network - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
- NVMe-oF - Network NVMe for lower latency
- Tiered Longhorn - Hot (NVMe) and warm (SSD) within Longhorn
- ~~S3 tier - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
- Migrate AI/ML PVCs to `nfs-fast` - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
- Longhorn backups to gravenhollow S3 - Use RustFS as off-cluster backup target