Tiered Storage Strategy: Longhorn + NFS
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
Decision Drivers
- Performance - fast IOPS for databases and critical workloads
- Capacity - large storage for media, datasets, and archives
- Reliability - data must survive node failures
- Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
- Backup capability - support for off-cluster backups
- GitOps deployment - Helm charts with Flux management
Considered Options
- Longhorn + NFS dual-tier storage
- Rook-Ceph for everything
- OpenEBS with Mayastor
- NFS only
- Longhorn only
Decision Outcome
Chosen option: Option 1 - Longhorn + NFS dual-tier storage
Two storage tiers optimized for different use cases:
- `longhorn` (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- `nfs-slow`: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
Positive Consequences
- Right-sized storage for each workload type
- Longhorn provides HA with automatic replication
- NFS provides massive capacity without consuming cluster disk space
- ReadWriteMany (RWX) easy on NFS tier
- Cost-effective - use existing NAS investment
Negative Consequences
- Two storage systems to manage
- NFS is slower (hence the `nfs-slow` naming)
- NFS is a single point of failure (no replication)
- Network dependency for both tiers
Architecture
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: LONGHORN │
│ (Fast Distributed Block Storage) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ khelben │ │ mystra │ │ selune │ │
│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │
│ │ │ │ │ │ │ │
│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │
│ │ longhorn │ │ longhorn │ │ longhorn │ │
│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Longhorn Manager │ │
│ │ (Schedules replicas) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: NFS-SLOW │
│ (High-Capacity Bulk Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ candlekeep.lab.daviestechlabs.io │ │
│ │ (External NAS) │ │
│ │ │ │
│ │ /kubernetes │ │
│ │ ├── jellyfin-media/ (1TB+ media library) │ │
│ │ ├── nextcloud/ (user files) │ │
│ │ ├── immich/ (photo backups) │ │
│ │ ├── kavita/ (ebooks, comics, manga) │ │
│ │ ├── mlflow-artifacts/ (model artifacts) │ │
│ │ ├── ray-models/ (AI model weights) │ │
│ │ └── gitea-runner/ (build caches) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
Tier 1: Longhorn Configuration
Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|---|---|---|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
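As a sketch, a non-default class from the table above could be defined like this (assuming the stock Longhorn CSI provisioner, `driver.longhorn.io`; the `longhorn-strict` definition here is illustrative, not copied from the cluster):

```yaml
# Illustrative StorageClass for the 3-replica tier.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
```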
Tier 2: NFS Configuration
Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- Latency: NAS is over network, higher latency than local NVMe
- IOPS: Spinning disks in NAS can't match SSD performance
- Throughput: Adequate for streaming media, not for databases
- Benefit: Massive capacity without consuming cluster disk space
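A minimal claim against this tier might look like the following (the claim name and size are illustrative; the `nfs-slow` class comes from the csi-driver-nfs values above):

```yaml
# Illustrative RWX claim on the capacity tier.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Ti
```

RWX is the key property here: multiple pods can mount the same NFS-backed volume, which Longhorn can only offer in a limited fashion.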
Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---|---|---|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
Volume Usage by Tier
Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|---|---|---|---|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|---|---|---|---|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
Backup Strategy
Longhorn Tier
Local Snapshots
- Frequency: Nightly at 2 AM
- Retention: 7 days
- Purpose: Quick recovery from accidental deletion
Off-Cluster Backups
- Frequency: Weekly on Sundays at 3 AM
- Destination: S3-compatible storage (MinIO/Backblaze)
- Retention: 4 weeks
- Purpose: Disaster recovery
NFS Tier
NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on the NAS RAID configuration for redundancy
Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
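The secret is then referenced from Longhorn's settings. A sketch of the corresponding Helm values, assuming an S3 backup target (the bucket URL below is a placeholder, not the real destination):

```yaml
# Sketch: pointing Longhorn at the S3 backup target.
# backupTarget is a placeholder URL; the credential secret name
# matches the ExternalSecret target above.
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```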
Node Exclusions (Longhorn Only)
Raspberry Pi nodes excluded because:
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
GPU nodes included with tolerations:
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn components to schedule there
Performance Considerations
Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensure a volume survives a single node failure
- Trade-off: 2x storage consumption
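The 2x consumption trade-off is simple arithmetic; a quick sketch using the sizes from the "Longhorn Volumes" table above:

```python
# Sizes (GiB) and replica counts from the Longhorn Volumes table.
volumes = {
    "prometheus": (50, 2),
    "vault": (2, 2),
    "clickhouse": (100, 2),
    "alertmanager": (1, 2),
}

def raw_usage_gib(size_gib: int, replicas: int) -> int:
    """Each Longhorn replica is a full copy, so raw usage is size * replicas."""
    return size_gib * replicas

total = sum(raw_usage_gib(s, r) for s, r in volumes.values())
print(total)  # 306 GiB of node disk for 153 GiB of provisioned volumes
```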
NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
When to Choose Each Tier
| Requirement | Longhorn | NFS-Slow |
|---|---|---|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Node failure survival | ✅ | ✅ (data lives on the external NAS, though the NAS itself is a SPOF) |
| Kubernetes-native | ✅ | ✅ |
Monitoring
Grafana Dashboard: Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
Alerts:
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
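The degraded-volume alert could be expressed as a PrometheusRule; a sketch assuming Longhorn's exported `longhorn_volume_robustness` metric, where a value of 2 denotes a degraded volume (verify the encoding against your Longhorn version):

```yaml
# Sketch of a degraded-volume alert; metric encoding is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          # robustness 2 = degraded in Longhorn's metric encoding
          expr: longhorn_volume_robustness == 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```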
Future Enhancements
- NAS high availability - Second NAS with replication
- Dedicated storage network - Separate VLAN for storage traffic
- NVMe-oF - Network NVMe for lower latency
- Tiered Longhorn - Hot (NVMe) and warm (SSD) within Longhorn
- S3 tier - MinIO for object storage workloads