# Tiered Storage Strategy: Longhorn + NFS

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads that balances performance and capacity

## Context and Problem Statement

Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets

The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.

How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?

## Decision Drivers

* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management

## Considered Options

1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**

## Decision Outcome

Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**

Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage

### Positive Consequences

* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) is straightforward on the NFS tier
* Cost-effective - reuses the existing NAS investment

### Negative Consequences

* Two storage systems to manage
* NFS is slower (hence the `nfs-slow` name)
* The NAS is a single point of failure (no replication)
* Both tiers depend on the network: Longhorn replicates between nodes, and NFS is served from the external NAS

## Architecture

```
┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 1: LONGHORN                              │
│                        (Fast Distributed Block Storage)                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                         │
│  │   khelben   │  │   mystra    │  │   selune    │                         │
│  │  (NVIDIA)   │  │   (AMD)     │  │   (AMD)     │                         │
│  │             │  │             │  │             │                         │
│  │ /var/mnt/   │  │ /var/mnt/   │  │ /var/mnt/   │                         │
│  │  longhorn   │  │  longhorn   │  │  longhorn   │                         │
│  │  (NVMe)     │  │  (SSD)      │  │  (SSD)      │                         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                         │
│         │                │                │                                 │
│         └────────────────┼────────────────┘                                 │
│                          ▼                                                  │
│              ┌───────────────────────┐                                      │
│              │   Longhorn Manager    │                                      │
│              │  (Schedules replicas) │                                      │
│              └───────────┬───────────┘                                      │
│                          ▼                                                  │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
│     │ Postgres │  │  Vault   │  │Prometheus│  │ClickHouse│                 │
│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 2: NFS-SLOW                              │
│                        (High-Capacity Bulk Storage)                         │
│                                                                            │
│  ┌────────────────────────────────────────────────────────────────┐        │
│  │                  candlekeep.lab.daviestechlabs.io              │        │
│  │                        (External NAS)                           │        │
│  │                                                                 │        │
│  │   /kubernetes                                                   │        │
│  │   ├── jellyfin-media/     (1TB+ media library)                 │        │
│  │   ├── nextcloud/          (user files)                         │        │
│  │   ├── immich/             (photo backups)                      │        │
│  │   ├── kavita/             (ebooks, comics, manga)              │        │
│  │   ├── mlflow-artifacts/   (model artifacts)                    │        │
│  │   ├── ray-models/         (AI model weights)                   │        │
│  │   └── gitea-runner/       (build caches)                       │        │
│  └────────────────────────────────────────────────────────────────┘        │
│                          │                                                  │
│                          ▼                                                  │
│              ┌───────────────────────┐                                      │
│              │   NFS CSI Driver      │                                      │
│              │  (csi-driver-nfs)     │                                      │
│              └───────────┬───────────┘                                      │
│                          ▼                                                  │
│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
│     │ Jellyfin │  │Nextcloud │  │  Immich  │  │  Kavita  │                 │
│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
└────────────────────────────────────────────────────────────────────────────┘
```

## Tier 1: Longhorn Configuration

### Helm Values

```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
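  # Tolerate the vllm taint so Longhorn components can run on the GPU nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Run system-managed components on amd64 only, which excludes the ARM64 Raspberry Pi nodes
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
  # Recurring snapshot and backup schedule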
  defaultRecurringJobs:
    - name: nightly-snapshots
      task: snapshot
      cron: "0 2 * * *"
      retain: 7
    - name: weekly-backups
      task: backup
      cron: "0 3 * * 0"
      retain: 4
```

### Longhorn Storage Classes

| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
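
The non-default classes in the table are not created by the Helm values above, so they need their own manifests. A minimal sketch of what `longhorn-strict` might look like, assuming the standard Longhorn CSI provisioner name and typical parameter values (the `staleReplicaTimeout` and `fsType` values are illustrative, not taken from this cluster):

```yaml
# Sketch only: a 3-replica Longhorn StorageClass for critical databases.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"          # one replica per storage node in a 3-node pool
  staleReplicaTimeout: "2880"    # minutes; illustrative value
  fsType: ext4                   # assumption; pick the filesystem the workload needs
```

`longhorn-single` would follow the same shape with `numberOfReplicas: "1"`.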

## Tier 2: NFS Configuration

### Helm Values (csi-driver-nfs)

```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16    # Multiple TCP connections for throughput
    - hard           # Retry indefinitely on failure
    - noatime        # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

### Why "nfs-slow"?

The naming is intentional - it sets correct expectations:
- **Latency:** The NAS is reached over the network, so latency is higher than local NVMe
- **IOPS:** Spinning disks in the NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for write-heavy databases
- **Benefit:** Large capacity without consuming cluster disk space

## Storage Tier Selection Guide

| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | `longhorn` for critical databases; `nfs-slow` for small, low-traffic ones (e.g. Gitea DB) |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
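
In practice, picking a tier comes down to the `storageClassName` and access mode on the claim. A minimal sketch with one claim per tier; the claim names and sizes are hypothetical placeholders:

```yaml
# Performance tier: single-writer block volume replicated by Longhorn.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-db-data          # hypothetical name
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi              # illustrative size
---
# Capacity tier: shared volume on the NAS, mountable by several pods at once.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-media            # hypothetical name
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Gi             # illustrative size
```

Because `longhorn` is the default class, a claim that omits `storageClassName` lands on the performance tier; bulk workloads have to ask for `nfs-slow` explicitly.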

## Volume Usage by Tier

### Longhorn Volumes (Performance Tier)

| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |

### NFS Volumes (Capacity Tier)

| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |

## Backup Strategy

### Longhorn Tier

#### Local Snapshots

- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion

#### Off-Cluster Backups

- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
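
Longhorn also models recurring jobs as `RecurringJob` custom resources. A sketch of the same two schedules in that form, assuming the `longhorn.io/v1beta2` API version and the built-in `default` group (which applies to volumes with no explicit job assignment); the `concurrency` value is an assumption:

```yaml
# Sketch only: nightly local snapshots, kept for 7 runs.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-snapshots
  namespace: longhorn-system
spec:
  task: snapshot
  cron: "0 2 * * *"      # every day at 02:00
  retain: 7
  concurrency: 1         # assumption: process one volume at a time
  groups:
    - default
---
# Sketch only: weekly off-cluster backups, kept for 4 runs.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: weekly-backups
  namespace: longhorn-system
spec:
  task: backup
  cron: "0 3 * * 0"      # Sundays at 03:00
  retain: 4
  concurrency: 1         # assumption
  groups:
    - default
```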

### NFS Tier

#### NAS-Level Backups

- Handled by the NAS's own backup tooling (snapshots, replication)
- Not managed by Kubernetes
- Relies on the NAS RAID configuration for disk redundancy

### Backup Target Configuration (Longhorn)

```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
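
The secret only carries credentials; Longhorn also needs to be told where the backup target lives and which secret to use. A hedged sketch of the matching Helm values, where the bucket name and region are hypothetical and the key location depends on the chart version (older charts take these under `defaultSettings`, while newer ones move them to a `defaultBackupStore` block, so check the values file of the chart in use):

```yaml
# Sketch only: backup target wiring for an S3-compatible store.
defaultSettings:
  # Format: s3://<bucket>@<region>/<optional-path>
  backupTarget: s3://longhorn-backups@us-east-1/   # hypothetical bucket and region
  backupTargetCredentialSecret: longhorn-backup-secret
```

For a non-AWS endpoint such as MinIO, the secret would typically also need an `AWS_ENDPOINTS` key pointing at the service URL.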

## Node Exclusions (Longhorn Only)

**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components

**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there

## Performance Considerations

### Longhorn Performance

- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas on different nodes ensure that volumes survive a single node failure
- Trade-off: 2x storage consumption
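
As a rough worked example of that trade-off, take the Longhorn volume sizes listed under Volume Usage by Tier (Prometheus, Vault, ClickHouse, Alertmanager), all at 2 replicas. Longhorn volumes are thin-provisioned, so this is an upper bound on raw disk consumed rather than actual usage:

$$
(50 + 2 + 100 + 1)\,\text{Gi} \times 2 = 153\,\text{Gi} \times 2 = 306\,\text{Gi}
$$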

### NFS Performance

- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead

### When to Choose Each Tier

| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Node failure survival | ✅ | ⚠️ Survives cluster node loss; the NAS itself is a single point of failure |
| Kubernetes-native | ✅ | ✅ |

## Monitoring

**Grafana dashboard:** The Longhorn dashboard tracks:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status

**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
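
These alerts could be expressed as a `PrometheusRule`. This is a sketch only: it assumes the Prometheus Operator CRDs are installed and that Longhorn's metrics endpoint is being scraped. The rule name, `for` durations, and severities are hypothetical, and the metric names and state codes should be checked against the Longhorn version in use:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-storage-alerts      # hypothetical name
  namespace: longhorn-system
spec:
  groups:
    - name: longhorn.rules
      rules:
        - alert: LonghornVolumeDegraded
          # Assumed encoding: 1 = healthy, 2 = degraded, 3 = faulted
          expr: longhorn_volume_robustness == 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
        - alert: LonghornDiskSpaceLow
          # Fires when less than 20% of a disk's capacity is free
          expr: (longhorn_disk_capacity_bytes - longhorn_disk_usage_bytes) / longhorn_disk_capacity_bytes < 0.20
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn disk on {{ $labels.node }} has less than 20% free"
        - alert: LonghornBackupFailed
          # Assumed encoding: 4 = error
          expr: longhorn_backup_state == 4
          labels:
            severity: critical
          annotations:
            summary: "Longhorn backup for volume {{ $labels.volume }} failed"
```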

## Future Enhancements

1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads

## References

* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)