# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Three storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) volumes are straightforward on the NFS tiers
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* HDD-backed NFS is slower than local block storage (hence the `nfs-slow` naming)
* NFS single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
                     TIER 1: LONGHORN
             (Fast Distributed Block Storage)

 ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
 │   khelben   │      │   mystra    │      │   selune    │
 │  (NVIDIA)   │      │    (AMD)    │      │    (AMD)    │
 │             │      │             │      │             │
 │  /var/mnt/  │      │  /var/mnt/  │      │  /var/mnt/  │
 │  longhorn   │      │  longhorn   │      │  longhorn   │
 │   (NVMe)    │      │    (SSD)    │      │    (SSD)    │
 └──────┬──────┘      └──────┬──────┘      └──────┬──────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                 ┌───────────────────────┐
                 │   Longhorn Manager    │
                 │ (Schedules replicas)  │
                 └───────────┬───────────┘
                             ▼
  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ Postgres │  │  Vault   │  │Prometheus│  │ClickHouse│
  │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │
  └──────────┘  └──────────┘  └──────────┘  └──────────┘

                     TIER 2: NFS-SLOW
               (High-Capacity Bulk Storage)

 ┌────────────────────────────────────────────────────────┐
 │            candlekeep.lab.daviestechlabs.io            │
 │                       (QNAP NAS)                       │
 │                                                        │
 │  /kubernetes                                           │
 │  ├── jellyfin-media/    (1TB+ media library)           │
 │  ├── nextcloud/         (user files)                   │
 │  ├── immich/            (photo backups)                │
 │  ├── kavita/            (ebooks, comics, manga)        │
 │  ├── mlflow-artifacts/  (model artifacts)              │
 │  ├── ray-models/        (AI model weights)             │
 │  └── gitea-runner/      (build caches)                 │
 └────────────────────────────┬───────────────────────────┘
                              ▼
                  ┌───────────────────────┐
                  │    NFS CSI Driver     │
                  │   (csi-driver-nfs)    │
                  └───────────┬───────────┘
                              ▼
   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
   │ Jellyfin │  │Nextcloud │  │  Immich  │  │  Kavita  │
   │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │
   └──────────┘  └──────────┘  └──────────┘  └──────────┘

                     TIER 3: NFS-FAST
          (High-Performance SSD NFS + S3 Storage)

 ┌────────────────────────────────────────────────────────────────┐
 │               gravenhollow.lab.daviestechlabs.io               │
 │        (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)        │
 │                                                                │
 │  NFS: /mnt/gravenhollow/kubernetes                             │
 │  ├── ray-model-cache/   (AI model weights - hot)               │
 │  ├── mlflow-artifacts/  (ML experiment tracking)               │
 │  └── training-data/     (datasets for fine-tuning)             │
 │                                                                │
 │  S3 (RustFS): https://gravenhollow.lab.daviestechlabs.io:30292 │
 │  ├── kubeflow-pipelines   (pipeline artifacts)                 │
 │  ├── training-data        (large dataset staging)              │
 │  └── longhorn-backups     (off-cluster backup target)          │
 └────────────────────────────────┬───────────────────────────────┘
                                  ▼
                      ┌───────────────────────┐
                      │    NFS CSI Driver     │
                      │   (csi-driver-nfs)    │
                      └───────────┬───────────┘
                                  ▼
             ┌──────────┐    ┌──────────┐    ┌──────────┐
             │Ray Model │    │  MLflow  │    │ Training │
             │  Cache   │    │ Artifact │    │   Data   │
             │   PVC    │    │   PVC    │    │   PVC    │
             └──────────┘    └──────────┘    └──────────┘
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"

# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
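The non-default classes are thin wrappers over the Longhorn CSI provisioner that vary the replica count. As a sketch (the exact manifest is not in this ADR), `longhorn-strict` could be defined like this; `dataLocality` and the timeout value are illustrative assumptions:

```yaml
# Hypothetical definition of the longhorn-strict class from the table above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"        # one replica per Longhorn storage node
  staleReplicaTimeout: "2880"  # minutes before a failed replica is cleaned up
  dataLocality: "best-effort"  # prefer a replica on the node running the pod
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

`longhorn-single` would be identical except for `numberOfReplicas: "1"`.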
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16   # Multiple TCP connections for throughput
    - hard          # Retry indefinitely on failure
    - noatime       # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
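Workloads consume the tier by naming the class in an ordinary PVC. A minimal sketch (the claim name is illustrative; the size matches the Jellyfin entry in the volume table below):

```yaml
# Hypothetical RWX claim on the nfs-slow tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany   # NFS allows many pods to mount the same share
  resources:
    requests:
      storage: 2Ti
```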
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
## Tier 3: NFS-Fast Configuration
### Helm Values (second csi-driver-nfs installation)
A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.
```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2      # Server-side copy, fallocate, seekhole
    - nconnect=16      # 16 TCP connections across bonded 10GbE
    - rsize=1048576    # 1 MB read block size
    - wsize=1048576    # 1 MB write block size
    - hard             # Retry indefinitely on timeout
    - noatime          # Skip access-time updates
    - nodiratime       # Skip directory access-time updates
    - nocto            # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600      # Cache attributes for 10 min
    - max_connect=16   # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
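Under Flux, the second installation would be a thin HelmRelease wrapping the values above. This is a sketch only — the repository reference, namespace, and interval are assumptions, not taken from the repo:

```yaml
# Hypothetical Flux wrapper for the second csi-driver-nfs installation
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: csi-driver-nfs-fast
  namespace: kube-system
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: csi-driver-nfs   # same OCI chart as the nfs-slow release
  values:
    controller:
      replicas: 0          # driver pods already run from the nfs-slow release
    node:
      enabled: false
    storageClass:
      create: true
      name: nfs-fast
```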
### Performance Tuning Rationale
| Option | Why |
|--------|-----|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |
### Why "nfs-fast"?
Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):
- **All-SSD:** No spinning disk latency — suitable for random I/O workloads like model loading
- **Dual 10GbE:** 2× 10 Gbps network links via link aggregation
- **12.2 TB capacity:** Enough for model cache, artifacts, and training data
- **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups
- **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe
### S3 Endpoint (RustFS)
Gravenhollow also provides S3-compatible object storage via RustFS:
- **Endpoint:** `https://gravenhollow.lab.daviestechlabs.io:30292`
- **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow`: `access_key`, `secret_key`)
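Following the same pattern as the Longhorn backup secret later in this ADR, the RustFS credentials could be materialized like this. The secret name is illustrative; the Vault path and properties are the ones listed above:

```yaml
# Hypothetical ExternalSecret pulling the RustFS S3 credentials from Vault
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: rustfs-s3-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: rustfs-s3-credentials
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/gravenhollow
        property: access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/gravenhollow
        property: secret_key
```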
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
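In practice, tier selection is a one-line difference in the PVC. A sketch contrasting the two extremes (claim names are illustrative; sizes match the volume tables below):

```yaml
# Database-style claim: replicated block storage, single-node mount
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  storageClassName: longhorn      # fast, replicated, low latency
  accessModes: [ReadWriteOnce]    # block volumes attach to one node at a time
  resources:
    requests:
      storage: 50Gi
---
# AI/ML-style claim: SSD NFS shareable across many workers
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ray-model-cache
spec:
  storageClassName: nfs-fast      # SSD NFS on gravenhollow
  accessModes: [ReadWriteMany]    # all Ray workers mount the same cache
  resources:
    requests:
      storage: 200Gi
```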
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
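The secret alone does nothing until Longhorn is pointed at the target. A sketch of the corresponding settings (the bucket name and region are illustrative — Longhorn's `s3://` URL format requires a region segment even for non-AWS endpoints):

```yaml
# Hypothetical Longhorn settings wiring up the backup target
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```

For a non-AWS endpoint, Longhorn also reads an `AWS_ENDPOINTS` key from the same credential secret to find the S3 server.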
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|-------------|----------|----------|----------|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
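The degraded-volume alert could be expressed as a PrometheusRule. This is a hedged sketch: it assumes the Prometheus Operator CRDs and Longhorn's `longhorn_volume_robustness` metric (1 = healthy, 2 = degraded, 3 = faulted):

```yaml
# Illustrative alert rule for degraded Longhorn volumes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```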
## Future Enhancements
1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
7. **Longhorn backups to gravenhollow S3** - Use RustFS as off-cluster backup target
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)