# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Three storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) volumes are straightforward on the NFS tiers
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* HDD-backed NFS is slower than local block storage (hence the `nfs-slow` naming)
* NFS single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
                     TIER 1: LONGHORN
             (Fast Distributed Block Storage)

 ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
 │   khelben   │      │   mystra    │      │   selune    │
 │  (NVIDIA)   │      │    (AMD)    │      │    (AMD)    │
 │             │      │             │      │             │
 │  /var/mnt/  │      │  /var/mnt/  │      │  /var/mnt/  │
 │  longhorn   │      │  longhorn   │      │  longhorn   │
 │   (NVMe)    │      │    (SSD)    │      │    (SSD)    │
 └──────┬──────┘      └──────┬──────┘      └──────┬──────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                 ┌───────────────────────┐
                 │   Longhorn Manager    │
                 │ (Schedules replicas)  │
                 └───────────┬───────────┘
                             ▼
  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ Postgres │  │  Vault   │  │Prometheus│  │ClickHouse│
  │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │
  └──────────┘  └──────────┘  └──────────┘  └──────────┘

                     TIER 2: NFS-SLOW
               (High-Capacity Bulk Storage)

 ┌────────────────────────────────────────────────────────┐
 │            candlekeep.lab.daviestechlabs.io            │
 │                       (QNAP NAS)                       │
 │                                                        │
 │  /kubernetes                                           │
 │  ├── jellyfin-media/    (1TB+ media library)           │
 │  ├── nextcloud/         (user files)                   │
 │  ├── immich/            (photo backups)                │
 │  ├── kavita/            (ebooks, comics, manga)        │
 │  ├── mlflow-artifacts/  (model artifacts)              │
 │  ├── ray-models/        (AI model weights)             │
 │  └── gitea-runner/      (build caches)                 │
 └────────────────────────────┬───────────────────────────┘
                              ▼
                  ┌───────────────────────┐
                  │    NFS CSI Driver     │
                  │   (csi-driver-nfs)    │
                  └───────────┬───────────┘
                              ▼
   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
   │ Jellyfin │  │Nextcloud │  │  Immich  │  │  Kavita  │
   │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │
   └──────────┘  └──────────┘  └──────────┘  └──────────┘

                     TIER 3: NFS-FAST
          (High-Performance SSD NFS + S3 Storage)

 ┌────────────────────────────────────────────────────────────────┐
 │               gravenhollow.lab.daviestechlabs.io               │
 │        (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)        │
 │                                                                │
 │  NFS: /mnt/gravenhollow/kubernetes                             │
 │  ├── ray-model-cache/   (AI model weights - hot)               │
 │  ├── mlflow-artifacts/  (ML experiment tracking)               │
 │  └── training-data/     (datasets for fine-tuning)             │
 │                                                                │
 │  S3 (RustFS): https://gravenhollow.lab.daviestechlabs.io:30292 │
 │  ├── kubeflow-pipelines   (pipeline artifacts)                 │
 │  ├── training-data        (large dataset staging)              │
 │  └── longhorn-backups     (off-cluster backup target)          │
 └────────────────────────────────┬───────────────────────────────┘
                                  ▼
                      ┌───────────────────────┐
                      │    NFS CSI Driver     │
                      │   (csi-driver-nfs)    │
                      └───────────┬───────────┘
                                  ▼
             ┌──────────┐    ┌──────────┐    ┌──────────┐
             │Ray Model │    │  MLflow  │    │ Training │
             │  Cache   │    │ Artifact │    │   Data   │
             │   PVC    │    │   PVC    │    │   PVC    │
             └──────────┘    └──────────┘    └──────────┘
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"

# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
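The non-default classes are thin wrappers over the Longhorn CSI provisioner that vary the replica count. As a sketch (the exact manifest is not in this ADR), `longhorn-strict` could be defined like this; `dataLocality` and the timeout value are illustrative assumptions:

```yaml
# Hypothetical definition of the longhorn-strict class from the table above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"        # one replica per Longhorn storage node
  staleReplicaTimeout: "2880"  # minutes before a failed replica is cleaned up
  dataLocality: "best-effort"  # prefer a replica on the node running the pod
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

`longhorn-single` would be identical except for `numberOfReplicas: "1"`.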
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16   # Multiple TCP connections for throughput
    - hard          # Retry indefinitely on failure
    - noatime       # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
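Workloads consume the tier by naming the class in an ordinary PVC. A minimal sketch (the claim name is illustrative; the size matches the Jellyfin entry in the volume table below):

```yaml
# Hypothetical RWX claim on the nfs-slow tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany   # NFS allows many pods to mount the same share
  resources:
    requests:
      storage: 2Ti
```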
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
## Tier 3: NFS-Fast Configuration
### Helm Values (second csi-driver-nfs installation)
A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.
```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2      # Server-side copy, fallocate, seekhole
    - nconnect=16      # 16 TCP connections across bonded 10GbE
    - rsize=1048576    # 1 MB read block size
    - wsize=1048576    # 1 MB write block size
    - hard             # Retry indefinitely on timeout
    - noatime          # Skip access-time updates
    - nodiratime       # Skip directory access-time updates
    - nocto            # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600      # Cache attributes for 10 min
    - max_connect=16   # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
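Under Flux, the second installation would be a thin HelmRelease wrapping the values above. This is a sketch only — the repository reference, namespace, and interval are assumptions, not taken from the repo:

```yaml
# Hypothetical Flux wrapper for the second csi-driver-nfs installation
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: csi-driver-nfs-fast
  namespace: kube-system
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: csi-driver-nfs   # same OCI chart as the nfs-slow release
  values:
    controller:
      replicas: 0          # driver pods already run from the nfs-slow release
    node:
      enabled: false
    storageClass:
      create: true
      name: nfs-fast
```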
### Performance Tuning Rationale
| Option | Why |
|--------|-----|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |
### Why "nfs-fast"?
Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):
- **All-SSD:** No spinning disk latency — suitable for random I/O workloads like model loading
- **Dual 10GbE:** 2× 10 Gbps network links via link aggregation
- **12.2 TB capacity:** Enough for model cache, artifacts, and training data
- **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups
- **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe
### S3 Endpoint (RustFS)
Gravenhollow also provides S3-compatible object storage via RustFS:
- **Endpoint:** `https://gravenhollow.lab.daviestechlabs.io:30292`
- **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow`: `access_key`, `secret_key`)
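Following the same pattern as the Longhorn backup secret later in this ADR, the RustFS credentials could be materialized like this. The secret name is illustrative; the Vault path and properties are the ones listed above:

```yaml
# Hypothetical ExternalSecret pulling the RustFS S3 credentials from Vault
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: rustfs-s3-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: rustfs-s3-credentials
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/gravenhollow
        property: access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/gravenhollow
        property: secret_key
```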
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
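In practice, tier selection is a one-line difference in the PVC. A sketch contrasting the two extremes (claim names are illustrative; sizes match the volume tables below):

```yaml
# Database-style claim: replicated block storage, single-node mount
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  storageClassName: longhorn      # fast, replicated, low latency
  accessModes: [ReadWriteOnce]    # block volumes attach to one node at a time
  resources:
    requests:
      storage: 50Gi
---
# AI/ML-style claim: SSD NFS shareable across many workers
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ray-model-cache
spec:
  storageClassName: nfs-fast      # SSD NFS on gravenhollow
  accessModes: [ReadWriteMany]    # all Ray workers mount the same cache
  resources:
    requests:
      storage: 200Gi
```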
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
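The secret alone does nothing until Longhorn is pointed at the target. A sketch of the corresponding settings (the bucket name and region are illustrative — Longhorn's `s3://` URL format requires a region segment even for non-AWS endpoints):

```yaml
# Hypothetical Longhorn settings wiring up the backup target
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```

For a non-AWS endpoint, Longhorn also reads an `AWS_ENDPOINTS` key from the same credential secret to find the S3 server.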
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|-------------|----------|----------|----------|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
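The degraded-volume alert could be expressed as a PrometheusRule. This is a hedged sketch: it assumes the Prometheus Operator CRDs and Longhorn's `longhorn_volume_robustness` metric (1 = healthy, 2 = degraded, 3 = faulted):

```yaml
# Illustrative alert rule for degraded Longhorn volumes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```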
## Future Enhancements
1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
7. **Longhorn backups to gravenhollow S3** - Use RustFS as off-cluster backup target
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)