# Tiered Storage Strategy: Longhorn + NFS

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity

## Context and Problem Statement

Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:

- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets

The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.

How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?

## Decision Drivers

* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management

## Considered Options

1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**

## Decision Outcome

Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**

Three storage tiers optimized for different use cases:

- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage

### Positive Consequences

* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) is easy on the NFS tiers
* Cost-effective - uses existing NAS investment

### Negative Consequences

* Two storage systems to manage
* NFS is slower than local block storage (hence the `nfs-slow` naming for the HDD tier)
* Each NFS server is a single point of failure (no cross-NAS replication)
* Network dependency for both tiers

## Architecture

```
TIER 1: LONGHORN (fast distributed block storage)

   khelben (NVIDIA)     mystra (AMD)       selune (AMD)
   /var/mnt/longhorn    /var/mnt/longhorn  /var/mnt/longhorn
   (NVMe)               (SSD)              (SSD)
          │                   │                  │
          └───────────────────┼──────────────────┘
                              ▼
               Longhorn Manager (schedules replicas)
                              ▼
    Postgres PVC │ Vault PVC │ Prometheus PVC │ ClickHouse PVC


TIER 2: NFS-SLOW (high-capacity bulk storage)

   candlekeep.lab.daviestechlabs.io (QNAP NAS)
   /kubernetes
   ├── jellyfin-media/    (1TB+ media library)
   ├── nextcloud/         (user files)
   ├── immich/            (photo backups)
   ├── kavita/            (ebooks, comics, manga)
   ├── mlflow-artifacts/  (model artifacts)
   ├── ray-models/        (AI model weights)
   └── gitea-runner/      (build caches)
                              │
                              ▼
                 NFS CSI Driver (csi-driver-nfs)
                              ▼
    Jellyfin PVC │ Nextcloud PVC │ Immich PVC │ Kavita PVC


TIER 3: NFS-FAST (high-performance SSD NFS + S3 storage)

   gravenhollow.lab.daviestechlabs.io
   (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)

   NFS: /mnt/gravenhollow/kubernetes
   ├── ray-model-cache/   (AI model weights - hot)
   ├── mlflow-artifacts/  (ML experiment tracking)
   └── training-data/     (datasets for fine-tuning)

   S3 (RustFS): https://gravenhollow.lab.daviestechlabs.io:30292
   ├── kubeflow-pipelines  (pipeline artifacts)
   ├── training-data       (large dataset staging)
   └── longhorn-backups    (off-cluster backup target)
                              │
                              ▼
                 NFS CSI Driver (csi-driver-nfs)
                              ▼
    Ray Model Cache PVC │ MLflow Artifact PVC │ Training Data PVC
```

## Tier 1: Longhorn Configuration

### Helm Values

```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
  # Snapshot retention
  defaultRecurringJobs:
    - name: nightly-snapshots
      task: snapshot
      cron: "0 2 * * *"
      retain: 7
    - name: weekly-backups
      task: backup
      cron: "0 3 * * 0"
      retain: 4
```

### Longhorn Storage Classes

| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
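
The two non-default classes are ordinary `StorageClass` manifests pointing at the Longhorn provisioner. A minimal sketch of `longhorn-strict` (the `staleReplicaTimeout` and `dataLocality` values here are illustrative assumptions, not taken from the cluster config):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"       # one replica per Longhorn storage node
  staleReplicaTimeout: "30"   # illustrative: minutes before a failed replica is rebuilt
  dataLocality: best-effort   # illustrative: prefer a replica on the consuming node
```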

## Tier 2: NFS Configuration

### Helm Values (csi-driver-nfs)

```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16   # Multiple TCP connections for throughput
    - hard          # Retry indefinitely on failure
    - noatime       # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

### Why "nfs-slow"?

The naming is intentional - it sets correct expectations:

- **Latency:** NAS is over the network, higher latency than local NVMe
- **IOPS:** Spinning disks in the NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
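
Consuming the tier is a plain PVC; a sketch for a media volume (claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media   # illustrative name
spec:
  accessModes:
    - ReadWriteMany      # RWX is straightforward on NFS
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```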

## Tier 3: NFS-Fast Configuration

### Helm Values (second csi-driver-nfs installation)

A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.

```yaml
controller:
  replicas: 0
node:
  enabled: false
storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2      # Server-side copy, fallocate, SEEK_HOLE
    - nconnect=16      # 16 TCP connections across bonded 10GbE
    - rsize=1048576    # 1 MB read block size
    - wsize=1048576    # 1 MB write block size
    - hard             # Retry indefinitely on timeout
    - noatime          # Skip access-time updates
    - nodiratime       # Skip directory access-time updates
    - nocto            # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600      # Cache attributes for 10 min
    - max_connect=16   # Allow up to 16 connections to the same server
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
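
In Flux terms, the second install can be sketched as a HelmRelease that points `chartRef` at the same OCI chart source; the source name and namespace below are assumptions:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: csi-driver-nfs-fast
  namespace: kube-system    # assumption
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: csi-driver-nfs    # assumption: shared with the nfs-slow release
  values:
    controller:
      replicas: 0           # driver pods already run from the nfs-slow install
    node:
      enabled: false
    storageClass:
      create: true
      name: nfs-fast
```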

### Performance Tuning Rationale

| Option | Why |
|--------|-----|
| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members |
| `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
| `nodiratime` | Avoids unnecessary directory timestamp writes alongside `noatime` |

### Why "nfs-fast"?

Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):

- **All-SSD:** No spinning disk latency — suitable for random I/O workloads like model loading
- **Dual 10GbE:** 2× 10 Gbps network links via link aggregation
- **12.2 TB capacity:** Enough for model cache, artifacts, and training data
- **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups
- **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe

### S3 Endpoint (RustFS)

Gravenhollow also provides S3-compatible object storage via RustFS:

- **Endpoint:** `https://gravenhollow.lab.daviestechlabs.io:30292`
- **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
- **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`)
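
The credential wiring can be sketched as an ExternalSecret pulling both keys from that Vault path (the secret name here is illustrative):

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: gravenhollow-s3   # illustrative name
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: gravenhollow-s3
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/gravenhollow
        property: access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/gravenhollow
        property: secret_key
```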

## Storage Tier Selection Guide

| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
| Training data | `nfs-fast` | Dataset staging for fine-tuning |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
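
Workloads opt into a tier via `storageClassName`. For example, a CloudNativePG cluster on the Longhorn tier might look like this (cluster name and size are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db   # illustrative name
spec:
  instances: 2
  storage:
    storageClass: longhorn   # Tier 1: replicated block storage
    size: 10Gi
```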

## Volume Usage by Tier

### Longhorn Volumes (Performance Tier)

| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |

### NFS Volumes (Capacity Tier)

| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |

## Backup Strategy

### Longhorn Tier

#### Local Snapshots

- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion

#### Off-Cluster Backups

- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery

### NFS Tier

#### NAS-Level Backups

- Handled by the NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy

### Backup Target Configuration (Longhorn)

```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
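
The secret is then referenced from Longhorn's backup settings; a Helm values sketch (the bucket name and region in the URL are placeholders, not the real target):

```yaml
defaultSettings:
  # Format: s3://<bucket>@<region>/ (placeholder values below)
  backupTarget: "s3://longhorn-backups@us-east-1/"
  backupTargetCredentialSecret: longhorn-backup-secret
```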

## Node Exclusions (Longhorn Only)

**Raspberry Pi nodes excluded because:**

- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components

**GPU nodes included with tolerations:**

- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there

## Performance Considerations

### Longhorn Performance

- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas on different nodes ensure data survives a single node failure
- Trade-off: 2x storage consumption

### NFS Performance

- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead

### When to Choose Each Tier

| Requirement | Longhorn | NFS-Fast | NFS-Slow |
|-------------|----------|----------|----------|
| Low latency | ✅ | ⚡ | ❌ |
| High IOPS | ✅ | ⚡ | ❌ |
| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
| Kubernetes-native | ✅ | ✅ | ✅ |

## Monitoring

**Grafana Dashboard:** Longhorn dashboard for:

- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status

**Alerts:**

- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
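
The degraded-volume alert can be sketched as a PrometheusRule, assuming Longhorn's `longhorn_volume_robustness` metric (1 = healthy, 2 = degraded, 3 = faulted) is being scraped; the rule name is illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts   # illustrative name
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          # robustness 2 means the volume is running with fewer replicas than desired
          expr: longhorn_volume_robustness == 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```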

## Future Enhancements

1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
7. **Longhorn backups to gravenhollow S3** - Use RustFS as off-cluster backup target

## References

* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)