docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) access is straightforward on the NFS tier
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* NFS is slower (hence `nfs-slow` naming)
* NFS single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: LONGHORN │
│ (Fast Distributed Block Storage) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ khelben │ │ mystra │ │ selune │ │
│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │
│ │ │ │ │ │ │ │
│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │
│ │ longhorn │ │ longhorn │ │ longhorn │ │
│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Longhorn Manager │ │
│ │ (Schedules replicas) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: NFS-SLOW │
│ (High-Capacity Bulk Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ candlekeep.lab.daviestechlabs.io │ │
│ │ (External NAS) │ │
│ │ │ │
│ │ /kubernetes │ │
│ │ ├── jellyfin-media/ (1TB+ media library) │ │
│ │ ├── nextcloud/ (user files) │ │
│ │ ├── immich/ (photo backups) │ │
│ │ ├── kavita/ (ebooks, comics, manga) │ │
│ │ ├── mlflow-artifacts/ (model artifacts) │ │
│ │ ├── ray-models/ (AI model weights) │ │
│ │ └── gitea-runner/ (build caches) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
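Recent Longhorn releases (1.2+) also model recurring jobs as first-class custom resources, which fits the GitOps workflow better than chart values. A sketch of the nightly snapshot job above expressed as a `RecurringJob` CR (group membership assumed):

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-snapshots
  namespace: longhorn-system
spec:
  task: snapshot          # snapshot (local) vs backup (off-cluster)
  cron: "0 2 * * *"       # nightly at 2 AM
  retain: 7               # keep 7 snapshots
  concurrency: 1
  groups:
    - default             # applies to all volumes in the default group
```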
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
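For illustration, a sketch of how the non-default `longhorn-strict` class might be defined (`numberOfReplicas` and `staleReplicaTimeout` are standard Longhorn StorageClass parameters; the exact values here are assumptions):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain           # critical data: keep volume on PVC deletion
parameters:
  numberOfReplicas: "3"         # survive two node failures
  staleReplicaTimeout: "30"
```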
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
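Consuming this tier is a plain PVC; for illustration, a claim matching the Jellyfin entry in the volume tables below (names assumed):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  accessModes:
    - ReadWriteMany            # RWX is cheap on NFS
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```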
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
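As one example of tier selection in practice, a minimal CloudNativePG `Cluster` sketch pinning the capacity-optimized Gitea database to the NFS tier (instance count and name assumed):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
spec:
  instances: 2
  storage:
    size: 10Gi
    storageClass: nfs-slow    # capacity over IOPS for this workload
```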
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
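The secret above is then referenced by Longhorn's backup settings (`backupTarget` and `backupTargetCredentialSecret` are standard Longhorn setting names); a sketch with an assumed bucket and region:

```yaml
defaultSettings:
  # s3://<bucket>@<region>/ -- bucket name here is an assumption
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```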
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Cluster node failure survival | ✅ | ✅ (NAS is external; the NAS itself remains a SPOF) |
| Kubernetes-native | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
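The degraded-volume alert can be expressed as a PrometheusRule against Longhorn's `longhorn_volume_robustness` metric (value 2 indicates degraded); a sketch, with thresholds and labels assumed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2   # 2 = degraded replica set
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```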
## Future Enhancements
1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)