docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) access is straightforward on the NFS tier
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* NFS is slower (hence `nfs-slow` naming)
* NFS single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: LONGHORN │
│ (Fast Distributed Block Storage) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ khelben │ │ mystra │ │ selune │ │
│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │
│ │ │ │ │ │ │ │
│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │
│ │ longhorn │ │ longhorn │ │ longhorn │ │
│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Longhorn Manager │ │
│ │ (Schedules replicas) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: NFS-SLOW │
│ (High-Capacity Bulk Storage) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ candlekeep.lab.daviestechlabs.io │ │
│ │ (External NAS) │ │
│ │ │ │
│ │ /kubernetes │ │
│ │ ├── jellyfin-media/ (1TB+ media library) │ │
│ │ ├── nextcloud/ (user files) │ │
│ │ ├── immich/ (photo backups) │ │
│ │ ├── kavita/ (ebooks, comics, manga) │ │
│ │ ├── mlflow-artifacts/ (model artifacts) │ │
│ │ ├── ray-models/ (AI model weights) │ │
│ │ └── gitea-runner/ (build caches) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ NFS CSI Driver │ │
│ │ (csi-driver-nfs) │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │
│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
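Recent Longhorn releases (1.2+) also model recurring jobs as first-class custom resources, which fits the GitOps workflow better than chart values. A sketch of the nightly snapshot job above expressed as a `RecurringJob` CR (group membership assumed):

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-snapshots
  namespace: longhorn-system
spec:
  task: snapshot          # snapshot (local) vs backup (off-cluster)
  cron: "0 2 * * *"       # nightly at 2 AM
  retain: 7               # keep 7 snapshots
  concurrency: 1
  groups:
    - default             # applies to all volumes in the default group
```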
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
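For illustration, a sketch of how the non-default `longhorn-strict` class might be defined (`numberOfReplicas` and `staleReplicaTimeout` are standard Longhorn StorageClass parameters; the exact values here are assumptions):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain           # critical data: keep volume on PVC deletion
parameters:
  numberOfReplicas: "3"         # survive two node failures
  staleReplicaTimeout: "30"
```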
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
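Consuming this tier is a plain PVC; for illustration, a claim matching the Jellyfin entry in the volume tables below (names assumed):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  accessModes:
    - ReadWriteMany            # RWX is cheap on NFS
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```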
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
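As one example of tier selection in practice, a minimal CloudNativePG `Cluster` sketch pinning the capacity-optimized Gitea database to the NFS tier (instance count and name assumed):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitea-db
spec:
  instances: 2
  storage:
    size: 10Gi
    storageClass: nfs-slow    # capacity over IOPS for this workload
```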
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
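The secret above is then referenced by Longhorn's backup settings (`backupTarget` and `backupTargetCredentialSecret` are standard Longhorn setting names); a sketch with an assumed bucket and region:

```yaml
defaultSettings:
  # s3://<bucket>@<region>/ -- bucket name here is an assumption
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```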
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Cluster node failure survival | ✅ | ✅ (NAS is external; the NAS itself remains a SPOF) |
| Kubernetes-native | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
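The degraded-volume alert can be expressed as a PrometheusRule against Longhorn's `longhorn_volume_robustness` metric (value 2 indicates degraded); a sketch, with thresholds and labels assumed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-alerts
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2   # 2 = degraded replica set
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"
```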
## Future Enhancements
1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)