docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
2026-02-04 08:55:15 -05:00
parent a128c265e4
commit b43c80153c
4 changed files with 1282 additions and 0 deletions
--- a/decisions/0025-observability-stack.md
+++ b/decisions/0025-observability-stack.md
@@ -0,0 +1,239 @@
+# Observability Stack Architecture
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
+
+## Context and Problem Statement
+
+A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
+
+How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
+
+## Decision Drivers
+
+* Three pillars coverage - metrics, logs, and traces all addressed
+* Unified visualization - single pane of glass for all telemetry
+* Resource efficiency - don't overwhelm the cluster with observability overhead
+* OpenTelemetry compatibility - future-proof instrumentation standard
+* GitOps deployment - all configuration version-controlled
+
+## Considered Options
+
+1. **Prometheus + ClickStack + OpenTelemetry Collector**
+2. **Prometheus + Loki + Tempo (PLT Stack)**
+3. **Datadog/New Relic (SaaS)**
+4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
+
+Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
+
+### Positive Consequences
+
+* Prometheus ecosystem is mature with extensive service monitor support
+* ClickHouse provides fast querying for logs and traces at scale
+* OpenTelemetry is vendor-neutral and industry standard
+* Grafana provides unified dashboards for all data sources
+* Cost-effective (no SaaS fees)
+
+### Negative Consequences
+
+* More complex than pure SaaS solutions
+* ClickHouse requires storage management
+* Multiple components to maintain
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Applications                                  │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
+│  │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │            │
+│  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │            │
+│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
+└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
+        │             │             │             │
+        └──────────────────┬────────────────────────┘
+                           │ OTLP (gRPC/HTTP)
+                           ▼
+              ┌────────────────────────┐
+              │  OpenTelemetry         │
+              │  Collector             │
+              │  (traces, metrics,     │
+              │   logs)                │
+              └───────────┬────────────┘
+                          │
+          ┌───────────────┼───────────────┐
+          │               │               │
+          ▼               ▼               ▼
+┌─────────────────┐ ┌───────────┐ ┌───────────────┐
+│   ClickStack    │ │Prometheus │ │   Grafana     │
+│   (ClickHouse)  │ │           │ │               │
+│  ┌───────────┐  │ │ Metrics   │ │  Dashboards   │
+│  │  Traces   │  │ │ Storage   │ │  Alerting     │
+│  ├───────────┤  │ │           │ │  Exploration  │
+│  │   Logs    │  │ └───────────┘ │               │
+│  └───────────┘  │               └───────────────┘
+└─────────────────┘                      │
+                                         │
+                    ┌────────────────────┤
+                    │                    │
+              ┌─────▼─────┐        ┌─────▼─────┐
+              │Alertmanager│        │   ntfy    │
+              │           │        │ (push)    │
+              └───────────┘        └───────────┘
+```
+
+## Component Details
+
+### Metrics: Prometheus + kube-prometheus-stack
+
+**Deployment:** HelmRelease via Flux
+
+```yaml
+prometheus:
+  prometheusSpec:
+    retention: 14d
+    retentionSize: 50GB
+    # kube-prometheus-stack nests the PVC template under storageSpec
+    storageSpec:
+      volumeClaimTemplate:
+        spec:
+          storageClassName: longhorn
+          resources:
+            requests:
+              storage: 50Gi
+```
+
+**Key Features:**
+- ServiceMonitor auto-discovery for all workloads
+- 14-day retention, capped at 50GB (whichever limit is reached first)
+- PromPP image for enhanced performance
+- Alertmanager for routing alerts
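+
+As a rough illustration of the ServiceMonitor auto-discovery listed above, a workload could be registered for scraping along these lines. The `example-app` name, the `metrics` port name, and the `release` label are assumptions for illustration, not values from this repository; the label has to match whatever selector the Prometheus instance actually uses.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: example-app                      # hypothetical workload
+  labels:
+    release: kube-prometheus-stack       # assumed Prometheus selector label
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: example-app
+  endpoints:
+    - port: metrics                      # named port on the Service (assumed)
+      interval: 30s
+```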
+
+### Logs & Traces: ClickStack
+
+**Why ClickStack over Loki/Tempo:**
+- Single storage backend (ClickHouse) for both logs and traces
+- Excellent query performance on large datasets
+- Built-in correlation between logs and traces
+- Lower resource overhead than separate Loki + Tempo
+
+**Configuration:**
+- OTEL Collector receives all telemetry
+- Forwards to ClickStack's OTEL collector
+- Grafana datasources for querying
+
+### Telemetry Collection: OpenTelemetry
+
+**OpenTelemetry Operator:** Manages auto-instrumentation
+
+```yaml
+apiVersion: opentelemetry.io/v1alpha1
+kind: Instrumentation
+metadata:
+  name: auto-instrumentation
+spec:
+  python:
+    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
+  nodejs:
+    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
+```
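+
+With an `Instrumentation` resource like the one above in the same namespace, a workload opts in through a pod annotation. The sketch below assumes a hypothetical Python Deployment named `example-python-app` and a placeholder image; only the annotation key comes from the OpenTelemetry Operator.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: example-python-app               # hypothetical
+spec:
+  selector:
+    matchLabels:
+      app: example-python-app
+  template:
+    metadata:
+      labels:
+        app: example-python-app
+      annotations:
+        # "true" selects the Instrumentation resource in the pod's namespace
+        instrumentation.opentelemetry.io/inject-python: "true"
+    spec:
+      containers:
+        - name: app
+          image: registry.example.com/example-python-app:latest  # placeholder
+```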
+
+**OpenTelemetry Collector:** Central routing
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+      http:
+        endpoint: 0.0.0.0:4318
+
+exporters:
+  otlphttp:
+    endpoint: http://clickstack-otel-collector:4318
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      exporters: [otlphttp]
+    metrics:
+      receivers: [otlp]
+      exporters: [otlphttp]
+    logs:
+      receivers: [otlp]
+      exporters: [otlphttp]
+```
+
+### Visualization: Grafana
+
+**Grafana Operator:** Manages dashboards and datasources as CRDs
+
+```yaml
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: kubernetes-nodes
+spec:
+  instanceSelector:
+    matchLabels:
+      grafana.internal/instance: grafana
+  url: https://grafana.com/api/dashboards/15758/revisions/44/download
+```
+
+**Datasources:**
+| Type | Source | Purpose |
+|------|--------|---------|
+| Prometheus | prometheus-operated:9090 | Metrics |
+| ClickHouse | clickstack:8123 | Logs & Traces |
+| Alertmanager | alertmanager-operated:9093 | Alert status |
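+
+Because the Grafana Operator manages datasources as CRDs, the Prometheus row above could be declared roughly as follows. This is a sketch that reuses the instance label from the dashboard example; the resource name and the `isDefault` choice are assumptions.
+
+```yaml
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDatasource
+metadata:
+  name: prometheus                       # assumed resource name
+spec:
+  instanceSelector:
+    matchLabels:
+      grafana.internal/instance: grafana
+  datasource:
+    name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://prometheus-operated:9090
+    isDefault: true                      # assumption
+```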
+
+### Alerting Pipeline
+
+```
+Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
+                                      └─→ Email (future)
+```
+
+**Alert Categories:**
+- Infrastructure: Node down, disk full, OOM
+- Application: Error rate, latency SLO breach
+- Security: Gatekeeper violations, vulnerability findings
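+
+An infrastructure alert from the first category might be expressed as a `PrometheusRule` like the sketch below. The rule name, the `node-exporter` job label, and the 5-minute window are illustrative assumptions rather than the rules deployed in this cluster.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: infrastructure-alerts            # hypothetical
+spec:
+  groups:
+    - name: infrastructure
+      rules:
+        - alert: NodeDown
+          expr: up{job="node-exporter"} == 0   # job label is an assumption
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "Node exporter on {{ $labels.instance }} is unreachable"
+```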
+
+## Dashboards
+
+| Dashboard | Source | Purpose |
+|-----------|--------|---------|
+| Kubernetes Global | Grafana #15757 | Cluster overview |
+| Node Exporter | Grafana #1860 | Node metrics |
+| CNPG PostgreSQL | CNPG | Database health |
+| Flux | Flux Operator | GitOps status |
+| Cilium | Cilium | Network metrics |
+| Envoy Gateway | Envoy | Ingress metrics |
+
+## Resource Allocation
+
+| Component | CPU Request | Memory Limit |
+|-----------|-------------|--------------|
+| Prometheus | 100m | 2Gi |
+| OTEL Collector | 100m | 512Mi |
+| ClickStack | 500m | 2Gi |
+| Grafana | 100m | 256Mi |
+
+## Future Enhancements
+
+1. **Continuous Profiling** - Pyroscope for Go/Python profiling
+2. **SLO Tracking** - Sloth for SLI/SLO automation
+3. **Synthetic Monitoring** - Gatus for endpoint probing
+4. **Cost Attribution** - OpenCost for resource cost tracking
+
+## References
+
+* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
+* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
+* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
+* [Grafana Operator](https://grafana.github.io/grafana-operator/)
--- a/decisions/0026-storage-strategy.md
+++ b/decisions/0026-storage-strategy.md
@@ -0,0 +1,334 @@
+# Tiered Storage Strategy: Longhorn + NFS
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
+
+## Context and Problem Statement
+
+Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
+- Databases need fast, reliable storage with replication
+- Media libraries need large capacity but can tolerate slower access
+- AI/ML workloads need both - fast storage for models, large capacity for datasets
+
+The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
+
+How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
+
+## Decision Drivers
+
+* Performance - fast IOPS for databases and critical workloads
+* Capacity - large storage for media, datasets, and archives
+* Reliability - data must survive node failures
+* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
+* Backup capability - support for off-cluster backups
+* GitOps deployment - Helm charts with Flux management
+
+## Considered Options
+
+1. **Longhorn + NFS dual-tier storage**
+2. **Rook-Ceph for everything**
+3. **OpenEBS with Mayastor**
+4. **NFS only**
+5. **Longhorn only**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
+
+Two storage tiers optimized for different use cases:
+- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
+- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
+
+### Positive Consequences
+
+* Right-sized storage for each workload type
+* Longhorn provides HA with automatic replication
+* NFS provides massive capacity without consuming cluster disk space
+* ReadWriteMany (RWX) is straightforward on the NFS tier
+* Cost-effective - use existing NAS investment
+
+### Negative Consequences
+
+* Two storage systems to manage
+* NFS is slower (hence `nfs-slow` naming)
+* The NAS is a single point of failure (no replication)
+* Network dependency for both tiers
+
+## Architecture
+
+```
+┌────────────────────────────────────────────────────────────────────────────┐
+│                              TIER 1: LONGHORN                              │
+│                        (Fast Distributed Block Storage)                     │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                         │
+│  │   khelben   │  │   mystra    │  │   selune    │                         │
+│  │  (NVIDIA)   │  │   (AMD)     │  │   (AMD)     │                         │
+│  │             │  │             │  │             │                         │
+│  │ /var/mnt/   │  │ /var/mnt/   │  │ /var/mnt/   │                         │
+│  │  longhorn   │  │  longhorn   │  │  longhorn   │                         │
+│  │  (NVMe)     │  │  (SSD)      │  │  (SSD)      │                         │
+│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                         │
+│         │                │                │                                 │
+│         └────────────────┼────────────────┘                                 │
+│                          ▼                                                  │
+│              ┌───────────────────────┐                                      │
+│              │   Longhorn Manager    │                                      │
+│              │  (Schedules replicas) │                                      │
+│              └───────────┬───────────┘                                      │
+│                          ▼                                                  │
+│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
+│     │ Postgres │  │  Vault   │  │Prometheus│  │ClickHouse│                 │
+│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
+│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
+└────────────────────────────────────────────────────────────────────────────┘
+
+┌────────────────────────────────────────────────────────────────────────────┐
+│                              TIER 2: NFS-SLOW                              │
+│                        (High-Capacity Bulk Storage)                         │
+│                                                                            │
+│  ┌────────────────────────────────────────────────────────────────┐        │
+│  │                  candlekeep.lab.daviestechlabs.io              │        │
+│  │                        (External NAS)                           │        │
+│  │                                                                 │        │
+│  │   /kubernetes                                                   │        │
+│  │   ├── jellyfin-media/     (1TB+ media library)                 │        │
+│  │   ├── nextcloud/          (user files)                         │        │
+│  │   ├── immich/             (photo backups)                      │        │
+│  │   ├── kavita/             (ebooks, comics, manga)              │        │
+│  │   ├── mlflow-artifacts/   (model artifacts)                    │        │
+│  │   ├── ray-models/         (AI model weights)                   │        │
+│  │   └── gitea-runner/       (build caches)                       │        │
+│  └────────────────────────────────────────────────────────────────┘        │
+│                          │                                                  │
+│                          ▼                                                  │
+│              ┌───────────────────────┐                                      │
+│              │   NFS CSI Driver      │                                      │
+│              │  (csi-driver-nfs)     │                                      │
+│              └───────────┬───────────┘                                      │
+│                          ▼                                                  │
+│     ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
+│     │ Jellyfin │  │Nextcloud │  │  Immich  │  │  Kavita  │                 │
+│     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
+│     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
+└────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Tier 1: Longhorn Configuration
+
+### Helm Values
+
+```yaml
+persistence:
+  defaultClass: true
+  defaultClassReplicaCount: 2
+  defaultDataPath: /var/mnt/longhorn
+
+defaultSettings:
+  defaultDataPath: /var/mnt/longhorn
+  # Allow on vllm-tainted nodes
+  taintToleration: "dedicated=vllm:NoSchedule"
+  # Exclude Raspberry Pi nodes (ARM64)
+  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
+  # Snapshot retention
+  defaultRecurringJobs:
+    - name: nightly-snapshots
+      task: snapshot
+      cron: "0 2 * * *"
+      retain: 7
+    - name: weekly-backups
+      task: backup
+      cron: "0 3 * * 0"
+      retain: 4
+```
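+
+Recent Longhorn releases also model recurring jobs as `RecurringJob` custom resources. The nightly snapshot schedule above could be sketched that way as shown here, assuming the `longhorn.io/v1beta2` API, the `longhorn-system` namespace, and the built-in `default` group, none of which are confirmed by this repository.
+
+```yaml
+apiVersion: longhorn.io/v1beta2
+kind: RecurringJob
+metadata:
+  name: nightly-snapshots
+  namespace: longhorn-system             # assumed install namespace
+spec:
+  task: snapshot
+  cron: "0 2 * * *"                      # nightly at 02:00
+  retain: 7
+  concurrency: 1                         # assumption
+  groups:
+    - default                            # applies to volumes without an explicit job
+```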
+
+### Longhorn Storage Classes
+
+| StorageClass | Replicas | Use Case |
+|--------------|----------|----------|
+| `longhorn` (default) | 2 | General workloads, databases |
+| `longhorn-single` | 1 | Development/ephemeral |
+| `longhorn-strict` | 3 | Critical databases |
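+
+A three-replica class such as `longhorn-strict` could be defined with a StorageClass along these lines. This is a minimal sketch: the `Retain` reclaim policy and the `staleReplicaTimeout` value are assumptions, and only the replica count comes from the table above.
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: longhorn-strict
+provisioner: driver.longhorn.io
+allowVolumeExpansion: true
+reclaimPolicy: Retain                    # assumption: keep data for critical databases
+parameters:
+  numberOfReplicas: "3"
+  staleReplicaTimeout: "30"              # minutes; assumed value
+```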
+
+## Tier 2: NFS Configuration
+
+### Helm Values (csi-driver-nfs)
+
+```yaml
+storageClass:
+  create: true
+  name: nfs-slow
+  parameters:
+    server: candlekeep.lab.daviestechlabs.io
+    share: /kubernetes
+  mountOptions:
+    - nfsvers=4.1
+    - nconnect=16    # Multiple TCP connections for throughput
+    - hard           # Retry indefinitely on failure
+    - noatime        # Don't update access times (performance)
+  reclaimPolicy: Delete
+  volumeBindingMode: Immediate
+```
+
+### Why "nfs-slow"?
+
+The naming is intentional - it sets correct expectations:
+- **Latency:** NAS is over network, higher latency than local NVMe
+- **IOPS:** Spinning disks in NAS can't match SSD performance
+- **Throughput:** Adequate for streaming media, not for databases
+- **Benefit:** Massive capacity without consuming cluster disk space
+
+## Storage Tier Selection Guide
+
+| Workload Type | Storage Class | Rationale |
+|---------------|---------------|-----------|
+| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
+| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
+| Vault | `longhorn` | Security-critical, needs HA |
+| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
+| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
+| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
+| AI/ML models (Ray) | `nfs-slow` | Large model weights |
+| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
+| MLflow artifacts | `nfs-slow` | Model artifacts storage |
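+
+In practice, picking a tier comes down to the `storageClassName` on the claim. The sketch below requests shared capacity-tier storage for a media library; the claim name is hypothetical and the size mirrors the Jellyfin entry in the volume tables.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: jellyfin-media                   # hypothetical claim name
+spec:
+  accessModes:
+    - ReadWriteMany                      # RWX is available on the NFS tier
+  storageClassName: nfs-slow
+  resources:
+    requests:
+      storage: 2Ti
+```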
+
+## Volume Usage by Tier
+
+### Longhorn Volumes (Performance Tier)
+
+| Workload | Size | Replicas | Access Mode |
+|----------|------|----------|-------------|
+| Prometheus | 50Gi | 2 | RWO |
+| Vault | 2Gi | 2 | RWO |
+| ClickHouse | 100Gi | 2 | RWO |
+| Alertmanager | 1Gi | 2 | RWO |
+
+### NFS Volumes (Capacity Tier)
+
+| Workload | Size | Access Mode | Notes |
+|----------|------|-------------|-------|
+| Jellyfin | 2Ti | RWX | Media library |
+| Immich | 500Gi | RWX | Photo storage |
+| Nextcloud | 1Ti | RWX | User files |
+| Kavita | 200Gi | RWX | Ebooks, comics |
+| MLflow | 100Gi | RWX | Model artifacts |
+| Ray models | 200Gi | RWX | AI model weights |
+| Gitea runner | 50Gi | RWO | Build caches |
+| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
+
+## Backup Strategy
+
+### Longhorn Tier
+
+#### Local Snapshots
+
+- **Frequency:** Nightly at 2 AM
+- **Retention:** 7 days
+- **Purpose:** Quick recovery from accidental deletion
+
+#### Off-Cluster Backups
+
+- **Frequency:** Weekly on Sundays at 3 AM
+- **Destination:** S3-compatible storage (MinIO/Backblaze)
+- **Retention:** 4 weeks
+- **Purpose:** Disaster recovery
+
+### NFS Tier
+
+#### NAS-Level Backups
+
+- Handled by NAS backup solution (snapshots, replication)
+- Not managed by Kubernetes
+- Relies on the NAS RAID configuration for redundancy
+
+### Backup Target Configuration (Longhorn)
+
+```yaml
+# ExternalSecret for backup credentials
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: longhorn-backup-secret
+spec:
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: longhorn-backup-secret
+  data:
+    - secretKey: AWS_ACCESS_KEY_ID
+      remoteRef:
+        key: kv/data/longhorn
+        property: backup_access_key
+    - secretKey: AWS_SECRET_ACCESS_KEY
+      remoteRef:
+        key: kv/data/longhorn
+        property: backup_secret_key
+```
+
+## Node Exclusions (Longhorn Only)
+
+**Raspberry Pi nodes excluded because:**
+- Limited disk I/O performance
+- SD card wear concerns
+- Memory constraints for Longhorn components
+
+**GPU nodes included with tolerations:**
+- `khelben` (NVIDIA) participates in Longhorn storage
+- Taint toleration allows Longhorn to schedule there
+
+## Performance Considerations
+
+### Longhorn Performance
+
+- `khelben` has NVMe - fastest storage node
+- `mystra`/`selune` have SATA SSDs - adequate for most workloads
+- 2 replicas on different nodes let a volume survive a single node failure
+- Trade-off: 2x storage consumption
+
+### NFS Performance
+
+- Optimized with `nconnect=16` for parallel connections
+- `noatime` reduces unnecessary write operations
+- Sequential read workloads perform well (media streaming)
+- Random I/O workloads should use Longhorn instead
+
+### When to Choose Each Tier
+
+| Requirement | Longhorn | NFS-Slow |
+|-------------|----------|----------|
+| Low latency | ✅ | ❌ |
+| High IOPS | ✅ | ❌ |
+| Large capacity | ❌ | ✅ |
+| ReadWriteMany (RWX) | Limited | ✅ |
+| Node failure survival | ✅ | ✅ (data lives on the NAS, which is itself a single point of failure) |
+| Kubernetes-native | ✅ | ✅ |
+
+## Monitoring
+
+**Grafana Dashboard:** Longhorn dashboard for:
+- Volume health and replica status
+- IOPS and throughput per volume
+- Disk space utilization per node
+- Backup job status
+
+**Alerts:**
+- Volume degraded (replica count < desired)
+- Disk space low (< 20% free)
+- Backup job failed
+
+## Future Enhancements
+
+1. **NAS high availability** - Second NAS with replication
+2. **Dedicated storage network** - Separate VLAN for storage traffic
+3. **NVMe-oF** - Network NVMe for lower latency
+4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
+5. **S3 tier** - MinIO for object storage workloads
+
+## References
+
+* [Longhorn Documentation](https://longhorn.io/docs/)
+* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
+* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
+* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)
--- a/decisions/0027-database-strategy.md
+++ b/decisions/0027-database-strategy.md
@@ -0,0 +1,294 @@
+# Database Strategy with CloudNativePG
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Standardize PostgreSQL deployment for stateful applications
+
+## Context and Problem Statement
+
+Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.
+
+How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
+
+## Decision Drivers
+
+* Operational simplicity - single operator to learn and manage
+* High availability - automatic failover for critical databases
+* Backup integration - consistent backup strategy across all databases
+* GitOps compatibility - declarative database provisioning
+* Resource efficiency - don't over-provision for homelab scale
+
+## Considered Options
+
+1. **CloudNativePG for PostgreSQL**
+2. **Helm charts per application (Bitnami PostgreSQL)**
+3. **External managed database (RDS-style)**
+4. **SQLite where possible + single shared PostgreSQL**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - CloudNativePG for PostgreSQL**
+
+CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
+
+### Positive Consequences
+
+* Single operator manages all PostgreSQL instances
+* Declarative Cluster CRD for GitOps deployment
+* Automatic failover with minimal data loss
+* Built-in PgBouncer for connection pooling
+* Prometheus metrics and Grafana dashboards included
+* CNPG is a CNCF Sandbox project and actively maintained
+
+### Negative Consequences
+
+* PostgreSQL only (no MySQL/MariaDB support)
+* Operator adds resource overhead
+* Learning curve for CNPG-specific features
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                        CNPG Operator                             │
+│                     (cnpg-system namespace)                      │
+└────────────────────────────┬────────────────────────────────────┘
+                             │ Manages
+                             ▼
+┌──────────────────┬─────────────────┬─────────────────────────────┐
+│                  │                 │                             │
+▼                  ▼                 ▼                             ▼
+┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
+│  gitea-pg    │  │ authentik-db │  │companions-db │  │  mlflow-db   │
+│ (3 instances)│  │ (3 instances)│  │ (3 instances)│  │ (1 instance) │
+│              │  │              │  │              │  │              │
+│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
+│ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │
+│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │ └──────────┘ │
+│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
+│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │              │
+│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
+│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
+│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │              │
+│ │ PgBouncer│ │  │ │ PgBouncer│ │  │ │ PgBouncer│ │  │              │
+│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
+└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
+      │                  │                 │                │
+      └──────────────────┼─────────────────┼────────────────┘
+                         │                 │
+                   ┌─────▼─────┐     ┌─────▼─────┐
+                   │  Longhorn │     │ Longhorn  │
+                   │   PVCs    │     │  Backups  │
+                   └───────────┘     └───────────┘
+```
+
+## Cluster Configuration Template
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: app-db
+spec:
+  description: "Application PostgreSQL Cluster"
+  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
+  instances: 3
+
+  primaryUpdateStrategy: unsupervised
+
+  postgresql:
+    parameters:
+      shared_buffers: "256MB"
+      effective_cache_size: "768MB"
+      work_mem: "16MB"
+      max_connections: "200"
+      
+  # PgBouncer is not configured in the Cluster spec; CNPG deploys it
+  # through a separate Pooler resource that references this cluster.
+
+  # Storage on Longhorn
+  storage:
+    size: 10Gi
+    storageClass: longhorn
+
+  # Monitoring
+  monitoring:
+    enablePodMonitor: true
+    customQueriesConfigMap:
+      - name: cnpg-default-monitoring
+        key: queries
+
+  # Backup configuration
+  backup:
+    barmanObjectStore:
+      destinationPath: "s3://backups/postgres/"
+      s3Credentials:
+        accessKeyId:
+          name: postgres-backup-creds
+          key: ACCESS_KEY_ID
+        secretAccessKey:
+          name: postgres-backup-creds
+          key: SECRET_ACCESS_KEY
+    retentionPolicy: "7d"
+```
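+
+In CloudNativePG, connection pooling is declared through a `Pooler` resource that points at a cluster. The sketch below shows a read-write pooler for the `app-db` template; the instance count is an assumption, and the name is chosen to match the `<cluster>-pooler-rw` service convention used in this document.
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Pooler
+metadata:
+  name: app-db-pooler-rw                 # also becomes the Service name
+spec:
+  cluster:
+    name: app-db
+  instances: 2                           # assumption
+  type: rw
+  pgbouncer:
+    poolMode: transaction
+    parameters:
+      default_pool_size: "25"
+```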
+
+## Database Instances
+
+| Cluster | Instances | Storage | PgBouncer | Purpose |
+|---------|-----------|---------|-----------|---------|
+| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
+| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
+| `companions-db` | 3 | 10Gi | Yes | Chat app data |
+| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
+| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
+
+## Connection Patterns
+
+### Service Discovery
+
+CNPG creates services for each cluster:
+
+| Service | Purpose |
+|---------|---------|
+| `<cluster>-rw` | Read-write (primary only) |
+| `<cluster>-ro` | Read-only (any replica) |
+| `<cluster>-r` | Read (any instance) |
+| `<cluster>-pooler-rw` | PgBouncer read-write (Service named after the `Pooler` resource) |
+| `<cluster>-pooler-ro` | PgBouncer read-only (Service named after the `Pooler` resource) |
+
+### Application Configuration
+
+```yaml
+# Application config using CNPG service
+DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
+```
+
+### Credentials via External Secrets
+
+```yaml
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: app-db-credentials
+spec:
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: app-db-credentials
+  data:
+    - secretKey: username
+      remoteRef:
+        key: kv/data/app-db
+        property: username
+    - secretKey: password
+      remoteRef:
+        key: kv/data/app-db
+        property: password
+```
+
+## High Availability
+
+### Automatic Failover
+
+- CNPG monitors primary health continuously
+- If primary fails, automatic promotion of replica
+- Application reconnection via service abstraction
+- Typical failover time: 10-30 seconds
+
+### Replica Synchronization
+
+- Streaming replication from primary to replicas
+- Synchronous replication available for zero data loss (trade-off: latency)
+- Default: asynchronous with acceptable RPO
+
+## Backup Strategy
+
+### Continuous WAL Archiving
+
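+- Completed Write-Ahead Log (WAL) segments are archived to S3-compatible object storage
+- Point-in-time recovery capability
+- RPO: bounded by `archive_timeout` (CNPG defaults to 5 minutes), since only completed segments are archived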
+
+### Base Backups
+
+- **Frequency:** Daily
+- **Retention:** 7 days
+- **Destination:** S3-compatible (MinIO/Backblaze)
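+
+The daily base backup could be scheduled with a `ScheduledBackup` resource such as the sketch below. The resource name and the 04:00 run time are assumptions. Note that CNPG schedules use a six-field cron expression whose first field is seconds.
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: ScheduledBackup
+metadata:
+  name: app-db-daily                     # hypothetical
+spec:
+  schedule: "0 0 4 * * *"                # sec min hour day month weekday: daily at 04:00
+  backupOwnerReference: self
+  cluster:
+    name: app-db
+```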
+
+### Recovery Testing
+
+- Periodic restore to test cluster
+- Validate backup integrity
+- Document recovery procedure
+
+## Monitoring
+
+### Prometheus Metrics
+
+- Connection count and pool utilization
+- Transaction rate and latency
+- Replication lag
+- Disk usage and WAL generation
+
+### Grafana Dashboard
+
+CNPG provides official dashboard:
+- Cluster health overview
+- Per-instance metrics
+- Replication status
+- Backup job history
+
+### Alerts
+
+```yaml
+- alert: PostgreSQLDown
+  expr: cnpg_collector_up == 0
+  for: 5m
+  labels:
+    severity: critical
+
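+# Metric names below follow CNPG's default monitoring queries; verify them against the deployed version
+- alert: PostgreSQLReplicationLag
+  expr: cnpg_pg_replication_lag > 30
+  for: 5m
+  labels:
+    severity: warning
+
+- alert: PostgreSQLConnectionsHigh
+  expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8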
+  for: 5m
+  labels:
+    severity: warning
+```
+
+## When NOT to Use CloudNativePG
+
+| Scenario | Alternative |
+|----------|-------------|
+| Simple app, no HA needed | Embedded SQLite |
+| MySQL/MariaDB required | Application-specific chart |
+| Massive scale | External managed database |
+| Non-relational data | Redis/Valkey, MongoDB |
+
+## PostgreSQL Version Policy
+
+- Use latest stable major version (currently 17)
+- Minor version updates: rolled out automatically, including the primary, once the image tag is bumped (`primaryUpdateStrategy: unsupervised`)
+- Major version upgrades: manual with testing
+
+## Future Enhancements
+
+1. **Cross-cluster replication** - DR site replica
+2. **Logical replication** - Selective table sync between clusters
+3. **TimescaleDB extension** - Time-series optimization for metrics
+4. **pgvector extension** - Vector storage alternative to Milvus
+
+## References
+
+* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
+* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
+* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)
--- a/decisions/0028-authentik-sso-strategy.md
+++ b/decisions/0028-authentik-sso-strategy.md
@@ -0,0 +1,415 @@
+# Authentik Single Sign-On Strategy
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Centralize authentication across all homelab applications
+
+## Context and Problem Statement
+
+A growing homelab with many self-hosted applications creates authentication sprawl - each app has its own user database, passwords, and session management. This creates poor user experience and security risks.
+
+How do we centralize authentication while maintaining flexibility for different application requirements?
+
+## Decision Drivers
+
+* Single sign-on (SSO) for all applications
+* Centralized user management and lifecycle
+* MFA enforcement across all applications
+* Open-source and self-hosted
+* Low resource requirements for homelab scale
+
+## Considered Options
+
+1. **Authentik as OIDC/SAML provider**
+2. **Keycloak**
+3. **Authelia + LDAP**
+4. **Per-application local auth**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Authentik as OIDC/SAML provider**
+
+Authentik provides modern identity management with OIDC, SAML 2.0, LDAP, and SCIM support. Its flow-based authentication engine allows customizable login experiences.
+
+### Positive Consequences
+
+* Clean, modern UI for users and admins
+* Flexible flow-based authentication
+* Built-in MFA (TOTP, WebAuthn, SMS, Duo)
+* Proxy provider for legacy apps
+* SCIM for user provisioning
+* Active development and community
+
+### Negative Consequences
+
+* Python-based (higher memory than Go alternatives)
+* PostgreSQL dependency
+* Proxy, LDAP, and RADIUS providers depend on outposts, which can mean extra pods to run
+
+## Architecture
+
+```
+                                    ┌─────────────────────┐
+                                    │      User           │
+                                    └──────────┬──────────┘
+                                               │
+                                               ▼
+                                    ┌─────────────────────┐
+                                    │   Ingress/Traefik   │
+                                    └──────────┬──────────┘
+                                               │
+                    ┌──────────────────────────┼──────────────────────────┐
+                    │                          │                          │
+                    ▼                          ▼                          ▼
+         ┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
+         │  auth.lab.io    │       │  app.lab.io     │       │  app2.lab.io    │
+         │   (Authentik)   │       │  (OIDC-enabled) │       │ (Proxy-auth)    │
+         └─────────────────┘       └────────┬────────┘       └────────┬────────┘
+                    │                       │                         │
+                    │    ┌──────────────────┘                         │
+                    │    │ OIDC/OAuth2                                │
+                    │    │                                            │
+                    ▼    ▼                                            ▼
+         ┌─────────────────────────────────────────────────────────────────┐
+         │                          Authentik                              │
+         │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
+         │  │   Server    │  │   Worker    │  │   Outpost   │◄───────────┤
+         │  │   (API)     │  │  (Tasks)    │  │  (Proxy)    │ Forward    │
+         │  └──────┬──────┘  └──────┬──────┘  └─────────────┘ Auth       │
+         │         │                │                                     │
+         │         └────────┬───────┘                                     │
+         │                  │                                             │
+         │           ┌──────▼──────┐                                      │
+         │           │   Redis     │                                      │
+         │           │  (Cache)    │                                      │
+         │           └─────────────┘                                      │
+         │                                                                │
+         └─────────────────────────────┬──────────────────────────────────┘
+                                       │
+                                ┌──────▼──────┐
+                                │ PostgreSQL  │
+                                │ (CNPG)      │
+                                └─────────────┘
+```
+
+## Provider Configuration
+
+### OIDC Applications
+
+| Application | Provider Type | Claims Override | Notes |
+|-------------|---------------|-----------------|-------|
+| Gitea | OIDC | None | Admin mapping via group |
+| Affine | OIDC | `email_verified: true` | See ADR-0016 |
+| Companions | OIDC | None | Custom provider |
+| Grafana | OIDC | `role` claim | Admin role mapping |
+| ArgoCD | OIDC | `groups` claim | RBAC integration |
+| MLflow | Proxy | N/A | Forward auth |
+| Open WebUI | OIDC | None | LLM interface |
+
+### Provider Template
+
+```yaml
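+# Example OAuth2/OIDC provider (illustrative schema only, not a CRD shipped by
+# Authentik; the real configuration is applied through the UI, API, or a blueprint)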
+apiVersion: authentik.io/v1
+kind: OAuth2Provider
+metadata:
+  name: gitea
+spec:
+  name: Gitea
+  authorizationFlow: default-authorization-flow
+  clientId: ${GITEA_CLIENT_ID}
+  clientSecret: ${GITEA_CLIENT_SECRET}
+  redirectUris:
+    - https://git.lab.daviestechlabs.io/user/oauth2/authentik/callback
+  signingKey: authentik-self-signed
+  propertyMappings:
+    - authentik-default-openid
+    - authentik-default-email
+    - authentik-default-profile
+```
+
+## Authentication Flows
+
+### Default Login Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│   Login     │────▶│  Username   │────▶│   Password  │────▶│    MFA      │
+│   Stage     │     │   Stage     │     │   Stage     │     │   Stage     │
+└─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
+                                                                   │
+                                                                   ▼
+                                                           ┌─────────────┐
+                                                           │   Session   │
+                                                           │   Created   │
+                                                           └─────────────┘
+```
+
+### Flow Customization
+
+- **Admin users:** Require hardware key (WebAuthn)
+- **API access:** Service account tokens
+- **External users:** Email verification + MFA enrollment
+
+## Group-Based Authorization
+
+### Group Structure
+
+```
+authentik-admins        → Authentik admin access
+├── cluster-admins      → Full cluster access
+├── gitea-admins        → Git admin
+├── monitoring-admins   → Grafana admin
+└── ai-platform-admins  → AI/ML admin
+
+authentik-users         → Standard user access
+├── developers          → Git write, monitoring read
+├── ml-engineers        → AI/ML services access
+└── guests              → Read-only access
+```
+
+### Application Group Mapping
+
+```yaml
+# Grafana OIDC config
+grafana:
+  auth.generic_oauth:
+    role_attribute_path: |
+      contains(groups[*], 'monitoring-admins') && 'Admin' ||
+      contains(groups[*], 'developers') && 'Editor' ||
+      'Viewer'
+```
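+
+For context, a hedged sketch of the surrounding `auth.generic_oauth` settings, written as Helm-style `grafana.ini` values. The environment variable names are assumptions, and the endpoint paths are Authentik's standard OAuth2 endpoints on the host used elsewhere in this ADR. The `groups` claim is expected to arrive via the `profile` scope.
+
+```yaml
+grafana.ini:
+  auth.generic_oauth:
+    enabled: true
+    name: Authentik
+    # Grafana expands $__env{...} at startup; variable names are hypothetical
+    client_id: $__env{GRAFANA_OIDC_CLIENT_ID}
+    client_secret: $__env{GRAFANA_OIDC_CLIENT_SECRET}
+    scopes: openid profile email
+    auth_url: https://auth.lab.daviestechlabs.io/application/o/authorize/
+    token_url: https://auth.lab.daviestechlabs.io/application/o/token/
+    api_url: https://auth.lab.daviestechlabs.io/application/o/userinfo/
+    # First matching group wins; everyone else becomes a Viewer
+    role_attribute_path: |
+      contains(groups[*], 'monitoring-admins') && 'Admin' ||
+      contains(groups[*], 'developers') && 'Editor' ||
+      'Viewer'
+```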
+
+## Outpost Deployment
+
+### Embedded Outpost (Default)
+
+- Runs within the Authentik server process
+- Handles proxy providers only; LDAP and RADIUS need their own outposts
+- Suitable for small deployments
+
+### Standalone Outpost (Proxy)
+
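+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: authentik-outpost-proxy
+  namespace: authentik
+spec:
+  replicas: 2
+  # apps/v1 requires a selector that matches the pod template labels
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: authentik-outpost-proxy
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: authentik-outpost-proxy
+    spec:
+      containers:
+        - name: outpost
+          # Pin a tag that matches the Authentik server version
+          image: ghcr.io/goauthentik/proxy
+          ports:
+            - containerPort: 9000
+              name: http
+            - containerPort: 9443
+              name: https
+          env:
+            # Base URL of the Authentik server the outpost reports to
+            - name: AUTHENTIK_HOST
+              value: "https://auth.lab.daviestechlabs.io/"
+            # Outpost token, read from a Secret rather than inlined
+            - name: AUTHENTIK_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-outpost-token
+                  key: token
+```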
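+
+The Traefik middleware in this ADR reaches the outpost at `authentik-outpost-proxy.authentik.svc:9000`, which implies a Service in front of these pods. A minimal sketch follows; the selector label is an assumption and must match whatever labels the outpost pods actually carry.
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: authentik-outpost-proxy
+  namespace: authentik
+spec:
+  selector:
+    app.kubernetes.io/name: authentik-outpost-proxy   # assumed pod label
+  ports:
+    # targetPort refers to the named container ports of the outpost
+    - name: http
+      port: 9000
+      targetPort: http
+    - name: https
+      port: 9443
+      targetPort: https
+```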
+
+### Forward Auth Integration
+
+For applications without OIDC support:
+
+```yaml
+# Traefik ForwardAuth middleware
+apiVersion: traefik.io/v1alpha1
+kind: Middleware
+metadata:
+  name: authentik-forward-auth
+spec:
+  forwardAuth:
+    address: http://authentik-outpost-proxy.authentik.svc:9000/outpost.goauthentik.io/auth/traefik
+    trustForwardHeader: true
+    authResponseHeaders:
+      - X-authentik-username
+      - X-authentik-groups
+      - X-authentik-email
+```
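+
+A sketch of how an application route might attach this middleware, using MLflow as the example since it is listed as a proxy-auth app. The hostname, entry point, namespaces, Service name, and port are assumptions. Referencing a middleware from another namespace also assumes Traefik's cross-namespace references are enabled.
+
+```yaml
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: mlflow
+  namespace: mlflow                     # hypothetical namespace
+spec:
+  entryPoints:
+    - websecure                         # assumed entry point name
+  routes:
+    # Application traffic passes through forward auth first
+    - kind: Rule
+      match: Host(`mlflow.lab.daviestechlabs.io`)
+      middlewares:
+        - name: authentik-forward-auth
+          namespace: authentik          # assumed middleware namespace
+      services:
+        - name: mlflow                  # hypothetical Service
+          port: 5000                    # hypothetical port
+    # Outpost callback and sign-out paths go straight to the outpost
+    - kind: Rule
+      match: Host(`mlflow.lab.daviestechlabs.io`) && PathPrefix(`/outpost.goauthentik.io/`)
+      priority: 15
+      services:
+        - name: authentik-outpost-proxy
+          namespace: authentik
+          port: 9000
+```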
+
+## MFA Enforcement
+
+### Policies
+
+| User Group | MFA Requirement |
+|------------|-----------------|
+| Admins | WebAuthn (hardware key) required |
+| Developers | TOTP or WebAuthn required |
+| Guests | MFA optional |
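+
+One way the admin row could be expressed is an authenticator validation stage that accepts only WebAuthn devices, managed as a blueprint. This is a sketch under assumptions: the blueprint and stage names are hypothetical, and restricting the stage to admins would still need a policy binding on the flow, which is not shown.
+
+```yaml
+version: 1
+metadata:
+  name: Admin WebAuthn validation       # hypothetical blueprint name
+entries:
+  - model: authentik_stages_authenticator_validate.authenticatorvalidatestage
+    identifiers:
+      name: admin-webauthn-validation   # hypothetical stage name
+    attrs:
+      # Accept hardware keys only; TOTP and other device classes are rejected
+      device_classes:
+        - webauthn
+      # Deny login when the user has no matching device enrolled
+      not_configured_action: deny
+```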
+
+### Device Registration
+
+- Self-service MFA enrollment
+- Recovery codes generated at setup
+- Admin can reset user MFA
+
+## SCIM User Provisioning
+
+### When to Use
+
+- Automatic user creation in downstream apps
+- Group membership sync
+- User deprovisioning on termination
+
+### Supported Apps
+
+No applications use SCIM provisioning yet; users are created manually in each app.
+
+Future consideration for:
+- Gitea organization sync
+- Grafana team sync
+
+## Secrets Management Integration
+
+### Vault Integration
+
+```yaml
+# External Secret for Authentik DB credentials
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: authentik-db-credentials
+  namespace: authentik
+spec:
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: authentik-db-credentials
+  data:
+    - secretKey: password
+      remoteRef:
+        key: kv/data/authentik
+        property: db_password
+    - secretKey: secret_key
+      remoteRef:
+        key: kv/data/authentik
+        property: secret_key
+```
+
+## Monitoring
+
+### Prometheus Metrics
+
+Authentik exposes metrics at `/metrics`:
+
+- `authentik_login_duration_seconds`
+- `authentik_login_attempt_total`
+- `authentik_outpost_connected`
+- `authentik_provider_authorization_total`
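+
+Assuming Prometheus is run by the Prometheus Operator, scraping could be declared with a `ServiceMonitor` along these lines. The Service label and port name are assumptions and need to match the actual Authentik Service.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: authentik
+  namespace: authentik
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: authentik   # assumed Service label
+  endpoints:
+    - port: metrics                       # assumed Service port name
+      path: /metrics
+      interval: 30s
+```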
+
+### Grafana Dashboard
+
+- Login success/failure rates
+- Active sessions
+- Provider usage
+- MFA adoption rates
+
+### Alerts
+
+```yaml
+# rate() is per second, so this fires at a sustained 10+ failed logins per second
+- alert: AuthentikHighLoginFailures
+  expr: rate(authentik_login_attempt_total{result="failure"}[5m]) > 10
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: High login failure rate detected
+
+- alert: AuthentikOutpostDisconnected
+  expr: authentik_outpost_connected == 0
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    summary: Authentik outpost has been disconnected for 5 minutes
+```
+
+## Backup and Recovery
+
+### What to Backup
+
+1. PostgreSQL database (via CNPG)
+2. Media files (only if custom branding is in use)
+3. Blueprint exports (configuration as code)
+
+### Blueprints
+
+Export configuration as YAML for GitOps:
+
+```yaml
+# authentik-blueprints/providers/gitea.yaml
+version: 1
+metadata:
+  name: Gitea OIDC Provider
+entries:
+  - model: authentik_providers_oauth2.oauth2provider
+    identifiers:
+      name: gitea
+    attrs:
+      authorization_flow: !Find [authentik_flows.flow, [slug, default-authorization-flow]]
+      # ... provider config
+```
+
+## Integration Patterns
+
+### Pattern 1: Native OIDC
+
+Best for: Modern applications with OIDC support
+
+```
+App ──OIDC──▶ Authentik ──▶ App (with user info)
+```
+
+### Pattern 2: Proxy Forward Auth
+
+Best for: Legacy apps without SSO support
+
+```
+Request ──▶ Traefik ──ForwardAuth──▶ Authentik Outpost
+                │                            │
+                │◀──────Header injection─────┘
+                │
+                ▼
+              App (reads X-authentik-* headers)
+```
+
+### Pattern 3: LDAP Compatibility
+
+Best for: Apps requiring LDAP
+
+```
+App ──LDAP──▶ Authentik Outpost (LDAP) ──▶ Authentik
+```
+
+## Resource Requirements
+
+| Component | CPU Request | Memory Request |
+|-----------|-------------|----------------|
+| Server | 100m | 500Mi |
+| Worker | 100m | 500Mi |
+| Redis | 50m | 128Mi |
+| Outpost (each) | 50m | 128Mi |
+
+## Future Enhancements
+
+1. **Passkey/FIDO2** - Passwordless authentication
+2. **External IdP federation** - Google/GitHub as upstream IdP
+3. **Conditional access** - Device trust, network location policies
+4. **Session revocation** - Force logout from all apps
+
+## References
+
+* [Authentik Documentation](https://goauthentik.io/docs/)
+* [Authentik GitHub](https://github.com/goauthentik/authentik)
+* [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)