docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
2026-02-04 08:55:15 -05:00
parent a128c265e4
commit b43c80153c
4 changed files with 1282 additions and 0 deletions


@@ -0,0 +1,239 @@
# Observability Stack Architecture
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
## Context and Problem Statement
A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
## Decision Drivers
* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled
## Considered Options
1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
## Decision Outcome
Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
### Positive Consequences
* Prometheus ecosystem is mature with extensive service monitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)
### Negative Consequences
* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain
## Architecture
```
┌─────────────────────────────────────────────┐
│                Applications                 │
│  Go / Python / Node.js / Java (OTEL SDKs)   │
└──────────────────────┬──────────────────────┘
                       │ OTLP (gRPC/HTTP)
                       ▼
            ┌─────────────────────┐
            │   OpenTelemetry     │
            │     Collector       │
            │ (traces, metrics,   │
            │       logs)         │
            └──────────┬──────────┘
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  ClickStack │ │ Prometheus  │ │   Grafana   │
│(ClickHouse) │ │   Metrics   │ │ Dashboards  │
│ Logs+Traces │ │   Storage   │ │  Alerting   │
└─────────────┘ └──────┬──────┘ └─────────────┘
                       │
                       ▼
               ┌──────────────┐   ┌──────────┐
               │ Alertmanager │──▶│   ntfy   │
               │              │   │  (push)  │
               └──────────────┘   └──────────┘
```
## Component Details
### Metrics: Prometheus + kube-prometheus-stack
**Deployment:** HelmRelease via Flux
```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```
**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- AlertManager for routing alerts
### Logs & Traces: ClickStack
**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo
**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying
### Telemetry Collection: OpenTelemetry
**OpenTelemetry Operator:** Manages auto-instrumentation
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
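Workloads opt in per pod; a minimal sketch of the annotation the operator watches for (the Deployment name and image below are placeholders, not from the repo):

```yaml
# Hypothetical workload opting into Python auto-instrumentation.
# Assumes the Instrumentation resource above exists in the same namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Triggers sidecar-free SDK injection via an init container
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: example/app:latest  # placeholder image
```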
**OpenTelemetry Collector:** Central routing
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
### Visualization: Grafana
**Grafana Operator:** Manages dashboards and datasources as CRDs
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```
**Datasources:**
| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
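With the Grafana Operator in play, the datasources can also be declared as CRDs; a sketch for the Prometheus entry, reusing the instance selector from the dashboard example (the exact `spec.datasource` values here are assumptions, not from the repo):

```yaml
# Hypothetical GrafanaDatasource; auth and jsonData fields omitted.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
```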
### Alerting Pipeline
```
Prometheus Rules → Alertmanager ─┬─→ ntfy → Discord/Mobile
                                 └─→ Email (future)
```
**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
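A minimal Alertmanager routing sketch for the pipeline above. It assumes an ntfy-alertmanager bridge service, since ntfy does not natively accept Alertmanager's webhook payload; the service name and port are placeholders:

```yaml
# Alertmanager config fragment (sketch). The webhook URL points at a
# hypothetical ntfy-alertmanager bridge, not at ntfy directly.
route:
  receiver: ntfy
  group_by: ["alertname", "namespace"]
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy-alertmanager.observability.svc:8080
```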
## Dashboards
| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |
## Resource Allocation
| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
## Future Enhancements
1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking
## References
* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)


@@ -0,0 +1,334 @@
# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) easy on NFS tier
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* NFS is slower (hence the `nfs-slow` name)
* NFS is a single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
TIER 1: LONGHORN (fast distributed block storage)

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   khelben   │  │   mystra    │  │   selune    │
│  (NVIDIA)   │  │    (AMD)    │  │    (AMD)    │
│ /var/mnt/   │  │ /var/mnt/   │  │ /var/mnt/   │
│  longhorn   │  │  longhorn   │  │  longhorn   │
│   (NVMe)    │  │    (SSD)    │  │    (SSD)    │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       └────────────────┼────────────────┘
                        ▼
            ┌───────────────────────┐
            │    Longhorn Manager   │
            │  (schedules replicas) │
            └───────────┬───────────┘
                        ▼
  Postgres PVC   Vault PVC   Prometheus PVC   ClickHouse PVC

TIER 2: NFS-SLOW (high-capacity bulk storage)

candlekeep.lab.daviestechlabs.io (external NAS)
/kubernetes
├── jellyfin-media/    (1TB+ media library)
├── nextcloud/         (user files)
├── immich/            (photo backups)
├── kavita/            (ebooks, comics, manga)
├── mlflow-artifacts/  (model artifacts)
├── ray-models/        (AI model weights)
└── gitea-runner/      (build caches)
         │
         ▼
┌─────────────────┐
│  NFS CSI Driver │
│ (csi-driver-nfs)│
└────────┬────────┘
         ▼
  Jellyfin PVC   Nextcloud PVC   Immich PVC   Kavita PVC
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
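A sketch of how the non-default `longhorn-strict` class could be declared; the parameter names follow the Longhorn CSI driver, while the stale-replica timeout is an illustrative assumption:

```yaml
# Sketch: 3-replica class for critical databases.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"  # minutes; illustrative value
```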
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
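Consuming the tier is a plain PVC; a sketch for a media claim (the claim name and namespace are hypothetical):

```yaml
# Hypothetical RWX claim on the capacity tier.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
  namespace: jellyfin
spec:
  accessModes:
    - ReadWriteMany   # RWX is cheap on NFS, unlike block storage
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```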
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
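Pointing Longhorn at the backup target is then a matter of two settings referencing that secret; a values sketch in which the bucket name and region are placeholders:

```yaml
# Longhorn Helm values fragment (sketch; bucket/region are placeholders).
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```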
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Node failure survival | ✅ | ✅ (external NAS; the NAS itself is a SPOF) |
| Kubernetes-native | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
## Future Enhancements
1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)


@@ -0,0 +1,294 @@
# Database Strategy with CloudNativePG
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications
## Context and Problem Statement
Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.
How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
## Decision Drivers
* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale
## Considered Options
1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**
## Decision Outcome
Chosen option: **Option 1 - CloudNativePG for PostgreSQL**
CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
### Positive Consequences
* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* Built-in PgBouncer for connection pooling
* Prometheus metrics and Grafana dashboards included
* CNPG is CNCF-listed and actively maintained
### Negative Consequences
* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features
## Architecture
```
         ┌───────────────────────────────────────────┐
         │        CNPG Operator (cnpg-system)        │
         └─────────────────────┬─────────────────────┘
                               │ manages
       ┌───────────────┬───────┴───────┬───────────────┐
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  gitea-pg   │ │authentik-db │ │companions-db│ │  mlflow-db  │
│ 3 replicas  │ │ 3 replicas  │ │ 3 replicas  │ │  1 replica  │
│ + PgBouncer │ │ + PgBouncer │ │ + PgBouncer │ │             │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       └───────────────┴───────┬───────┴───────────────┘
                               ▼
              Longhorn PVCs + S3/Longhorn backups
```
## Cluster Configuration Template
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
  # PgBouncer connection pooling is configured via a separate Pooler
  # resource, not inline in the Cluster spec
  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn
  # Monitoring (creates a PodMonitor for Prometheus)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries
  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```
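CNPG implements PgBouncer as a separate `Pooler` resource targeting a cluster; a sketch matched to the template above, which is what materializes the `app-db-pooler-rw` service referenced later:

```yaml
# PgBouncer pooler in front of the app-db cluster (sketch).
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  instances: 1
  type: rw                 # pool against the primary
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```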
## Database Instances
| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
## Connection Patterns
### Service Discovery
CNPG creates services for each cluster:
| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |
### Application Configuration
```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```
### Credentials via External Secrets
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```
## High Availability
### Automatic Failover
- CNPG monitors primary health continuously
- If primary fails, automatic promotion of replica
- Application reconnection via service abstraction
- Typical failover time: 10-30 seconds
### Replica Synchronization
- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous with acceptable RPO
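Synchronous replication can be opted into per cluster; a fragment of the Cluster spec (the field names are CNPG's, the values illustrative):

```yaml
# Cluster spec fragment (sketch): require one synchronous standby.
# Writes wait for the sync replica, trading commit latency for zero data loss.
spec:
  minSyncReplicas: 1
  maxSyncReplicas: 1
```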
## Backup Strategy
### Continuous WAL Archiving
- Write-Ahead Log streamed to S3
- Point-in-time recovery capability
- RPO: seconds (last WAL segment)
### Base Backups
- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
### Recovery Testing
- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
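A restore test can be expressed as a throwaway cluster bootstrapped from the object store; a sketch reusing the backup destination from the template above (the restore cluster's name is hypothetical):

```yaml
# Hypothetical restore-test cluster recovered from barman backups.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: app-db          # refers to the external cluster below
  externalClusters:
    - name: app-db
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```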
## Monitoring
### Prometheus Metrics
- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation
### Grafana Dashboard
CNPG provides official dashboard:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history
### Alerts
```yaml
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical
- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag_seconds > 30
  for: 5m
  labels:
    severity: warning
- alert: PostgreSQLConnectionsHigh
  expr: cnpg_pg_stat_activity_count / cnpg_pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: warning
```
## When NOT to Use CloudNativePG
| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |
## PostgreSQL Version Policy
- Use latest stable major version (currently 17)
- Minor version updates: automatic (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual with testing
## Future Enhancements
1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **PgVector extension** - Vector storage alternative to Milvus
## References
* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)


@@ -0,0 +1,415 @@
# Authentik Single Sign-On Strategy
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Centralize authentication across all homelab applications
## Context and Problem Statement
A growing homelab with many self-hosted applications creates authentication sprawl - each app has its own user database, passwords, and session management. This creates poor user experience and security risks.
How do we centralize authentication while maintaining flexibility for different application requirements?
## Decision Drivers
* Single sign-on (SSO) for all applications
* Centralized user management and lifecycle
* MFA enforcement across all applications
* Open-source and self-hosted
* Low resource requirements for homelab scale
## Considered Options
1. **Authentik as OIDC/SAML provider**
2. **Keycloak**
3. **Authelia + LDAP**
4. **Per-application local auth**
## Decision Outcome
Chosen option: **Option 1 - Authentik as OIDC/SAML provider**
Authentik provides modern identity management with OIDC, SAML 2.0, LDAP, and SCIM support. Its flow-based authentication engine allows customizable login experiences.
### Positive Consequences
* Clean, modern UI for users and admins
* Flexible flow-based authentication
* Built-in MFA (TOTP, WebAuthn, SMS, Duo)
* Proxy provider for legacy apps
* SCIM for user provisioning
* Active development and community
### Negative Consequences
* Python-based (higher memory than Go alternatives)
* PostgreSQL dependency
* Proxy, LDAP, and RADIUS providers require additional outpost pods
## Architecture
```
                        ┌──────────┐
                        │   User   │
                        └────┬─────┘
                             ▼
                  ┌─────────────────────┐
                  │   Ingress/Traefik   │
                  └──────────┬──────────┘
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   auth.lab.io   │ │   app.lab.io    │ │   app2.lab.io   │
│   (Authentik)   │ │ (OIDC-enabled)  │ │  (Proxy-auth)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │              OIDC/OAuth2         Forward Auth
         └──────────┬─────────┴───────────────────┘
                    ▼
┌──────────────────────────────────────────────────┐
│                    Authentik                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│  │  Server  │  │  Worker  │  │   Outpost    │    │
│  │  (API)   │  │ (Tasks)  │  │   (Proxy)    │    │
│  └────┬─────┘  └────┬─────┘  └──────────────┘    │
│       └──────┬──────┘                            │
│              ▼                                   │
│       ┌─────────────┐                            │
│       │Redis (Cache)│                            │
│       └─────────────┘                            │
└─────────────────────────┬────────────────────────┘
                          ▼
                  ┌───────────────┐
                  │  PostgreSQL   │
                  │    (CNPG)     │
                  └───────────────┘
```
## Provider Configuration
### OIDC Applications
| Application | Provider Type | Claims Override | Notes |
|-------------|---------------|-----------------|-------|
| Gitea | OIDC | None | Admin mapping via group |
| Affine | OIDC | `email_verified: true` | See ADR-0016 |
| Companions | OIDC | None | Custom provider |
| Grafana | OIDC | `role` claim | Admin role mapping |
| ArgoCD | OIDC | `groups` claim | RBAC integration |
| MLflow | Proxy | N/A | Forward auth |
| Open WebUI | OIDC | None | LLM interface |
### Provider Template
```yaml
# Example OAuth2/OIDC Provider (illustrative pseudo-manifest; Authentik
# providers are configured via blueprints or the admin UI, not a CRD)
apiVersion: authentik.io/v1
kind: OAuth2Provider
metadata:
  name: gitea
spec:
  name: Gitea
  authorizationFlow: default-authorization-flow
  clientId: ${GITEA_CLIENT_ID}
  clientSecret: ${GITEA_CLIENT_SECRET}
  redirectUris:
    - https://git.lab.daviestechlabs.io/user/oauth2/authentik/callback
  signingKey: authentik-self-signed
  propertyMappings:
    - authentik-default-openid
    - authentik-default-email
    - authentik-default-profile
```
## Authentication Flows
### Default Login Flow
```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Login   │──▶│ Username │──▶│ Password │──▶│   MFA    │
│  Stage   │   │  Stage   │   │  Stage   │   │  Stage   │
└──────────┘   └──────────┘   └──────────┘   └─────┬────┘
                                                   ▼
                                             ┌──────────┐
                                             │ Session  │
                                             │ Created  │
                                             └──────────┘
```
### Flow Customization
- **Admin users:** Require hardware key (WebAuthn)
- **API access:** Service account tokens
- **External users:** Email verification + MFA enrollment
## Group-Based Authorization
### Group Structure
```
authentik-admins        → Authentik admin access
├── cluster-admins      → Full cluster access
├── gitea-admins        → Git admin
├── monitoring-admins   → Grafana admin
└── ai-platform-admins  → AI/ML admin

authentik-users         → Standard user access
├── developers          → Git write, monitoring read
├── ml-engineers        → AI/ML services access
└── guests              → Read-only access
```
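Groups can also be captured as blueprints so they survive a rebuild; a sketch creating one group from the tree above (the blueprint model name follows Authentik's blueprint schema, and the file layout is an assumption):

```yaml
# Hypothetical blueprint declaring one of the groups above.
version: 1
metadata:
  name: Homelab Groups
entries:
  - model: authentik_core.group
    identifiers:
      name: monitoring-admins
    attrs:
      is_superuser: false
```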
### Application Group Mapping
```yaml
# Grafana OIDC config
grafana:
  auth.generic_oauth:
    role_attribute_path: |
      contains(groups[*], 'monitoring-admins') && 'Admin' ||
      contains(groups[*], 'developers') && 'Editor' ||
      'Viewer'
```
## Outpost Deployment
### Embedded Outpost (Default)
- Runs within Authentik server
- Handles LDAP and RADIUS
- Suitable for small deployments
### Standalone Outpost (Proxy)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: authentik-outpost-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: authentik-outpost-proxy
  template:
    metadata:
      labels:
        app: authentik-outpost-proxy
    spec:
      containers:
        - name: outpost
          image: ghcr.io/goauthentik/proxy
          ports:
            - containerPort: 9000
              name: http
            - containerPort: 9443
              name: https
          env:
            - name: AUTHENTIK_HOST
              value: "https://auth.lab.daviestechlabs.io/"
            - name: AUTHENTIK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: authentik-outpost-token
                  key: token
```
### Forward Auth Integration
For applications without OIDC support:
```yaml
# Traefik ForwardAuth middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
spec:
  forwardAuth:
    address: http://authentik-outpost-proxy.authentik.svc:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
      - X-authentik-groups
      - X-authentik-email
```
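Attaching the middleware to a legacy app's route; a sketch for MLflow, which the provider table lists under forward auth (hostname and service port are assumptions):

```yaml
# Hypothetical route protected by the forward-auth middleware above.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: mlflow
  namespace: mlflow
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`mlflow.lab.daviestechlabs.io`)
      kind: Rule
      middlewares:
        - name: authentik-forward-auth
          namespace: authentik
      services:
        - name: mlflow
          port: 5000   # placeholder port
```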
## MFA Enforcement
### Policies
| User Group | MFA Requirement |
|------------|-----------------|
| Admins | WebAuthn (hardware key) required |
| Developers | TOTP or WebAuthn required |
| Guests | MFA optional |
### Device Registration
- Self-service MFA enrollment
- Recovery codes generated at setup
- Admin can reset user MFA
## SCIM User Provisioning
### When to Use
- Automatic user creation in downstream apps
- Group membership sync
- User deprovisioning on termination
### Supported Apps
Currently using SCIM provisioning for:
- None (manual user creation in apps)
Future consideration for:
- Gitea organization sync
- Grafana team sync
## Secrets Management Integration
### Vault Integration
```yaml
# External Secret for Authentik DB credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: authentik-db-credentials
  namespace: authentik
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: authentik-db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: kv/data/authentik
        property: db_password
    - secretKey: secret_key
      remoteRef:
        key: kv/data/authentik
        property: secret_key
```
## Monitoring
### Prometheus Metrics
Authentik exposes metrics at `/metrics`:
- `authentik_login_duration_seconds`
- `authentik_login_attempt_total`
- `authentik_outpost_connected`
- `authentik_provider_authorization_total`
### Grafana Dashboard
- Login success/failure rates
- Active sessions
- Provider usage
- MFA adoption rates
### Alerts
```yaml
- alert: AuthentikHighLoginFailures
  expr: rate(authentik_login_attempt_total{result="failure"}[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: High login failure rate detected
- alert: AuthentikOutpostDisconnected
  expr: authentik_outpost_connected == 0
  for: 5m
  labels:
    severity: critical
```
## Backup and Recovery
### What to Backup
1. PostgreSQL database (via CNPG)
2. Media files (if custom branding)
3. Blueprint exports (configuration as code)
### Blueprints
Export configuration as YAML for GitOps:
```yaml
# authentik-blueprints/providers/gitea.yaml
version: 1
metadata:
  name: Gitea OIDC Provider
entries:
  - model: authentik_providers_oauth2.oauth2provider
    identifiers:
      name: gitea
    attrs:
      authorization_flow: !Find [authentik_flows.flow, [slug, default-authorization-flow]]
      # ... provider config
```
## Integration Patterns
### Pattern 1: Native OIDC
Best for: Modern applications with OIDC support
```
App ──OIDC──▶ Authentik ──▶ App (with user info)
```
### Pattern 2: Proxy Forward Auth
Best for: Legacy apps without SSO support
```
Request ──▶ Traefik ──ForwardAuth──▶ Authentik Outpost
               │                            │
               │◀───── Header injection ────┘
               ▼
              App (reads X-authentik-* headers)
```
### Pattern 3: LDAP Compatibility
Best for: Apps requiring LDAP
```
App ──LDAP──▶ Authentik Outpost (LDAP) ──▶ Authentik
```
## Resource Requirements
| Component | CPU Request | Memory Request |
|-----------|-------------|----------------|
| Server | 100m | 500Mi |
| Worker | 100m | 500Mi |
| Redis | 50m | 128Mi |
| Outpost (each) | 50m | 128Mi |
## Future Enhancements
1. **Passkey/FIDO2** - Passwordless authentication
2. **External IdP federation** - Google/GitHub as upstream IdP
3. **Conditional access** - Device trust, network location policies
4. **Session revocation** - Force logout from all apps
## References
* [Authentik Documentation](https://goauthentik.io/docs/)
* [Authentik GitHub](https://github.com/goauthentik/authentik)
* [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)