diff --git a/decisions/0025-observability-stack.md b/decisions/0025-observability-stack.md
new file mode 100644
index 0000000..4b927c0
--- /dev/null
+++ b/decisions/0025-observability-stack.md
@@ -0,0 +1,239 @@
+# Observability Stack Architecture
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
+
+## Context and Problem Statement
+
+A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
+
+How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
+
+## Decision Drivers
+
+* Three pillars coverage - metrics, logs, and traces all addressed
+* Unified visualization - single pane of glass for all telemetry
+* Resource efficiency - don't overwhelm the cluster with observability overhead
+* OpenTelemetry compatibility - future-proof instrumentation standard
+* GitOps deployment - all configuration version-controlled
+
+## Considered Options
+
+1. **Prometheus + ClickStack + OpenTelemetry Collector**
+2. **Prometheus + Loki + Tempo (PLT Stack)**
+3. **Datadog/New Relic (SaaS)**
+4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
+
+Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
+ +### Positive Consequences + +* Prometheus ecosystem is mature with extensive service monitor support +* ClickHouse provides fast querying for logs and traces at scale +* OpenTelemetry is vendor-neutral and industry standard +* Grafana provides unified dashboards for all data sources +* Cost-effective (no SaaS fees) + +### Negative Consequences + +* More complex than pure SaaS solutions +* ClickHouse requires storage management +* Multiple components to maintain + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Applications │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Go Apps │ │ Python │ │ Node.js │ │ Java │ │ +│ │ (OTEL) │ │ (OTEL) │ │ (OTEL) │ │ (OTEL) │ │ +│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ +└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘ + │ │ │ │ + └──────────────────┬────────────────────────┘ + │ OTLP (gRPC/HTTP) + ▼ + ┌────────────────────────┐ + │ OpenTelemetry │ + │ Collector │ + │ (traces, metrics, │ + │ logs) │ + └───────────┬────────────┘ + │ + ┌───────────────┼───────────────┐ + │ │ │ + ▼ ▼ ▼ +┌─────────────────┐ ┌───────────┐ ┌───────────────┐ +│ ClickStack │ │Prometheus │ │ Grafana │ +│ (ClickHouse) │ │ │ │ │ +│ ┌───────────┐ │ │ Metrics │ │ Dashboards │ +│ │ Traces │ │ │ Storage │ │ Alerting │ +│ ├───────────┤ │ │ │ │ Exploration │ +│ │ Logs │ │ └───────────┘ │ │ +│ └───────────┘ │ └───────────────┘ +└─────────────────┘ │ + │ + ┌────────────────────┤ + │ │ + ┌─────▼─────┐ ┌─────▼─────┐ + │Alertmanager│ │ ntfy │ + │ │ │ (push) │ + └───────────┘ └───────────┘ +``` + +## Component Details + +### Metrics: Prometheus + kube-prometheus-stack + +**Deployment:** HelmRelease via Flux + +```yaml +prometheus: + prometheusSpec: + retention: 14d + retentionSize: 50GB + storage: + volumeClaimTemplate: + spec: + storageClassName: longhorn + storage: 50Gi +``` + +**Key Features:** +- ServiceMonitor auto-discovery for all workloads +- 14-day 
retention with 50GB limit +- PromPP image for enhanced performance +- AlertManager for routing alerts + +### Logs & Traces: ClickStack + +**Why ClickStack over Loki/Tempo:** +- Single storage backend (ClickHouse) for both logs and traces +- Excellent query performance on large datasets +- Built-in correlation between logs and traces +- Lower resource overhead than separate Loki + Tempo + +**Configuration:** +- OTEL Collector receives all telemetry +- Forwards to ClickStack's OTEL collector +- Grafana datasources for querying + +### Telemetry Collection: OpenTelemetry + +**OpenTelemetry Operator:** Manages auto-instrumentation + +```yaml +apiVersion: opentelemetry.io/v1alpha1 +kind: Instrumentation +metadata: + name: auto-instrumentation +spec: + python: + image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python + nodejs: + image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs +``` + +**OpenTelemetry Collector:** Central routing + +```yaml +receivers: + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 + +exporters: + otlphttp: + endpoint: http://clickstack-otel-collector:4318 + +service: + pipelines: + traces: + receivers: [otlp] + exporters: [otlphttp] + metrics: + receivers: [otlp] + exporters: [otlphttp] + logs: + receivers: [otlp] + exporters: [otlphttp] +``` + +### Visualization: Grafana + +**Grafana Operator:** Manages dashboards and datasources as CRDs + +```yaml +apiVersion: grafana.integreatly.org/v1beta1 +kind: GrafanaDashboard +metadata: + name: kubernetes-nodes +spec: + instanceSelector: + matchLabels: + grafana.internal/instance: grafana + url: https://grafana.com/api/dashboards/15758/revisions/44/download +``` + +**Datasources:** +| Type | Source | Purpose | +|------|--------|---------| +| Prometheus | prometheus-operated:9090 | Metrics | +| ClickHouse | clickstack:8123 | Logs & Traces | +| Alertmanager | alertmanager-operated:9093 | Alert status | + +### Alerting 
Pipeline + +``` +Prometheus Rules → Alertmanager → ntfy → Discord/Mobile + └─→ Email (future) +``` + +**Alert Categories:** +- Infrastructure: Node down, disk full, OOM +- Application: Error rate, latency SLO breach +- Security: Gatekeeper violations, vulnerability findings + +## Dashboards + +| Dashboard | Source | Purpose | +|-----------|--------|---------| +| Kubernetes Global | Grafana #15757 | Cluster overview | +| Node Exporter | Grafana #1860 | Node metrics | +| CNPG PostgreSQL | CNPG | Database health | +| Flux | Flux Operator | GitOps status | +| Cilium | Cilium | Network metrics | +| Envoy Gateway | Envoy | Ingress metrics | + +## Resource Allocation + +| Component | CPU Request | Memory Limit | +|-----------|-------------|--------------| +| Prometheus | 100m | 2Gi | +| OTEL Collector | 100m | 512Mi | +| ClickStack | 500m | 2Gi | +| Grafana | 100m | 256Mi | + +## Future Enhancements + +1. **Continuous Profiling** - Pyroscope for Go/Python profiling +2. **SLO Tracking** - Sloth for SLI/SLO automation +3. **Synthetic Monitoring** - Gatus for endpoint probing +4. 
**Cost Attribution** - OpenCost for resource cost tracking + +## References + +* [OpenTelemetry Documentation](https://opentelemetry.io/docs/) +* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability) +* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) +* [Grafana Operator](https://grafana.github.io/grafana-operator/) diff --git a/decisions/0026-storage-strategy.md b/decisions/0026-storage-strategy.md new file mode 100644 index 0000000..3c82195 --- /dev/null +++ b/decisions/0026-storage-strategy.md @@ -0,0 +1,334 @@ +# Tiered Storage Strategy: Longhorn + NFS + +* Status: accepted +* Date: 2026-02-04 +* Deciders: Billy +* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity + +## Context and Problem Statement + +Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements: +- Databases need fast, reliable storage with replication +- Media libraries need large capacity but can tolerate slower access +- AI/ML workloads need both - fast storage for models, large capacity for datasets + +The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage. + +How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads? + +## Decision Drivers + +* Performance - fast IOPS for databases and critical workloads +* Capacity - large storage for media, datasets, and archives +* Reliability - data must survive node failures +* Heterogeneous support - work on both x86_64 and ARM64 (with limitations) +* Backup capability - support for off-cluster backups +* GitOps deployment - Helm charts with Flux management + +## Considered Options + +1. **Longhorn + NFS dual-tier storage** +2. **Rook-Ceph for everything** +3. 
**OpenEBS with Mayastor** +4. **NFS only** +5. **Longhorn only** + +## Decision Outcome + +Chosen option: **Option 1 - Longhorn + NFS dual-tier storage** + +Two storage tiers optimized for different use cases: +- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads +- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage + +### Positive Consequences + +* Right-sized storage for each workload type +* Longhorn provides HA with automatic replication +* NFS provides massive capacity without consuming cluster disk space +* ReadWriteMany (RWX) easy on NFS tier +* Cost-effective - use existing NAS investment + +### Negative Consequences + +* Two storage systems to manage +* NFS is slower (hence `nfs-slow` naming) +* NFS single point of failure (no replication) +* Network dependency for both tiers + +## Architecture + +``` +┌────────────────────────────────────────────────────────────────────────────┐ +│ TIER 1: LONGHORN │ +│ (Fast Distributed Block Storage) │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ khelben │ │ mystra │ │ selune │ │ +│ │ (NVIDIA) │ │ (AMD) │ │ (AMD) │ │ +│ │ │ │ │ │ │ │ +│ │ /var/mnt/ │ │ /var/mnt/ │ │ /var/mnt/ │ │ +│ │ longhorn │ │ longhorn │ │ longhorn │ │ +│ │ (NVMe) │ │ (SSD) │ │ (SSD) │ │ +│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ +│ │ │ │ │ +│ └────────────────┼────────────────┘ │ +│ ▼ │ +│ ┌───────────────────────┐ │ +│ │ Longhorn Manager │ │ +│ │ (Schedules replicas) │ │ +│ └───────────┬───────────┘ │ +│ ▼ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Postgres │ │ Vault │ │Prometheus│ │ClickHouse│ │ +│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │ +│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ +└────────────────────────────────────────────────────────────────────────────┘ + +┌────────────────────────────────────────────────────────────────────────────┐ +│ TIER 2: NFS-SLOW │ +│ (High-Capacity Bulk Storage) │ +│ 
│ +│ ┌────────────────────────────────────────────────────────────────┐ │ +│ │ candlekeep.lab.daviestechlabs.io │ │ +│ │ (External NAS) │ │ +│ │ │ │ +│ │ /kubernetes │ │ +│ │ ├── jellyfin-media/ (1TB+ media library) │ │ +│ │ ├── nextcloud/ (user files) │ │ +│ │ ├── immich/ (photo backups) │ │ +│ │ ├── kavita/ (ebooks, comics, manga) │ │ +│ │ ├── mlflow-artifacts/ (model artifacts) │ │ +│ │ ├── ray-models/ (AI model weights) │ │ +│ │ └── gitea-runner/ (build caches) │ │ +│ └────────────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────┐ │ +│ │ NFS CSI Driver │ │ +│ │ (csi-driver-nfs) │ │ +│ └───────────┬───────────┘ │ +│ ▼ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Jellyfin │ │Nextcloud │ │ Immich │ │ Kavita │ │ +│ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │ +│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ +└────────────────────────────────────────────────────────────────────────────┘ +``` + +## Tier 1: Longhorn Configuration + +### Helm Values + +```yaml +persistence: + defaultClass: true + defaultClassReplicaCount: 2 + defaultDataPath: /var/mnt/longhorn + +defaultSettings: + defaultDataPath: /var/mnt/longhorn + # Allow on vllm-tainted nodes + taintToleration: "dedicated=vllm:NoSchedule" + # Exclude Raspberry Pi nodes (ARM64) + systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64" + # Snapshot retention + defaultRecurringJobs: + - name: nightly-snapshots + task: snapshot + cron: "0 2 * * *" + retain: 7 + - name: weekly-backups + task: backup + cron: "0 3 * * 0" + retain: 4 +``` + +### Longhorn Storage Classes + +| StorageClass | Replicas | Use Case | +|--------------|----------|----------| +| `longhorn` (default) | 2 | General workloads, databases | +| `longhorn-single` | 1 | Development/ephemeral | +| `longhorn-strict` | 3 | Critical databases | + +## Tier 2: NFS Configuration + +### Helm Values (csi-driver-nfs) + +```yaml +storageClass: + create: true + name: nfs-slow + parameters: + 
server: candlekeep.lab.daviestechlabs.io + share: /kubernetes + mountOptions: + - nfsvers=4.1 + - nconnect=16 # Multiple TCP connections for throughput + - hard # Retry indefinitely on failure + - noatime # Don't update access times (performance) + reclaimPolicy: Delete + volumeBindingMode: Immediate +``` + +### Why "nfs-slow"? + +The naming is intentional - it sets correct expectations: +- **Latency:** NAS is over network, higher latency than local NVMe +- **IOPS:** Spinning disks in NAS can't match SSD performance +- **Throughput:** Adequate for streaming media, not for databases +- **Benefit:** Massive capacity without consuming cluster disk space + +## Storage Tier Selection Guide + +| Workload Type | Storage Class | Rationale | +|---------------|---------------|-----------| +| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality | +| Prometheus/ClickHouse | `longhorn` | High write IOPS required | +| Vault | `longhorn` | Security-critical, needs HA | +| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads | +| Photos (Immich) | `nfs-slow` | Bulk storage for photos | +| User files (Nextcloud) | `nfs-slow` | Capacity over speed | +| AI/ML models (Ray) | `nfs-slow` | Large model weights | +| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large | +| MLflow artifacts | `nfs-slow` | Model artifacts storage | + +## Volume Usage by Tier + +### Longhorn Volumes (Performance Tier) + +| Workload | Size | Replicas | Access Mode | +|----------|------|----------|-------------| +| Prometheus | 50Gi | 2 | RWO | +| Vault | 2Gi | 2 | RWO | +| ClickHouse | 100Gi | 2 | RWO | +| Alertmanager | 1Gi | 2 | RWO | + +### NFS Volumes (Capacity Tier) + +| Workload | Size | Access Mode | Notes | +|----------|------|-------------|-------| +| Jellyfin | 2Ti | RWX | Media library | +| Immich | 500Gi | RWX | Photo storage | +| Nextcloud | 1Ti | RWX | User files | +| Kavita | 200Gi | RWX | Ebooks, comics | +| MLflow | 100Gi | RWX | Model artifacts | +| 
Ray models | 200Gi | RWX | AI model weights |
+| Gitea runner | 50Gi | RWO | Build caches |
+| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
+
+## Backup Strategy
+
+### Longhorn Tier
+
+#### Local Snapshots
+
+- **Frequency:** Nightly at 2 AM
+- **Retention:** 7 days
+- **Purpose:** Quick recovery from accidental deletion
+
+#### Off-Cluster Backups
+
+- **Frequency:** Weekly on Sundays at 3 AM
+- **Destination:** S3-compatible storage (MinIO/Backblaze)
+- **Retention:** 4 weeks
+- **Purpose:** Disaster recovery
+
+### NFS Tier
+
+#### NAS-Level Backups
+
+- Handled by the NAS backup solution (snapshots, replication)
+- Not managed by Kubernetes
+- Relies on the NAS RAID configuration for redundancy
+
+### Backup Target Configuration (Longhorn)
+
+```yaml
+# ExternalSecret for backup credentials
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: longhorn-backup-secret
+spec:
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: longhorn-backup-secret
+  data:
+    - secretKey: AWS_ACCESS_KEY_ID
+      remoteRef:
+        key: kv/data/longhorn
+        property: backup_access_key
+    - secretKey: AWS_SECRET_ACCESS_KEY
+      remoteRef:
+        key: kv/data/longhorn
+        property: backup_secret_key
+```
+
+## Node Exclusions (Longhorn Only)
+
+**Raspberry Pi nodes excluded because:**
+- Limited disk I/O performance
+- SD card wear concerns
+- Memory constraints for Longhorn components
+
+**GPU nodes included with tolerations:**
+- `khelben` (NVIDIA) participates in Longhorn storage
+- Taint toleration allows Longhorn to schedule there
+
+## Performance Considerations
+
+### Longhorn Performance
+
+- `khelben` has NVMe - fastest storage node
+- `mystra`/`selune` have SATA SSDs - adequate for most workloads
+- 2 replicas on different nodes survive a single node failure
+- Trade-off: 2x storage consumption
+
+### NFS Performance
+
+- Optimized with `nconnect=16` for parallel connections
+- `noatime` reduces unnecessary write operations
+- Sequential read workloads perform well (media streaming)
+- Random I/O workloads should use Longhorn instead
+
+### When to Choose Each Tier
+
+| Requirement | Longhorn | NFS-Slow |
+|-------------|----------|----------|
+| Low latency | ✅ | ❌ |
+| High IOPS | ✅ | ❌ |
+| Large capacity | ❌ | ✅ |
+| ReadWriteMany (RWX) | Limited | ✅ |
+| Node failure survival | ✅ | ✅ (data lives on the NAS; the NAS itself is a single point of failure) |
+| Kubernetes-native | ✅ | ✅ |
+
+## Monitoring
+
+**Grafana Dashboard:** Longhorn dashboard for:
+- Volume health and replica status
+- IOPS and throughput per volume
+- Disk space utilization per node
+- Backup job status
+
+**Alerts:**
+- Volume degraded (replica count < desired)
+- Disk space low (< 20% free)
+- Backup job failed
+
+## Future Enhancements
+
+1. **NAS high availability** - Second NAS with replication
+2. **Dedicated storage network** - Separate VLAN for storage traffic
+3. **NVMe-oF** - Network NVMe for lower latency
+4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
+5. **S3 tier** - MinIO for object storage workloads
+
+## References
+
+* [Longhorn Documentation](https://longhorn.io/docs/)
+* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
+* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
+* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)
diff --git a/decisions/0027-database-strategy.md b/decisions/0027-database-strategy.md
new file mode 100644
index 0000000..0e1c70b
--- /dev/null
+++ b/decisions/0027-database-strategy.md
@@ -0,0 +1,294 @@
+# Database Strategy with CloudNativePG
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Standardize PostgreSQL deployment for stateful applications
+
+## Context and Problem Statement
+
+Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.
+
+How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
+
+## Decision Drivers
+
+* Operational simplicity - single operator to learn and manage
+* High availability - automatic failover for critical databases
+* Backup integration - consistent backup strategy across all databases
+* GitOps compatibility - declarative database provisioning
+* Resource efficiency - don't over-provision for homelab scale
+
+## Considered Options
+
+1. **CloudNativePG for PostgreSQL**
+2. **Helm charts per application (Bitnami PostgreSQL)**
+3. **External managed database (RDS-style)**
+4. **SQLite where possible + single shared PostgreSQL**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - CloudNativePG for PostgreSQL**
+
+CloudNativePG (CNPG) is a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
+
+### Positive Consequences
+
+* Single operator manages all PostgreSQL instances
+* Declarative Cluster CRD for GitOps deployment
+* Automatic failover with minimal data loss
+* Built-in PgBouncer for connection pooling
+* Prometheus metrics and Grafana dashboards included
+* CNPG is CNCF-listed and actively maintained
+
+### Negative Consequences
+
+* PostgreSQL only (no MySQL/MariaDB support)
+* Operator adds resource overhead
+* Learning curve for CNPG-specific features
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                          CNPG Operator                          │
+│                     (cnpg-system namespace)                     │
+└────────────────────────────┬────────────────────────────────────┘
+                             │ Manages
+                             ▼
+┌──────────────────┬─────────────────┬─────────────────────────────┐
+│                  │                 │                             │
+▼                  ▼                 ▼                             ▼
+┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
+│   gitea-pg   │ │ authentik-db │ │companions-db │ │  mlflow-db   │
+│  (3 replicas)│ │  (3 replicas)│ │ (3 replicas) │ │  (1 replica) │
+│              │ │              │ │              │ │              │
+│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
+│ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │
+│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │ └──────────┘ │
+│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
+│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │              │
+│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
+│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
+│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │              │
+│ │ PgBouncer│ │ │ │ PgBouncer│ │ │ │ PgBouncer│ │ │              │
+│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
+└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
+       │                  │                 │                │
+       └──────────────────┼─────────────────┼────────────────┘
+                          │                 │
+                    ┌─────▼─────┐     ┌─────▼─────┐
+                    │ Longhorn  │     │ Longhorn  │
+                    │   PVCs    │     │  Backups  │
+                    └───────────┘     └───────────┘
+```
+
+## Cluster Configuration Template
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: app-db
+spec:
+  description: "Application PostgreSQL Cluster"
+  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
+  instances: 3
+
+  primaryUpdateStrategy: unsupervised
+
+  postgresql:
+    parameters:
+      shared_buffers: "256MB"
+      effective_cache_size: "768MB"
+      work_mem: "16MB"
+      max_connections: "200"
+
+  # PgBouncer is not a Cluster field: CNPG deploys it through a separate
+  # Pooler resource (kind: Pooler) that references this cluster by name.
+  # Settings intended for that Pooler:
+  #   pgbouncer.poolMode: transaction
+  #   pgbouncer.parameters.default_pool_size: "25"
+
+  # Storage on Longhorn
+  storage:
+    size: 10Gi
+    storageClass: longhorn
+
+  # Monitoring
+  monitoring:
+    enablePodMonitor: true
+    customQueriesConfigMap:
+      - name: cnpg-default-monitoring
+        key: queries
+
+  # Backup configuration
+  backup:
+    barmanObjectStore:
+      destinationPath: "s3://backups/postgres/"
+      s3Credentials:
+        accessKeyId:
+          name: postgres-backup-creds
+          key: ACCESS_KEY_ID
+        secretAccessKey:
+          name: postgres-backup-creds
+          key: SECRET_ACCESS_KEY
+    retentionPolicy: "7d"
+```
+
+## Database Instances
+
+| Cluster | Instances | Storage | PgBouncer | Purpose |
+|---------|-----------|---------|-----------|---------|
+| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
+| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
+| `companions-db` | 3 | 10Gi | Yes | Chat app data |
+| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
+| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
+
+## Connection Patterns
+
+### Service Discovery
+
+CNPG creates services for each cluster:
+
+| Service | Purpose |
+|---------|---------|
+| `-rw` | Read-write (primary only) |
+| `-ro` | Read-only (any replica) |
+| `-r` | Read (any instance) |
+| `-pooler-rw` | PgBouncer read-write |
+| `-pooler-ro` | PgBouncer read-only |
+
+### Application Configuration
+
+```yaml
+# Application config using CNPG service
+DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
+```
+
+### Credentials via External Secrets
+
+```yaml
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: app-db-credentials
+spec:
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: app-db-credentials
+  data:
+    - secretKey: username
+      remoteRef:
+        key: kv/data/app-db
+        property: username
+    - secretKey: password
+      remoteRef:
+        key: kv/data/app-db
+        property: password
+```
+
+## High Availability
+
+### Automatic Failover
+
+- CNPG monitors primary health continuously
+- If the primary fails, a replica is promoted automatically
+- Applications reconnect through the service abstraction
+- Typical failover time: 10-30 seconds
+
+### Replica Synchronization
+
+- Streaming replication from primary to replicas
+- Synchronous replication available for zero data loss (trade-off: latency)
+- Default: asynchronous with acceptable RPO
+
+## Backup Strategy
+
+### Continuous WAL Archiving
+
+- Write-Ahead Log streamed to S3
+- Point-in-time recovery capability
+- RPO: seconds (last WAL segment)
+
+### Base Backups
+
+- **Frequency:** Daily
+- **Retention:** 7 days
+- **Destination:** S3-compatible (MinIO/Backblaze)
+
+### Recovery Testing
+
+- Periodic restore to test cluster
+- Validate backup integrity
+- Document recovery procedure
+
+## Monitoring
+
+### Prometheus Metrics
+
+- Connection count and pool utilization
+- Transaction rate and latency
+- Replication lag
+- Disk usage and WAL generation
+
+### Grafana Dashboard
+
+CNPG provides an official dashboard:
+- Cluster health overview
+- Per-instance metrics
+- Replication status
+- Backup job history
+
+### Alerts
+
+```yaml
+- alert: PostgreSQLDown
+  expr: cnpg_collector_up == 0
+  for: 5m
+  labels:
+    severity: critical
+
+- alert: PostgreSQLReplicationLag
+  expr: cnpg_pg_replication_lag_seconds > 30
+  for: 5m
+  labels:
+    severity: warning
+
+- alert: PostgreSQLConnectionsHigh
+  expr: cnpg_pg_stat_activity_count / cnpg_pg_settings_max_connections > 0.8
+  for: 5m
+  labels:
+    severity: warning
+```
+
+## When NOT to Use CloudNativePG
+
+| Scenario | Alternative |
+|----------|-------------|
+| Simple app, no HA needed | Embedded SQLite |
+| MySQL/MariaDB required | Application-specific chart |
+| Massive scale | External managed database |
+| Non-relational data | Redis/Valkey, MongoDB |
+
+## PostgreSQL Version Policy
+
+- Use latest stable major version (currently 17)
+- Minor version updates: automatic (`primaryUpdateStrategy: unsupervised`)
+- Major version upgrades: manual with testing
+
+## Future Enhancements
+
+1. **Cross-cluster replication** - DR site replica
+2. **Logical replication** - Selective table sync between clusters
+3. **TimescaleDB extension** - Time-series optimization for metrics
+4. **PgVector extension** - Vector storage alternative to Milvus
+
+## References
+
+* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
+* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
+* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)
diff --git a/decisions/0028-authentik-sso-strategy.md b/decisions/0028-authentik-sso-strategy.md
new file mode 100644
index 0000000..6b43b3c
--- /dev/null
+++ b/decisions/0028-authentik-sso-strategy.md
@@ -0,0 +1,415 @@
+# Authentik Single Sign-On Strategy
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Centralize authentication across all homelab applications
+
+## Context and Problem Statement
+
+A growing homelab with many self-hosted applications creates authentication sprawl - each app has its own user database, passwords, and session management. This leads to a poor user experience and security risks.
+
+How do we centralize authentication while maintaining flexibility for different application requirements?
+
+## Decision Drivers
+
+* Single sign-on (SSO) for all applications
+* Centralized user management and lifecycle
+* MFA enforcement across all applications
+* Open-source and self-hosted
+* Low resource requirements for homelab scale
+
+## Considered Options
+
+1. **Authentik as OIDC/SAML provider**
+2. **Keycloak**
+3. **Authelia + LDAP**
+4. **Per-application local auth**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Authentik as OIDC/SAML provider**
+
+Authentik provides modern identity management with OIDC, SAML 2.0, LDAP, and SCIM support. Its flow-based authentication engine allows customizable login experiences.
+ +### Positive Consequences + +* Clean, modern UI for users and admins +* Flexible flow-based authentication +* Built-in MFA (TOTP, WebAuthn, SMS, Duo) +* Proxy provider for legacy apps +* SCIM for user provisioning +* Active development and community + +### Negative Consequences + +* Python-based (higher memory than Go alternatives) +* PostgreSQL dependency +* Some enterprise features require outpost pods + +## Architecture + +``` + ┌─────────────────────┐ + │ User │ + └──────────┬──────────┘ + │ + ▼ + ┌─────────────────────┐ + │ Ingress/Traefik │ + └──────────┬──────────┘ + │ + ┌──────────────────────────┼──────────────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ + │ auth.lab.io │ │ app.lab.io │ │ app2.lab.io │ + │ (Authentik) │ │ (OIDC-enabled) │ │ (Proxy-auth) │ + └─────────────────┘ └────────┬────────┘ └────────┬────────┘ + │ │ │ + │ ┌──────────────────┘ │ + │ │ OIDC/OAuth2 │ + │ │ │ + ▼ ▼ ▼ + ┌─────────────────────────────────────────────────────────────────┐ + │ Authentik │ + │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ + │ │ Server │ │ Worker │ │ Outpost │◄───────────┤ + │ │ (API) │ │ (Tasks) │ │ (Proxy) │ Forward │ + │ └──────┬──────┘ └──────┬──────┘ └─────────────┘ Auth │ + │ │ │ │ + │ └────────┬───────┘ │ + │ │ │ + │ ┌──────▼──────┐ │ + │ │ Redis │ │ + │ │ (Cache) │ │ + │ └─────────────┘ │ + │ │ + └─────────────────────────────┬──────────────────────────────────┘ + │ + ┌──────▼──────┐ + │ PostgreSQL │ + │ (CNPG) │ + └─────────────┘ +``` + +## Provider Configuration + +### OIDC Applications + +| Application | Provider Type | Claims Override | Notes | +|-------------|---------------|-----------------|-------| +| Gitea | OIDC | None | Admin mapping via group | +| Affine | OIDC | `email_verified: true` | See ADR-0016 | +| Companions | OIDC | None | Custom provider | +| Grafana | OIDC | `role` claim | Admin role mapping | +| ArgoCD | OIDC | `groups` claim | RBAC integration | +| MLflow | Proxy | N/A | 
Forward auth | +| Open WebUI | OIDC | None | LLM interface | + +### Provider Template + +```yaml +# Example OAuth2/OIDC Provider +apiVersion: authentik.io/v1 +kind: OAuth2Provider +metadata: + name: gitea +spec: + name: Gitea + authorizationFlow: default-authorization-flow + clientId: ${GITEA_CLIENT_ID} + clientSecret: ${GITEA_CLIENT_SECRET} + redirectUris: + - https://git.lab.daviestechlabs.io/user/oauth2/authentik/callback + signingKey: authentik-self-signed + propertyMappings: + - authentik-default-openid + - authentik-default-email + - authentik-default-profile +``` + +## Authentication Flows + +### Default Login Flow + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Login │────▶│ Username │────▶│ Password │────▶│ MFA │ +│ Stage │ │ Stage │ │ Stage │ │ Stage │ +└─────────────┘ └─────────────┘ └─────────────┘ └──────┬──────┘ + │ + ▼ + ┌─────────────┐ + │ Session │ + │ Created │ + └─────────────┘ +``` + +### Flow Customization + +- **Admin users:** Require hardware key (WebAuthn) +- **API access:** Service account tokens +- **External users:** Email verification + MFA enrollment + +## Group-Based Authorization + +### Group Structure + +``` +authentik-admins → Authentik admin access +├── cluster-admins → Full cluster access +├── gitea-admins → Git admin +├── monitoring-admins → Grafana admin +└── ai-platform-admins → AI/ML admin + +authentik-users → Standard user access +├── developers → Git write, monitoring read +├── ml-engineers → AI/ML services access +└── guests → Read-only access +``` + +### Application Group Mapping + +```yaml +# Grafana OIDC config +grafana: + auth.generic_oauth: + role_attribute_path: | + contains(groups[*], 'monitoring-admins') && 'Admin' || + contains(groups[*], 'developers') && 'Editor' || + 'Viewer' +``` + +## Outpost Deployment + +### Embedded Outpost (Default) + +- Runs within Authentik server +- Handles LDAP and Radius +- Suitable for small deployments + +### Standalone Outpost (Proxy) + +```yaml 
+apiVersion: apps/v1 +kind: Deployment +metadata: + name: authentik-outpost-proxy +spec: + replicas: 2 + template: + spec: + containers: + - name: outpost + image: ghcr.io/goauthentik/proxy + ports: + - containerPort: 9000 + name: http + - containerPort: 9443 + name: https + env: + - name: AUTHENTIK_HOST + value: "https://auth.lab.daviestechlabs.io/" + - name: AUTHENTIK_TOKEN + valueFrom: + secretKeyRef: + name: authentik-outpost-token + key: token +``` + +### Forward Auth Integration + +For applications without OIDC support: + +```yaml +# Traefik ForwardAuth middleware +apiVersion: traefik.io/v1alpha1 +kind: Middleware +metadata: + name: authentik-forward-auth +spec: + forwardAuth: + address: http://authentik-outpost-proxy.authentik.svc:9000/outpost.goauthentik.io/auth/traefik + trustForwardHeader: true + authResponseHeaders: + - X-authentik-username + - X-authentik-groups + - X-authentik-email +``` + +## MFA Enforcement + +### Policies + +| User Group | MFA Requirement | +|------------|-----------------| +| Admins | WebAuthn (hardware key) required | +| Developers | TOTP or WebAuthn required | +| Guests | MFA optional | + +### Device Registration + +- Self-service MFA enrollment +- Recovery codes generated at setup +- Admin can reset user MFA + +## SCIM User Provisioning + +### When to Use + +- Automatic user creation in downstream apps +- Group membership sync +- User deprovisioning on termination + +### Supported Apps + +Currently using SCIM provisioning for: +- None (manual user creation in apps) + +Future consideration for: +- Gitea organization sync +- Grafana team sync + +## Secrets Management Integration + +### Vault Integration + +```yaml +# External Secret for Authentik DB credentials +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: authentik-db-credentials + namespace: authentik +spec: + secretStoreRef: + kind: ClusterSecretStore + name: vault + target: + name: authentik-db-credentials + data: + - secretKey: password + 
remoteRef: + key: kv/data/authentik + property: db_password + - secretKey: secret_key + remoteRef: + key: kv/data/authentik + property: secret_key +``` + +## Monitoring + +### Prometheus Metrics + +Authentik exposes metrics at `/metrics`: + +- `authentik_login_duration_seconds` +- `authentik_login_attempt_total` +- `authentik_outpost_connected` +- `authentik_provider_authorization_total` + +### Grafana Dashboard + +- Login success/failure rates +- Active sessions +- Provider usage +- MFA adoption rates + +### Alerts + +```yaml +- alert: AuthentikHighLoginFailures + expr: rate(authentik_login_attempt_total{result="failure"}[5m]) > 10 + for: 5m + labels: + severity: warning + annotations: + summary: High login failure rate detected + +- alert: AuthentikOutpostDisconnected + expr: authentik_outpost_connected == 0 + for: 5m + labels: + severity: critical +``` + +## Backup and Recovery + +### What to Back Up + +1. PostgreSQL database (via CNPG) +2. Media files (if custom branding is used) +3. Blueprint exports (configuration as code) + +### Blueprints + +Export configuration as YAML for GitOps: + +```yaml +# authentik-blueprints/providers/gitea.yaml +version: 1 +metadata: + name: Gitea OIDC Provider +entries: + - model: authentik_providers_oauth2.oauth2provider + identifiers: + name: gitea + attrs: + authorization_flow: !Find [authentik_flows.flow, [slug, default-authorization-flow]] + # ... 
provider config +``` + +## Integration Patterns + +### Pattern 1: Native OIDC + +Best for: Modern applications with OIDC support + +``` +App ──OIDC──▶ Authentik ──▶ App (with user info) +``` + +### Pattern 2: Proxy Forward Auth + +Best for: Legacy apps without SSO support + +``` +Request ──▶ Traefik ──ForwardAuth──▶ Authentik Outpost + │ │ + │◀──────Header injection─────┘ + │ + ▼ + App (reads X-authentik-* headers) +``` + +### Pattern 3: LDAP Compatibility + +Best for: Apps requiring LDAP + +``` +App ──LDAP──▶ Authentik Outpost (LDAP) ──▶ Authentik +``` + +## Resource Requirements + +| Component | CPU Request | Memory Request | +|-----------|-------------|----------------| +| Server | 100m | 500Mi | +| Worker | 100m | 500Mi | +| Redis | 50m | 128Mi | +| Outpost (each) | 50m | 128Mi | + +## Future Enhancements + +1. **Passkey/FIDO2** - Passwordless authentication +2. **External IdP federation** - Google/GitHub as upstream IdP +3. **Conditional access** - Device trust, network location policies +4. **Session revocation** - Force logout from all apps + +## References + +* [Authentik Documentation](https://goauthentik.io/docs/) +* [Authentik GitHub](https://github.com/goauthentik/authentik) +* [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)