docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
2026-02-04 08:55:15 -05:00
parent a128c265e4
commit b43c80153c
4 changed files with 1282 additions and 0 deletions


@@ -0,0 +1,239 @@
# Observability Stack Architecture
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
## Context and Problem Statement
A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
## Decision Drivers
* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled
## Considered Options
1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
## Decision Outcome
Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
### Positive Consequences
* Prometheus ecosystem is mature with extensive service monitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)
### Negative Consequences
* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain
## Architecture
```
┌─────────────────────────────────────────────┐
│                Applications                 │
│  Go / Python / Node.js / Java (OTEL SDKs)   │
└──────────────────────┬──────────────────────┘
                       │ OTLP (gRPC/HTTP)
                       ▼
            ┌─────────────────────┐
            │   OpenTelemetry     │
            │     Collector       │
            │ (traces, metrics,   │
            │       logs)         │
            └──────────┬──────────┘
       ┌───────────────┼───────────────┐
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  ClickStack │ │ Prometheus  │ │   Grafana   │
│(ClickHouse) │ │   Metrics   │ │ Dashboards  │
│ Logs+Traces │ │   Storage   │ │  Alerting   │
└─────────────┘ └──────┬──────┘ └─────────────┘
                       │
                       ▼
               ┌──────────────┐   ┌──────────┐
               │ Alertmanager │──▶│   ntfy   │
               │              │   │  (push)  │
               └──────────────┘   └──────────┘
```
## Component Details
### Metrics: Prometheus + kube-prometheus-stack
**Deployment:** HelmRelease via Flux
```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```
**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- AlertManager for routing alerts
### Logs & Traces: ClickStack
**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo
**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying
### Telemetry Collection: OpenTelemetry
**OpenTelemetry Operator:** Manages auto-instrumentation
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
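Workloads opt in per pod; a minimal sketch of the annotation the operator watches for (the Deployment name and image below are placeholders, not from the repo):

```yaml
# Hypothetical workload opting into Python auto-instrumentation.
# Assumes the Instrumentation resource above exists in the same namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Triggers sidecar-free SDK injection via an init container
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: example/app:latest  # placeholder image
```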
**OpenTelemetry Collector:** Central routing
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
### Visualization: Grafana
**Grafana Operator:** Manages dashboards and datasources as CRDs
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```
**Datasources:**
| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
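With the Grafana Operator in play, the datasources can also be declared as CRDs; a sketch for the Prometheus entry, reusing the instance selector from the dashboard example (the exact `spec.datasource` values here are assumptions, not from the repo):

```yaml
# Hypothetical GrafanaDatasource; auth and jsonData fields omitted.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
```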
### Alerting Pipeline
```
Prometheus Rules → Alertmanager ─┬─→ ntfy → Discord/Mobile
                                 └─→ Email (future)
```
**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
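A minimal Alertmanager routing sketch for the pipeline above. It assumes an ntfy-alertmanager bridge service, since ntfy does not natively accept Alertmanager's webhook payload; the service name and port are placeholders:

```yaml
# Alertmanager config fragment (sketch). The webhook URL points at a
# hypothetical ntfy-alertmanager bridge, not at ntfy directly.
route:
  receiver: ntfy
  group_by: ["alertname", "namespace"]
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy-alertmanager.observability.svc:8080
```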
## Dashboards
| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |
## Resource Allocation
| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
## Future Enhancements
1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking
## References
* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)


@@ -0,0 +1,334 @@
# Tiered Storage Strategy: Longhorn + NFS
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity
## Context and Problem Statement
Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets
The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.
How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?
## Decision Drivers
* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management
## Considered Options
1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**
## Decision Outcome
Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
### Positive Consequences
* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) easy on NFS tier
* Cost-effective - use existing NAS investment
### Negative Consequences
* Two storage systems to manage
* NFS is slower (hence the `nfs-slow` name)
* NFS is a single point of failure (no replication)
* Network dependency for both tiers
## Architecture
```
TIER 1: LONGHORN (fast distributed block storage)

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   khelben   │  │   mystra    │  │   selune    │
│  (NVIDIA)   │  │    (AMD)    │  │    (AMD)    │
│ /var/mnt/   │  │ /var/mnt/   │  │ /var/mnt/   │
│  longhorn   │  │  longhorn   │  │  longhorn   │
│   (NVMe)    │  │    (SSD)    │  │    (SSD)    │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       └────────────────┼────────────────┘
                        ▼
            ┌───────────────────────┐
            │    Longhorn Manager   │
            │  (schedules replicas) │
            └───────────┬───────────┘
                        ▼
  Postgres PVC   Vault PVC   Prometheus PVC   ClickHouse PVC

TIER 2: NFS-SLOW (high-capacity bulk storage)

candlekeep.lab.daviestechlabs.io (external NAS)
/kubernetes
├── jellyfin-media/    (1TB+ media library)
├── nextcloud/         (user files)
├── immich/            (photo backups)
├── kavita/            (ebooks, comics, manga)
├── mlflow-artifacts/  (model artifacts)
├── ray-models/        (AI model weights)
└── gitea-runner/      (build caches)
         │
         ▼
┌─────────────────┐
│  NFS CSI Driver │
│ (csi-driver-nfs)│
└────────┬────────┘
         ▼
  Jellyfin PVC   Nextcloud PVC   Immich PVC   Kavita PVC
```
## Tier 1: Longhorn Configuration
### Helm Values
```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
# Snapshot retention
defaultRecurringJobs:
  - name: nightly-snapshots
    task: snapshot
    cron: "0 2 * * *"
    retain: 7
  - name: weekly-backups
    task: backup
    cron: "0 3 * * 0"
    retain: 4
```
### Longhorn Storage Classes
| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
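A sketch of how the non-default `longhorn-strict` class could be declared; the parameter names follow the Longhorn CSI driver, while the stale-replica timeout is an illustrative assumption:

```yaml
# Sketch: 3-replica class for critical databases.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"  # minutes; illustrative value
```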
## Tier 2: NFS Configuration
### Helm Values (csi-driver-nfs)
```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```
### Why "nfs-slow"?
The naming is intentional - it sets correct expectations:
- **Latency:** NAS is over network, higher latency than local NVMe
- **IOPS:** Spinning disks in NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space
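Consuming the tier is a plain PVC; a sketch for a media claim (the claim name and namespace are hypothetical):

```yaml
# Hypothetical RWX claim on the capacity tier.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
  namespace: jellyfin
spec:
  accessModes:
    - ReadWriteMany   # RWX is cheap on NFS, unlike block storage
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 2Ti
```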
## Storage Tier Selection Guide
| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifacts storage |
## Volume Usage by Tier
### Longhorn Volumes (Performance Tier)
| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |
### NFS Volumes (Capacity Tier)
| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |
## Backup Strategy
### Longhorn Tier
#### Local Snapshots
- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion
#### Off-Cluster Backups
- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery
### NFS Tier
#### NAS-Level Backups
- Handled by NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on NAS RAID configuration for redundancy
### Backup Target Configuration (Longhorn)
```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```
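Pointing Longhorn at the backup target is then a matter of two settings referencing that secret; a values sketch in which the bucket name and region are placeholders:

```yaml
# Longhorn Helm values fragment (sketch; bucket/region are placeholders).
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/
  backupTargetCredentialSecret: longhorn-backup-secret
```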
## Node Exclusions (Longhorn Only)
**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components
**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- Taint toleration allows Longhorn to schedule there
## Performance Considerations
### Longhorn Performance
- `khelben` has NVMe - fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensures single node failure survival
- Trade-off: 2x storage consumption
### NFS Performance
- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead
### When to Choose Each Tier
| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Node failure survival | ✅ | ✅ (external NAS; the NAS itself is a SPOF) |
| Kubernetes-native | ✅ | ✅ |
## Monitoring
**Grafana Dashboard:** Longhorn dashboard for:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status
**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
## Future Enhancements
1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads
## References
* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)


@@ -0,0 +1,294 @@
# Database Strategy with CloudNativePG
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications
## Context and Problem Statement
Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.
How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
## Decision Drivers
* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale
## Considered Options
1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**
## Decision Outcome
Chosen option: **Option 1 - CloudNativePG for PostgreSQL**
CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
### Positive Consequences
* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* Built-in PgBouncer for connection pooling
* Prometheus metrics and Grafana dashboards included
* CNPG is CNCF-listed and actively maintained
### Negative Consequences
* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features
## Architecture
```
         ┌───────────────────────────────────────────┐
         │        CNPG Operator (cnpg-system)        │
         └─────────────────────┬─────────────────────┘
                               │ manages
       ┌───────────────┬───────┴───────┬───────────────┐
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  gitea-pg   │ │authentik-db │ │companions-db│ │  mlflow-db  │
│ 3 replicas  │ │ 3 replicas  │ │ 3 replicas  │ │  1 replica  │
│ + PgBouncer │ │ + PgBouncer │ │ + PgBouncer │ │             │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       └───────────────┴───────┬───────┴───────────────┘
                               ▼
              Longhorn PVCs + S3/Longhorn backups
```
## Cluster Configuration Template
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
  # PgBouncer connection pooling is configured via a separate Pooler
  # resource, not inline in the Cluster spec
  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn
  # Monitoring (creates a PodMonitor for Prometheus)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries
  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```
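CNPG implements PgBouncer as a separate `Pooler` resource targeting a cluster; a sketch matched to the template above, which is what materializes the `app-db-pooler-rw` service referenced later:

```yaml
# PgBouncer pooler in front of the app-db cluster (sketch).
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  instances: 1
  type: rw                 # pool against the primary
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```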
## Database Instances
| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
## Connection Patterns
### Service Discovery
CNPG creates services for each cluster:
| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |
### Application Configuration
```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```
### Credentials via External Secrets
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```
## High Availability
### Automatic Failover
- CNPG monitors primary health continuously
- If primary fails, automatic promotion of replica
- Application reconnection via service abstraction
- Typical failover time: 10-30 seconds
### Replica Synchronization
- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous with acceptable RPO
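Synchronous replication can be opted into per cluster; a fragment of the Cluster spec (the field names are CNPG's, the values illustrative):

```yaml
# Cluster spec fragment (sketch): require one synchronous standby.
# Writes wait for the sync replica, trading commit latency for zero data loss.
spec:
  minSyncReplicas: 1
  maxSyncReplicas: 1
```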
## Backup Strategy
### Continuous WAL Archiving
- Write-Ahead Log streamed to S3
- Point-in-time recovery capability
- RPO: seconds (last WAL segment)
### Base Backups
- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
### Recovery Testing
- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
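A restore test can be expressed as a throwaway cluster bootstrapped from the object store; a sketch reusing the backup destination from the template above (the restore cluster's name is hypothetical):

```yaml
# Hypothetical restore-test cluster recovered from barman backups.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: app-db          # refers to the external cluster below
  externalClusters:
    - name: app-db
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```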
## Monitoring
### Prometheus Metrics
- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation
### Grafana Dashboard
CNPG provides official dashboard:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history
### Alerts
```yaml
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical
- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag_seconds > 30
  for: 5m
  labels:
    severity: warning
- alert: PostgreSQLConnectionsHigh
  expr: cnpg_pg_stat_activity_count / cnpg_pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: warning
```
## When NOT to Use CloudNativePG
| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |
## PostgreSQL Version Policy
- Use latest stable major version (currently 17)
- Minor version updates: automatic (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual with testing
## Future Enhancements
1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **PgVector extension** - Vector storage alternative to Milvus
## References
* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)


@@ -0,0 +1,415 @@
# Authentik Single Sign-On Strategy
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Centralize authentication across all homelab applications
## Context and Problem Statement
A growing homelab with many self-hosted applications creates authentication sprawl - each app has its own user database, passwords, and session management. This creates poor user experience and security risks.
How do we centralize authentication while maintaining flexibility for different application requirements?
## Decision Drivers
* Single sign-on (SSO) for all applications
* Centralized user management and lifecycle
* MFA enforcement across all applications
* Open-source and self-hosted
* Low resource requirements for homelab scale
## Considered Options
1. **Authentik as OIDC/SAML provider**
2. **Keycloak**
3. **Authelia + LDAP**
4. **Per-application local auth**
## Decision Outcome
Chosen option: **Option 1 - Authentik as OIDC/SAML provider**
Authentik provides modern identity management with OIDC, SAML 2.0, LDAP, and SCIM support. Its flow-based authentication engine allows customizable login experiences.
### Positive Consequences
* Clean, modern UI for users and admins
* Flexible flow-based authentication
* Built-in MFA (TOTP, WebAuthn, SMS, Duo)
* Proxy provider for legacy apps
* SCIM for user provisioning
* Active development and community
### Negative Consequences
* Python-based (higher memory than Go alternatives)
* PostgreSQL dependency
* Proxy, LDAP, and RADIUS providers require additional outpost pods
## Architecture
```
                        ┌──────────┐
                        │   User   │
                        └────┬─────┘
                             ▼
                  ┌─────────────────────┐
                  │   Ingress/Traefik   │
                  └──────────┬──────────┘
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   auth.lab.io   │ │   app.lab.io    │ │   app2.lab.io   │
│   (Authentik)   │ │ (OIDC-enabled)  │ │  (Proxy-auth)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │              OIDC/OAuth2         Forward Auth
         └──────────┬─────────┴───────────────────┘
                    ▼
┌──────────────────────────────────────────────────┐
│                    Authentik                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│  │  Server  │  │  Worker  │  │   Outpost    │    │
│  │  (API)   │  │ (Tasks)  │  │   (Proxy)    │    │
│  └────┬─────┘  └────┬─────┘  └──────────────┘    │
│       └──────┬──────┘                            │
│              ▼                                   │
│       ┌─────────────┐                            │
│       │Redis (Cache)│                            │
│       └─────────────┘                            │
└─────────────────────────┬────────────────────────┘
                          ▼
                  ┌───────────────┐
                  │  PostgreSQL   │
                  │    (CNPG)     │
                  └───────────────┘
```
## Provider Configuration
### OIDC Applications
| Application | Provider Type | Claims Override | Notes |
|-------------|---------------|-----------------|-------|
| Gitea | OIDC | None | Admin mapping via group |
| Affine | OIDC | `email_verified: true` | See ADR-0016 |
| Companions | OIDC | None | Custom provider |
| Grafana | OIDC | `role` claim | Admin role mapping |
| ArgoCD | OIDC | `groups` claim | RBAC integration |
| MLflow | Proxy | N/A | Forward auth |
| Open WebUI | OIDC | None | LLM interface |
### Provider Template
```yaml
# Example OAuth2/OIDC Provider (illustrative pseudo-manifest; Authentik
# providers are configured via blueprints or the admin UI, not a CRD)
apiVersion: authentik.io/v1
kind: OAuth2Provider
metadata:
  name: gitea
spec:
  name: Gitea
  authorizationFlow: default-authorization-flow
  clientId: ${GITEA_CLIENT_ID}
  clientSecret: ${GITEA_CLIENT_SECRET}
  redirectUris:
    - https://git.lab.daviestechlabs.io/user/oauth2/authentik/callback
  signingKey: authentik-self-signed
  propertyMappings:
    - authentik-default-openid
    - authentik-default-email
    - authentik-default-profile
```
## Authentication Flows
### Default Login Flow
```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Login   │──▶│ Username │──▶│ Password │──▶│   MFA    │
│  Stage   │   │  Stage   │   │  Stage   │   │  Stage   │
└──────────┘   └──────────┘   └──────────┘   └─────┬────┘
                                                   ▼
                                             ┌──────────┐
                                             │ Session  │
                                             │ Created  │
                                             └──────────┘
```
### Flow Customization
- **Admin users:** Require hardware key (WebAuthn)
- **API access:** Service account tokens
- **External users:** Email verification + MFA enrollment
## Group-Based Authorization
### Group Structure
```
authentik-admins        → Authentik admin access
├── cluster-admins      → Full cluster access
├── gitea-admins        → Git admin
├── monitoring-admins   → Grafana admin
└── ai-platform-admins  → AI/ML admin

authentik-users         → Standard user access
├── developers          → Git write, monitoring read
├── ml-engineers        → AI/ML services access
└── guests              → Read-only access
```
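Groups can also be captured as blueprints so they survive a rebuild; a sketch creating one group from the tree above (the blueprint model name follows Authentik's blueprint schema, and the file layout is an assumption):

```yaml
# Hypothetical blueprint declaring one of the groups above.
version: 1
metadata:
  name: Homelab Groups
entries:
  - model: authentik_core.group
    identifiers:
      name: monitoring-admins
    attrs:
      is_superuser: false
```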
### Application Group Mapping
```yaml
# Grafana OIDC config
grafana:
  auth.generic_oauth:
    role_attribute_path: |
      contains(groups[*], 'monitoring-admins') && 'Admin' ||
      contains(groups[*], 'developers') && 'Editor' ||
      'Viewer'
```
## Outpost Deployment
### Embedded Outpost (Default)
- Runs within Authentik server
- Handles LDAP and RADIUS
- Suitable for small deployments
### Standalone Outpost (Proxy)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: authentik-outpost-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: authentik-outpost-proxy
  template:
    metadata:
      labels:
        app: authentik-outpost-proxy
    spec:
      containers:
        - name: outpost
          image: ghcr.io/goauthentik/proxy
          ports:
            - containerPort: 9000
              name: http
            - containerPort: 9443
              name: https
          env:
            - name: AUTHENTIK_HOST
              value: "https://auth.lab.daviestechlabs.io/"
            - name: AUTHENTIK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: authentik-outpost-token
                  key: token
```
### Forward Auth Integration
For applications without OIDC support:
```yaml
# Traefik ForwardAuth middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
spec:
  forwardAuth:
    address: http://authentik-outpost-proxy.authentik.svc:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
      - X-authentik-groups
      - X-authentik-email
```
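Attaching the middleware to a legacy app's route; a sketch for MLflow, which the provider table lists under forward auth (hostname and service port are assumptions):

```yaml
# Hypothetical route protected by the forward-auth middleware above.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: mlflow
  namespace: mlflow
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`mlflow.lab.daviestechlabs.io`)
      kind: Rule
      middlewares:
        - name: authentik-forward-auth
          namespace: authentik
      services:
        - name: mlflow
          port: 5000   # placeholder port
```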
## MFA Enforcement
### Policies
| User Group | MFA Requirement |
|------------|-----------------|
| Admins | WebAuthn (hardware key) required |
| Developers | TOTP or WebAuthn required |
| Guests | MFA optional |
### Device Registration
- Self-service MFA enrollment
- Recovery codes generated at setup
- Admin can reset user MFA
## SCIM User Provisioning
### When to Use
- Automatic user creation in downstream apps
- Group membership sync
- User deprovisioning on termination
### Supported Apps
Currently using SCIM provisioning for:
- None (manual user creation in apps)
Future consideration for:
- Gitea organization sync
- Grafana team sync
## Secrets Management Integration
### Vault Integration
```yaml
# External Secret for Authentik DB credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: authentik-db-credentials
  namespace: authentik
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: authentik-db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: kv/data/authentik
        property: db_password
    - secretKey: secret_key
      remoteRef:
        key: kv/data/authentik
        property: secret_key
```
## Monitoring
### Prometheus Metrics
Authentik exposes metrics at `/metrics`:
- `authentik_login_duration_seconds`
- `authentik_login_attempt_total`
- `authentik_outpost_connected`
- `authentik_provider_authorization_total`
### Grafana Dashboard
- Login success/failure rates
- Active sessions
- Provider usage
- MFA adoption rates
### Alerts
```yaml
- alert: AuthentikHighLoginFailures
  expr: rate(authentik_login_attempt_total{result="failure"}[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: High login failure rate detected
- alert: AuthentikOutpostDisconnected
  expr: authentik_outpost_connected == 0
  for: 5m
  labels:
    severity: critical
```
## Backup and Recovery
### What to Backup
1. PostgreSQL database (via CNPG)
2. Media files (if custom branding)
3. Blueprint exports (configuration as code)
### Blueprints
Export configuration as YAML for GitOps:
```yaml
# authentik-blueprints/providers/gitea.yaml
version: 1
metadata:
  name: Gitea OIDC Provider
entries:
  - model: authentik_providers_oauth2.oauth2provider
    identifiers:
      name: gitea
    attrs:
      authorization_flow: !Find [authentik_flows.flow, [slug, default-authorization-flow]]
      # ... provider config
```
## Integration Patterns
### Pattern 1: Native OIDC
Best for: Modern applications with OIDC support
```
App ──OIDC──▶ Authentik ──▶ App (with user info)
```
### Pattern 2: Proxy Forward Auth
Best for: Legacy apps without SSO support
```
Request ──▶ Traefik ──ForwardAuth──▶ Authentik Outpost
               │                            │
               │◀───── Header injection ────┘
               ▼
              App (reads X-authentik-* headers)
```
### Pattern 3: LDAP Compatibility
Best for: Apps requiring LDAP
```
App ──LDAP──▶ Authentik Outpost (LDAP) ──▶ Authentik
```
## Resource Requirements
| Component | CPU Request | Memory Request |
|-----------|-------------|----------------|
| Server | 100m | 500Mi |
| Worker | 100m | 500Mi |
| Redis | 50m | 128Mi |
| Outpost (each) | 50m | 128Mi |
## Future Enhancements
1. **Passkey/FIDO2** - Passwordless authentication
2. **External IdP federation** - Google/GitHub as upstream IdP
3. **Conditional access** - Device trust, network location policies
4. **Session revocation** - Force logout from all apps
## References
* [Authentik Documentation](https://goauthentik.io/docs/)
* [Authentik GitHub](https://github.com/goauthentik/authentik)
* [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)