docs: add ADRs 0025-0028 for infrastructure patterns
- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)

decisions/0025-observability-stack.md

# Observability Stack Architecture

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab

## Context and Problem Statement

A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.

How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?

## Decision Drivers

* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled

## Considered Options

1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**

## Decision Outcome

Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**

Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified storage for logs and traces with excellent query performance, and the OpenTelemetry Collector routes all telemetry data.

### Positive Consequences

* The Prometheus ecosystem is mature, with extensive ServiceMonitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and an industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)

### Negative Consequences

* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain

## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                            Applications                            │
│ ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│ │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │             │
│ │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │             │
│ └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘             │
└──────┼─────────────┼─────────────┼─────────────┼───────────────────┘
       │             │             │             │
       └─────────────┴──────┬──────┴─────────────┘
                            │ OTLP (gRPC/HTTP)
                            ▼
                ┌────────────────────────┐
                │     OpenTelemetry      │
                │       Collector        │
                │   (traces, metrics,    │
                │         logs)          │
                └───────────┬────────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
    ┌───────────────┐ ┌───────────┐ ┌───────────────┐
    │   ClickStack  │ │Prometheus │ │    Grafana    │
    │  (ClickHouse) │ │           │ │               │
    │ ┌───────────┐ │ │  Metrics  │ │  Dashboards   │
    │ │  Traces   │ │ │  Storage  │ │   Alerting    │
    │ ├───────────┤ │ │           │ │  Exploration  │
    │ │   Logs    │ │ └─────┬─────┘ └───────────────┘
    │ └───────────┘ │       │
    └───────────────┘       │
                  ┌─────────┴──────────┐
                  ▼                    ▼
           ┌────────────┐       ┌───────────┐
           │Alertmanager│       │   ntfy    │
           │            │       │  (push)   │
           └────────────┘       └───────────┘
```

## Component Details

### Metrics: Prometheus + kube-prometheus-stack

**Deployment:** HelmRelease via Flux

```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```

**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with a 50GB size limit
- PromPP image for enhanced performance
- Alertmanager for routing alerts
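
ServiceMonitor auto-discovery means each workload only needs a small CRD for Prometheus to start scraping it. A minimal sketch (the app name, labels, and port name here are illustrative, not from this repo):

```yaml
# Hypothetical ServiceMonitor — matched by the Prometheus instance's
# serviceMonitorSelector, then scraped automatically
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics     # named port on the Service
      interval: 30s
```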

### Logs & Traces: ClickStack

**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo deployments

**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying

### Telemetry Collection: OpenTelemetry

**OpenTelemetry Operator:** Manages auto-instrumentation

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
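
Workloads opt in to auto-instrumentation via a pod annotation that the operator watches. A sketch with a hypothetical deployment name:

```yaml
# Hypothetical Deployment fragment — the annotation tells the OTEL Operator
# to inject the Python auto-instrumentation sidecar/init container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-python-app
spec:
  selector:
    matchLabels:
      app: example-python-app
  template:
    metadata:
      labels:
        app: example-python-app
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: example-python-app:latest
```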

**OpenTelemetry Collector:** Central routing

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```

### Visualization: Grafana

**Grafana Operator:** Manages dashboards and datasources as CRDs

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```

**Datasources:**

| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |

### Alerting Pipeline

```
Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                               └─→ Email (future)
```

**Alert Categories:**
- Infrastructure: node down, disk full, OOM
- Application: error rate, latency SLO breaches
- Security: Gatekeeper violations, vulnerability findings
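
Alert rules enter the pipeline as PrometheusRule resources picked up by the operator. A minimal sketch for the infrastructure category (the rule name and expression are illustrative):

```yaml
# Hypothetical PrometheusRule — one infrastructure alert as an example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          # node-exporter target unreachable for 5 minutes
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
```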

## Dashboards

| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |

## Resource Allocation

| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |

## Future Enhancements

1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking

## References

* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)
334
decisions/0026-storage-strategy.md
Normal file
334
decisions/0026-storage-strategy.md
Normal file
@@ -0,0 +1,334 @@
|
|||||||
|
# Tiered Storage Strategy: Longhorn + NFS

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Provide tiered storage for Kubernetes workloads balancing performance and capacity

## Context and Problem Statement

Kubernetes requires a storage solution for stateful applications like databases, message queues, and AI model caches. Different workloads have vastly different requirements:
- Databases need fast, reliable storage with replication
- Media libraries need large capacity but can tolerate slower access
- AI/ML workloads need both - fast storage for models, large capacity for datasets

The homelab has heterogeneous nodes including x86_64 servers and ARM64 Raspberry Pis, plus an external NAS for bulk storage.

How do we provide tiered storage that balances performance, reliability, and capacity for diverse homelab workloads?

## Decision Drivers

* Performance - fast IOPS for databases and critical workloads
* Capacity - large storage for media, datasets, and archives
* Reliability - data must survive node failures
* Heterogeneous support - work on both x86_64 and ARM64 (with limitations)
* Backup capability - support for off-cluster backups
* GitOps deployment - Helm charts with Flux management

## Considered Options

1. **Longhorn + NFS dual-tier storage**
2. **Rook-Ceph for everything**
3. **OpenEBS with Mayastor**
4. **NFS only**
5. **Longhorn only**

## Decision Outcome

Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**

Two storage tiers optimized for different use cases:
- **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
- **`nfs-slow`**: High-capacity NFS storage on the external NAS for media, datasets, and bulk storage

### Positive Consequences

* Right-sized storage for each workload type
* Longhorn provides HA with automatic replication
* NFS provides massive capacity without consuming cluster disk space
* ReadWriteMany (RWX) is easy on the NFS tier
* Cost-effective - uses the existing NAS investment

### Negative Consequences

* Two storage systems to manage
* NFS is slower (hence the `nfs-slow` naming)
* NFS is a single point of failure (no replication)
* Network dependency for both tiers

## Architecture

```
┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 1: LONGHORN                              │
│                      (Fast Distributed Block Storage)                      │
│     ┌─────────────┐       ┌─────────────┐       ┌─────────────┐            │
│     │   khelben   │       │   mystra    │       │   selune    │            │
│     │  (NVIDIA)   │       │    (AMD)    │       │    (AMD)    │            │
│     │             │       │             │       │             │            │
│     │  /var/mnt/  │       │  /var/mnt/  │       │  /var/mnt/  │            │
│     │  longhorn   │       │  longhorn   │       │  longhorn   │            │
│     │   (NVMe)    │       │    (SSD)    │       │    (SSD)    │            │
│     └──────┬──────┘       └──────┬──────┘       └──────┬──────┘            │
│            │                     │                     │                   │
│            └─────────────────────┼─────────────────────┘                   │
│                                  ▼                                         │
│                      ┌───────────────────────┐                             │
│                      │   Longhorn Manager    │                             │
│                      │ (Schedules replicas)  │                             │
│                      └───────────┬───────────┘                             │
│                                  ▼                                         │
│     ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐              │
│     │ Postgres │   │  Vault   │   │Prometheus│   │ClickHouse│              │
│     │   PVC    │   │   PVC    │   │   PVC    │   │   PVC    │              │
│     └──────────┘   └──────────┘   └──────────┘   └──────────┘              │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                              TIER 2: NFS-SLOW                              │
│                        (High-Capacity Bulk Storage)                        │
│                                                                            │
│   ┌────────────────────────────────────────────────────────────────┐       │
│   │                candlekeep.lab.daviestechlabs.io                │       │
│   │                         (External NAS)                         │       │
│   │                                                                │       │
│   │   /kubernetes                                                  │       │
│   │   ├── jellyfin-media/     (1TB+ media library)                 │       │
│   │   ├── nextcloud/          (user files)                         │       │
│   │   ├── immich/             (photo backups)                      │       │
│   │   ├── kavita/             (ebooks, comics, manga)              │       │
│   │   ├── mlflow-artifacts/   (model artifacts)                    │       │
│   │   ├── ray-models/         (AI model weights)                   │       │
│   │   └── gitea-runner/       (build caches)                       │       │
│   └────────────────────────────────────────────────────────────────┘       │
│                                  │                                         │
│                                  ▼                                         │
│                      ┌───────────────────────┐                             │
│                      │    NFS CSI Driver     │                             │
│                      │   (csi-driver-nfs)    │                             │
│                      └───────────┬───────────┘                             │
│                                  ▼                                         │
│     ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐              │
│     │ Jellyfin │   │Nextcloud │   │  Immich  │   │  Kavita  │              │
│     │   PVC    │   │   PVC    │   │   PVC    │   │   PVC    │              │
│     └──────────┘   └──────────┘   └──────────┘   └──────────┘              │
└────────────────────────────────────────────────────────────────────────────┘
```

## Tier 1: Longhorn Configuration

### Helm Values

```yaml
persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  defaultDataPath: /var/mnt/longhorn

defaultSettings:
  defaultDataPath: /var/mnt/longhorn
  # Allow on vllm-tainted nodes
  taintToleration: "dedicated=vllm:NoSchedule"
  # Exclude Raspberry Pi nodes (ARM64)
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
  # Snapshot retention
  defaultRecurringJobs:
    - name: nightly-snapshots
      task: snapshot
      cron: "0 2 * * *"
      retain: 7
    - name: weekly-backups
      task: backup
      cron: "0 3 * * 0"
      retain: 4
```

### Longhorn Storage Classes

| StorageClass | Replicas | Use Case |
|--------------|----------|----------|
| `longhorn` (default) | 2 | General workloads, databases |
| `longhorn-single` | 1 | Development/ephemeral |
| `longhorn-strict` | 3 | Critical databases |
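
The non-default classes are plain Kubernetes StorageClasses pointing at the Longhorn CSI driver with a different replica count. A sketch for `longhorn-strict` (the `staleReplicaTimeout` value is an illustrative assumption, not taken from this repo):

```yaml
# Hypothetical StorageClass — three replicas for critical databases
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"   # minutes before a stale replica is cleaned up
allowVolumeExpansion: true
```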

## Tier 2: NFS Configuration

### Helm Values (csi-driver-nfs)

```yaml
storageClass:
  create: true
  name: nfs-slow
  parameters:
    server: candlekeep.lab.daviestechlabs.io
    share: /kubernetes
  mountOptions:
    - nfsvers=4.1
    - nconnect=16  # Multiple TCP connections for throughput
    - hard         # Retry indefinitely on failure
    - noatime      # Don't update access times (performance)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
```

### Why "nfs-slow"?

The naming is intentional - it sets correct expectations:
- **Latency:** The NAS is reached over the network, so latency is higher than local NVMe
- **IOPS:** Spinning disks in the NAS can't match SSD performance
- **Throughput:** Adequate for streaming media, not for databases
- **Benefit:** Massive capacity without consuming cluster disk space

## Storage Tier Selection Guide

| Workload Type | Storage Class | Rationale |
|---------------|---------------|-----------|
| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
| Prometheus/ClickHouse | `longhorn` | High write IOPS required |
| Vault | `longhorn` | Security-critical, needs HA |
| Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
| Photos (Immich) | `nfs-slow` | Bulk storage for photos |
| User files (Nextcloud) | `nfs-slow` | Capacity over speed |
| AI/ML models (Ray) | `nfs-slow` | Large model weights |
| Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
| MLflow artifacts | `nfs-slow` | Model artifact storage |

## Volume Usage by Tier

### Longhorn Volumes (Performance Tier)

| Workload | Size | Replicas | Access Mode |
|----------|------|----------|-------------|
| Prometheus | 50Gi | 2 | RWO |
| Vault | 2Gi | 2 | RWO |
| ClickHouse | 100Gi | 2 | RWO |
| Alertmanager | 1Gi | 2 | RWO |

### NFS Volumes (Capacity Tier)

| Workload | Size | Access Mode | Notes |
|----------|------|-------------|-------|
| Jellyfin | 2Ti | RWX | Media library |
| Immich | 500Gi | RWX | Photo storage |
| Nextcloud | 1Ti | RWX | User files |
| Kavita | 200Gi | RWX | Ebooks, comics |
| MLflow | 100Gi | RWX | Model artifacts |
| Ray models | 200Gi | RWX | AI model weights |
| Gitea runner | 50Gi | RWO | Build caches |
| Gitea DB (CNPG) | 10Gi | RWO | Capacity-optimized |

## Backup Strategy

### Longhorn Tier

#### Local Snapshots

- **Frequency:** Nightly at 2 AM
- **Retention:** 7 days
- **Purpose:** Quick recovery from accidental deletion

#### Off-Cluster Backups

- **Frequency:** Weekly on Sundays at 3 AM
- **Destination:** S3-compatible storage (MinIO/Backblaze)
- **Retention:** 4 weeks
- **Purpose:** Disaster recovery

### NFS Tier

#### NAS-Level Backups

- Handled by the NAS backup solution (snapshots, replication)
- Not managed by Kubernetes
- Relies on the NAS RAID configuration for redundancy

### Backup Target Configuration (Longhorn)

```yaml
# ExternalSecret for backup credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: longhorn-backup-secret
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: longhorn-backup-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: kv/data/longhorn
        property: backup_access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: kv/data/longhorn
        property: backup_secret_key
```

## Node Exclusions (Longhorn Only)

**Raspberry Pi nodes excluded because:**
- Limited disk I/O performance
- SD card wear concerns
- Memory constraints for Longhorn components

**GPU nodes included with tolerations:**
- `khelben` (NVIDIA) participates in Longhorn storage
- A taint toleration allows Longhorn to schedule there

## Performance Considerations

### Longhorn Performance

- `khelben` has NVMe - the fastest storage node
- `mystra`/`selune` have SATA SSDs - adequate for most workloads
- 2 replicas across different nodes ensure survival of a single node failure
- Trade-off: 2x storage consumption

### NFS Performance

- Optimized with `nconnect=16` for parallel connections
- `noatime` reduces unnecessary write operations
- Sequential read workloads perform well (media streaming)
- Random I/O workloads should use Longhorn instead

### When to Choose Each Tier

| Requirement | Longhorn | NFS-Slow |
|-------------|----------|----------|
| Low latency | ✅ | ❌ |
| High IOPS | ✅ | ❌ |
| Large capacity | ❌ | ✅ |
| ReadWriteMany (RWX) | Limited | ✅ |
| Node failure survival | ✅ | ✅ (NAS HA) |
| Kubernetes-native | ✅ | ✅ |
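
In practice, choosing a tier is just setting `storageClassName` on the PVC. A sketch for a capacity-tier claim (the claim name mirrors the Jellyfin row above; exact names in the repo may differ):

```yaml
# Hypothetical PVC — bulk media on the capacity tier, RWX for shared mounts
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-slow   # swap to "longhorn" for the performance tier
  resources:
    requests:
      storage: 2Ti
```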

## Monitoring

**Grafana Dashboard:** The Longhorn dashboard covers:
- Volume health and replica status
- IOPS and throughput per volume
- Disk space utilization per node
- Backup job status

**Alerts:**
- Volume degraded (replica count < desired)
- Disk space low (< 20% free)
- Backup job failed
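
A sketch of the volume-degraded alert as a Prometheus rule, assuming Longhorn's standard metrics (in `longhorn_volume_robustness`, a value of 2 conventionally means degraded — verify against your Longhorn version):

```yaml
# Hypothetical alert rule — assumes Longhorn's built-in metrics exporter
- alert: LonghornVolumeDegraded
  expr: longhorn_volume_robustness == 2   # 2 = degraded (assumption)
  for: 10m
  labels:
    severity: warning
```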

## Future Enhancements

1. **NAS high availability** - Second NAS with replication
2. **Dedicated storage network** - Separate VLAN for storage traffic
3. **NVMe-oF** - Network NVMe for lower latency
4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
5. **S3 tier** - MinIO for object storage workloads

## References

* [Longhorn Documentation](https://longhorn.io/docs/)
* [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
* [NFS CSI Driver](https://github.com/kubernetes-csi/csi-driver-nfs)
* [Talos Longhorn Integration](https://www.talos.dev/v1.6/kubernetes-guides/configuration/storage/)

decisions/0027-database-strategy.md

# Database Strategy with CloudNativePG

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications

## Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use a different database solution, creating operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?

## Decision Drivers

* Operational simplicity - a single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale

## Considered Options

1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**

## Decision Outcome

Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.

### Positive Consequences

* A single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* Built-in PgBouncer support for connection pooling
* Prometheus metrics and Grafana dashboards included
* CNPG is CNCF-listed and actively maintained

### Negative Consequences

* PostgreSQL only (no MySQL/MariaDB support)
* The operator adds resource overhead
* Learning curve for CNPG-specific features

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          CNPG Operator                          │
│                     (cnpg-system namespace)                     │
└────────────────────────────┬────────────────────────────────────┘
                             │ Manages
       ┌─────────────────┬───┴─────────────┬──────────────────┐
       ▼                 ▼                 ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   gitea-pg   │  │ authentik-db │  │companions-db │  │  mlflow-db   │
│ (3 replicas) │  │ (3 replicas) │  │ (3 replicas) │  │ (1 replica)  │
│              │  │              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │ └──────────┘ │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │              │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │              │
│ │ PgBouncer│ │  │ │ PgBouncer│ │  │ │ PgBouncer│ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
       │                 │                 │                 │
       └─────────────────┼─────────────────┼─────────────────┘
                         │                 │
                   ┌─────▼─────┐     ┌─────▼─────┐
                   │ Longhorn  │     │ Longhorn  │
                   │   PVCs    │     │  Backups  │
                   └───────────┘     └───────────┘
```

## Cluster Configuration Template

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3

  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"

  # Connection pooling (PgBouncer) is configured via a separate CNPG Pooler
  # resource, not on the Cluster itself (pool mode: transaction, pool size: 25)

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```
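
CNPG exposes PgBouncer through a dedicated Pooler resource bound to a Cluster rather than through fields on the Cluster itself. A sketch matching the template's transaction pooling and pool size of 25 (the resource name is illustrative):

```yaml
# Hypothetical Pooler — fronts the app-db Cluster's primary with PgBouncer
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  instances: 1
  type: rw                     # pool the read-write (primary) endpoint
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```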

## Database Instances

| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |

## Connection Patterns

### Service Discovery

CNPG creates services for each cluster:

| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |

### Application Configuration

```yaml
# Application config using a CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

### Credentials via External Secrets

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```

## High Availability

### Automatic Failover

- CNPG monitors primary health continuously
- If the primary fails, a replica is automatically promoted
- Applications reconnect via the service abstraction
- Typical failover time: 10-30 seconds

### Replica Synchronization

- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous with an acceptable RPO

## Backup Strategy

### Continuous WAL Archiving

- Write-Ahead Log streamed to S3
- Point-in-time recovery capability
- RPO: seconds (last WAL segment)

### Base Backups

- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
|
||||||
|
|
||||||
|
### Recovery Testing
|
||||||
|
|
||||||
|
- Periodic restore to test cluster
|
||||||
|
- Validate backup integrity
|
||||||
|
- Document recovery procedure
|
||||||
|
|
||||||
|
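With daily base backups and 7-day retention, pruning reduces to a date cutoff. CNPG's retention policy handles this for you; the sketch below only illustrates the arithmetic, and the `backups_to_delete` helper is hypothetical:

```python
from datetime import date, timedelta

def backups_to_delete(backup_dates, today, retention_days=7):
    """Return backup dates older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)

today = date(2026, 2, 4)
dates = [today - timedelta(days=n) for n in range(10)]  # 10 daily backups
print(backups_to_delete(dates, today))
# The two oldest backups (8 and 9 days old) fall outside the 7-day window
```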
## Monitoring

### Prometheus Metrics

- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation

### Grafana Dashboard

CNPG provides an official dashboard covering:

- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history

### Alerts

```yaml
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag_seconds > 30
  for: 5m
  labels:
    severity: warning

- alert: PostgreSQLConnectionsHigh
  expr: cnpg_pg_stat_activity_count / cnpg_pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: warning
```

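The `PostgreSQLConnectionsHigh` expression is simply a saturation ratio. The same check in plain code, to build intuition for the 0.8 threshold (the `connections_alert` function is illustrative):

```python
def connections_alert(active: int, max_connections: int, threshold: float = 0.8) -> bool:
    """Mirror of: cnpg_pg_stat_activity_count / cnpg_pg_settings_max_connections > 0.8"""
    return active / max_connections > threshold

print(connections_alert(85, 100))  # → True  (85% utilization fires the alert)
print(connections_alert(40, 100))  # → False
```

Note the strict inequality: exactly 80% utilization does not fire.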
## When NOT to Use CloudNativePG

| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |

## PostgreSQL Version Policy

- Use the latest stable major version (currently 17)
- Minor version updates: automatic (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual, with testing

## Future Enhancements

1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **PgVector extension** - Vector storage alternative to Milvus

## References

* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)

415
decisions/0028-authentik-sso-strategy.md
Normal file
@@ -0,0 +1,415 @@

# Authentik Single Sign-On Strategy

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Centralize authentication across all homelab applications

## Context and Problem Statement

A growing homelab with many self-hosted applications creates authentication sprawl - each app has its own user database, passwords, and session management. This creates a poor user experience and security risks.

How do we centralize authentication while maintaining flexibility for different application requirements?

## Decision Drivers

* Single sign-on (SSO) for all applications
* Centralized user management and lifecycle
* MFA enforcement across all applications
* Open-source and self-hosted
* Low resource requirements for homelab scale

## Considered Options

1. **Authentik as OIDC/SAML provider**
2. **Keycloak**
3. **Authelia + LDAP**
4. **Per-application local auth**

## Decision Outcome

Chosen option: **Option 1 - Authentik as OIDC/SAML provider**

Authentik provides modern identity management with OIDC, SAML 2.0, LDAP, and SCIM support. Its flow-based authentication engine allows customizable login experiences.

### Positive Consequences

* Clean, modern UI for users and admins
* Flexible flow-based authentication
* Built-in MFA (TOTP, WebAuthn, SMS, Duo)
* Proxy provider for legacy apps
* SCIM for user provisioning
* Active development and community

### Negative Consequences

* Python-based (higher memory footprint than Go alternatives)
* PostgreSQL dependency
* Some features require separate outpost pods

## Architecture

```
┌──────────┐
│   User   │
└────┬─────┘
     │
     ▼
┌─────────────────┐
│ Ingress/Traefik │
└────────┬────────┘
         │
    ┌────┴─────────────────┬────────────────────┐
    ▼                      ▼                    ▼
┌──────────────┐   ┌────────────────┐   ┌────────────────┐
│ auth.lab.io  │   │   app.lab.io   │   │  app2.lab.io   │
│ (Authentik)  │   │ (OIDC-enabled) │   │  (Proxy-auth)  │
└──────────────┘   └───────┬────────┘   └───────┬────────┘
                           │ OIDC/OAuth2        │ Forward Auth
                           ▼                    ▼
┌─────────────────────────────────────────────────────────┐
│                        Authentik                        │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐          │
│   │  Server  │    │  Worker  │    │ Outpost  │          │
│   │  (API)   │    │ (Tasks)  │    │ (Proxy)  │          │
│   └────┬─────┘    └────┬─────┘    └──────────┘          │
│        └───────┬───────┘                                │
│                ▼                                        │
│          ┌──────────┐                                   │
│          │  Redis   │                                   │
│          │ (Cache)  │                                   │
│          └──────────┘                                   │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                      ┌─────────────┐
                      │ PostgreSQL  │
                      │   (CNPG)    │
                      └─────────────┘
```

## Provider Configuration

### OIDC Applications

| Application | Provider Type | Claims Override | Notes |
|-------------|---------------|-----------------|-------|
| Gitea | OIDC | None | Admin mapping via group |
| Affine | OIDC | `email_verified: true` | See ADR-0016 |
| Companions | OIDC | None | Custom provider |
| Grafana | OIDC | `role` claim | Admin role mapping |
| ArgoCD | OIDC | `groups` claim | RBAC integration |
| MLflow | Proxy | N/A | Forward auth |
| Open WebUI | OIDC | None | LLM interface |

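The "Claims Override" column amounts to merging per-application overrides into the default OIDC claims. A hypothetical sketch of how the Affine override behaves - the claim names are standard OIDC, but the `build_claims` merge helper is illustrative, not an Authentik API:

```python
def build_claims(base, overrides=None):
    """Return ID-token claims with per-application overrides applied."""
    claims = dict(base)
    claims.update(overrides or {})
    return claims

base = {"sub": "billy", "email": "billy@lab.io", "email_verified": False}
# Affine rejects unverified emails, so its provider forces the claim (see ADR-0016):
print(build_claims(base, {"email_verified": True}))
```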
### Provider Template

```yaml
# Example OAuth2/OIDC Provider
apiVersion: authentik.io/v1
kind: OAuth2Provider
metadata:
  name: gitea
spec:
  name: Gitea
  authorizationFlow: default-authorization-flow
  clientId: ${GITEA_CLIENT_ID}
  clientSecret: ${GITEA_CLIENT_SECRET}
  redirectUris:
    - https://git.lab.daviestechlabs.io/user/oauth2/authentik/callback
  signingKey: authentik-self-signed
  propertyMappings:
    - authentik-default-openid
    - authentik-default-email
    - authentik-default-profile
```

## Authentication Flows

### Default Login Flow

```
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌─────────┐
│  Login  │───▶│ Username │───▶│ Password │───▶│   MFA   │
│  Stage  │    │  Stage   │    │  Stage   │    │  Stage  │
└─────────┘    └──────────┘    └──────────┘    └────┬────┘
                                                    │
                                                    ▼
                                             ┌────────────┐
                                             │  Session   │
                                             │  Created   │
                                             └────────────┘
```

### Flow Customization

- **Admin users:** Require hardware key (WebAuthn)
- **API access:** Service account tokens
- **External users:** Email verification + MFA enrollment

## Group-Based Authorization

### Group Structure

```
authentik-admins          → Authentik admin access
├── cluster-admins        → Full cluster access
├── gitea-admins          → Git admin
├── monitoring-admins     → Grafana admin
└── ai-platform-admins    → AI/ML admin

authentik-users           → Standard user access
├── developers            → Git write, monitoring read
├── ml-engineers          → AI/ML services access
└── guests                → Read-only access
```

### Application Group Mapping

```yaml
# Grafana OIDC config
grafana:
  auth.generic_oauth:
    role_attribute_path: >-
      contains(groups[*], 'monitoring-admins') && 'Admin' ||
      contains(groups[*], 'developers') && 'Editor' ||
      'Viewer'
```

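The JMESPath expression above resolves to the first matching role. Equivalent logic in plain code, useful for sanity-checking the precedence (admin outranks editor even when a user is in both groups); the `grafana_role` function is just a mirror for illustration:

```python
def grafana_role(groups):
    """Mirror of the role_attribute_path expression: Admin > Editor > Viewer."""
    if "monitoring-admins" in groups:
        return "Admin"
    if "developers" in groups:
        return "Editor"
    return "Viewer"

print(grafana_role(["developers", "monitoring-admins"]))  # → Admin
print(grafana_role(["guests"]))                           # → Viewer
```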
## Outpost Deployment

### Embedded Outpost (Default)

- Runs within the Authentik server
- Handles LDAP and RADIUS
- Suitable for small deployments

### Standalone Outpost (Proxy)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: authentik-outpost-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: authentik-outpost-proxy
  template:
    metadata:
      labels:
        app: authentik-outpost-proxy
    spec:
      containers:
        - name: outpost
          image: ghcr.io/goauthentik/proxy
          ports:
            - name: http
              containerPort: 9000
            - name: https
              containerPort: 9443
          env:
            - name: AUTHENTIK_HOST
              value: "https://auth.lab.daviestechlabs.io/"
            - name: AUTHENTIK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: authentik-outpost-token
                  key: token
```

### Forward Auth Integration

For applications without OIDC support:

```yaml
# Traefik ForwardAuth middleware
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
spec:
  forwardAuth:
    address: http://authentik-outpost-proxy.authentik.svc:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
      - X-authentik-groups
      - X-authentik-email
```

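Behind this middleware, the application only ever sees the injected headers. A minimal sketch of what a legacy app (or a thin wrapper in front of it) does with them - the header names are those configured above, but the `identity_from_headers` function and the pipe-separated group format are assumptions for illustration:

```python
def identity_from_headers(headers):
    """Extract the authenticated identity from forward-auth response headers."""
    username = headers.get("X-authentik-username")
    if not username:
        raise PermissionError("request did not pass through forward auth")
    # Assumption: multiple groups arrive pipe-separated in one header value
    groups = [g for g in headers.get("X-authentik-groups", "").split("|") if g]
    return {"username": username,
            "email": headers.get("X-authentik-email"),
            "groups": groups}

hdrs = {"X-authentik-username": "billy",
        "X-authentik-email": "billy@lab.io",
        "X-authentik-groups": "developers|monitoring-admins"}
print(identity_from_headers(hdrs))
```

The app must only be reachable through the proxy; a request that bypasses Traefik could otherwise forge these headers.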
## MFA Enforcement

### Policies

| User Group | MFA Requirement |
|------------|-----------------|
| Admins | WebAuthn (hardware key) required |
| Developers | TOTP or WebAuthn required |
| Guests | MFA optional |

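The policy table reduces to a lookup from group to accepted authenticator types. A sketch of the check - the group and device-type names mirror the table, but the `MFA_POLICY` map and `mfa_satisfied` function are illustrative, not the Authentik policy engine:

```python
MFA_POLICY = {
    "authentik-admins": {"webauthn"},          # hardware key only
    "developers":       {"totp", "webauthn"},  # either factor accepted
    "guests":           set(),                 # MFA optional
}

def mfa_satisfied(group, registered_devices):
    """True if the user's registered devices satisfy their group's MFA policy."""
    required = MFA_POLICY[group]
    return not required or bool(required & set(registered_devices))

print(mfa_satisfied("authentik-admins", ["totp"]))  # → False (needs WebAuthn)
print(mfa_satisfied("developers", ["totp"]))        # → True
print(mfa_satisfied("guests", []))                  # → True
```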
### Device Registration

- Self-service MFA enrollment
- Recovery codes generated at setup
- Admins can reset a user's MFA

## SCIM User Provisioning

### When to Use

- Automatic user creation in downstream apps
- Group membership sync
- User deprovisioning on termination

### Supported Apps

Currently using SCIM provisioning for:

- None (manual user creation in apps)

Future consideration for:

- Gitea organization sync
- Grafana team sync

## Secrets Management Integration

### Vault Integration

```yaml
# External Secret for Authentik DB credentials
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: authentik-db-credentials
  namespace: authentik
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: authentik-db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: kv/data/authentik
        property: db_password
    - secretKey: secret_key
      remoteRef:
        key: kv/data/authentik
        property: secret_key
```

## Monitoring

### Prometheus Metrics

Authentik exposes metrics at `/metrics`:

- `authentik_login_duration_seconds`
- `authentik_login_attempt_total`
- `authentik_outpost_connected`
- `authentik_provider_authorization_total`

### Grafana Dashboard

- Login success/failure rates
- Active sessions
- Provider usage
- MFA adoption rates

### Alerts

```yaml
- alert: AuthentikHighLoginFailures
  expr: rate(authentik_login_attempt_total{result="failure"}[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: High login failure rate detected

- alert: AuthentikOutpostDisconnected
  expr: authentik_outpost_connected == 0
  for: 5m
  labels:
    severity: critical
```

## Backup and Recovery

### What to Backup

1. PostgreSQL database (via CNPG)
2. Media files (if custom branding)
3. Blueprint exports (configuration as code)

### Blueprints

Export configuration as YAML for GitOps:

```yaml
# authentik-blueprints/providers/gitea.yaml
version: 1
metadata:
  name: Gitea OIDC Provider
entries:
  - model: authentik_providers_oauth2.oauth2provider
    identifiers:
      name: gitea
    attrs:
      authorization_flow: !Find [authentik_flows.flow, [slug, default-authorization-flow]]
      # ... provider config
```

## Integration Patterns

### Pattern 1: Native OIDC

Best for: modern applications with OIDC support

```
App ──OIDC──▶ Authentik ──▶ App (with user info)
```

### Pattern 2: Proxy Forward Auth

Best for: legacy apps without SSO support

```
Request ──▶ Traefik ──ForwardAuth──▶ Authentik Outpost
               │                            │
               │◀──────Header injection─────┘
               │
               ▼
        App (reads X-authentik-* headers)
```

### Pattern 3: LDAP Compatibility

Best for: apps requiring LDAP

```
App ──LDAP──▶ Authentik Outpost (LDAP) ──▶ Authentik
```

## Resource Requirements

| Component | CPU Request | Memory Request |
|-----------|-------------|----------------|
| Server | 100m | 500Mi |
| Worker | 100m | 500Mi |
| Redis | 50m | 128Mi |
| Outpost (each) | 50m | 128Mi |

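For capacity planning, the table sums to a modest footprint. A quick sketch of the arithmetic, assuming the two proxy outpost replicas from the Deployment above (the data structure is illustrative):

```python
components = {               # (CPU millicores, memory Mi) requests from the table
    "server":  (100, 500),
    "worker":  (100, 500),
    "redis":   (50, 128),
    "outpost": (50, 128),    # per replica
}
outpost_replicas = 2

cpu = sum(c for name, (c, _) in components.items() if name != "outpost")
mem = sum(m for name, (_, m) in components.items() if name != "outpost")
cpu += components["outpost"][0] * outpost_replicas
mem += components["outpost"][1] * outpost_replicas

print(f"total requests: {cpu}m CPU, {mem}Mi memory")
# → total requests: 350m CPU, 1384Mi memory
```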
## Future Enhancements

1. **Passkey/FIDO2** - Passwordless authentication
2. **External IdP federation** - Google/GitHub as upstream IdP
3. **Conditional access** - Device trust, network location policies
4. **Session revocation** - Force logout from all apps

## References

* [Authentik Documentation](https://goauthentik.io/docs/)
* [Authentik GitHub](https://github.com/goauthentik/authentik)
* [OIDC Specification](https://openid.net/specs/openid-connect-core-1_0.html)