docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
2026-02-04 08:55:15 -05:00
parent a128c265e4
commit b43c80153c
4 changed files with 1282 additions and 0 deletions
--- a/decisions/0025-observability-stack.md
+++ b/decisions/0025-observability-stack.md
@@ -0,0 +1,239 @@
+# Observability Stack Architecture
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
+
+## Context and Problem Statement
+
+A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
+
+How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
+
+## Decision Drivers
+
+* Three pillars coverage - metrics, logs, and traces all addressed
+* Unified visualization - single pane of glass for all telemetry
+* Resource efficiency - don't overwhelm the cluster with observability overhead
+* OpenTelemetry compatibility - future-proof instrumentation standard
+* GitOps deployment - all configuration version-controlled
+
+## Considered Options
+
+1. **Prometheus + ClickStack + OpenTelemetry Collector**
+2. **Prometheus + Loki + Tempo (PLT Stack)**
+3. **Datadog/New Relic (SaaS)**
+4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
+
+Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
+
+### Positive Consequences
+
+* Prometheus ecosystem is mature, with broad ServiceMonitor support across charts and operators
+* ClickHouse provides fast querying for logs and traces at scale
+* OpenTelemetry is vendor-neutral and industry standard
+* Grafana provides unified dashboards for all data sources
+* Cost-effective (no SaaS fees)
+
+### Negative Consequences
+
+* More complex than pure SaaS solutions
+* ClickHouse requires storage management
+* Multiple components to maintain
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Applications                                  │
+│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
+│  │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │            │
+│  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │            │
+│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
+└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
+        │             │             │             │
+        └──────────────────┬────────────────────────┘
+                           │ OTLP (gRPC/HTTP)
+                           ▼
+              ┌────────────────────────┐
+              │  OpenTelemetry         │
+              │  Collector             │
+              │  (traces, metrics,     │
+              │   logs)                │
+              └───────────┬────────────┘
+                          │
+          ┌───────────────┼───────────────┐
+          │               │               │
+          ▼               ▼               ▼
+┌─────────────────┐ ┌───────────┐ ┌───────────────┐
+│   ClickStack    │ │Prometheus │ │   Grafana     │
+│   (ClickHouse)  │ │           │ │               │
+│  ┌───────────┐  │ │ Metrics   │ │  Dashboards   │
+│  │  Traces   │  │ │ Storage   │ │  Alerting     │
+│  ├───────────┤  │ │           │ │  Exploration  │
+│  │   Logs    │  │ └───────────┘ │               │
+│  └───────────┘  │               └───────────────┘
+└─────────────────┘                      │
+                                         │
+                    ┌────────────────────┤
+                    │                    │
+              ┌─────▼─────┐        ┌─────▼─────┐
+              │Alertmanager│        │   ntfy    │
+              │           │        │ (push)    │
+              └───────────┘        └───────────┘
+```
+
+## Component Details
+
+### Metrics: Prometheus + kube-prometheus-stack
+
+**Deployment:** HelmRelease via Flux
+
+```yaml
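+prometheus:
+  prometheusSpec:
+    retention: 14d          # time-based retention
+    retentionSize: 50GB     # size-based cap; roughly 46.6GiB, just under the volume size
+    storageSpec:            # kube-prometheus-stack expects storageSpec, not storage
+      volumeClaimTemplate:
+        spec:
+          storageClassName: longhorn
+          resources:
+            requests:
+              storage: 50Gi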
+```
+
+**Key Features:**
+- ServiceMonitor auto-discovery for all workloads
+- 14-day retention with 50GB limit
+- PromPP image for enhanced performance
+- Alertmanager for routing alerts
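+
+As a minimal sketch of how a workload could opt in to scraping, the ServiceMonitor below selects a Service by label. The name `example-app`, the label value, and the `metrics` port name are hypothetical; whether an additional label (such as `release`) is needed depends on how `serviceMonitorSelector` is configured in this cluster.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: example-app                          # hypothetical name
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: example-app    # hypothetical label on the target Service
+  endpoints:
+    - port: metrics                          # must match a named port on the Service
+      path: /metrics
+      interval: 30s
+```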
+
+### Logs & Traces: ClickStack
+
+**Why ClickStack over Loki/Tempo:**
+- Single storage backend (ClickHouse) for both logs and traces
+- Excellent query performance on large datasets
+- Built-in correlation between logs and traces
+- Lower resource overhead than separate Loki + Tempo
+
+**Configuration:**
+- The cluster-wide OTEL Collector receives all telemetry over OTLP
+- It forwards that telemetry to ClickStack's own OTEL collector over OTLP/HTTP
+- Grafana datasources query ClickHouse for logs and traces
+
+### Telemetry Collection: OpenTelemetry
+
+**OpenTelemetry Operator:** Manages auto-instrumentation
+
+```yaml
+apiVersion: opentelemetry.io/v1alpha1
+kind: Instrumentation
+metadata:
+  name: auto-instrumentation
+spec:
+  python:
+    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
+  nodejs:
+    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
+```
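+
+A workload would typically opt in to auto-instrumentation through a pod annotation that the operator's webhook acts on. The sketch below assumes a Python Deployment named `example-api` (hypothetical, as is its image) running in the same namespace as the `auto-instrumentation` resource above.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: example-api                          # hypothetical workload
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: example-api
+  template:
+    metadata:
+      labels:
+        app: example-api
+      annotations:
+        # Asks the operator to inject the Python agent; "true" picks the
+        # Instrumentation resource in the pod's own namespace.
+        instrumentation.opentelemetry.io/inject-python: "true"
+    spec:
+      containers:
+        - name: api
+          image: ghcr.io/example/example-api:latest   # hypothetical image
+```
+
+Node.js workloads would use `instrumentation.opentelemetry.io/inject-nodejs` in the same way.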
+
+**OpenTelemetry Collector:** Central routing
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+      http:
+        endpoint: 0.0.0.0:4318
+
+exporters:
+  otlphttp:
+    endpoint: http://clickstack-otel-collector:4318
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      exporters: [otlphttp]
+    metrics:
+      receivers: [otlp]
+      exporters: [otlphttp]
+    logs:
+      receivers: [otlp]
+      exporters: [otlphttp]
+```
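+
+The pipelines above export all three signals to ClickStack. If OTLP metrics should also reach Prometheus, as the architecture diagram suggests, one possible approach is the `prometheusremotewrite` exporter from the Collector contrib distribution. This is a sketch rather than the deployed configuration: it assumes the contrib image is in use, that Prometheus is reachable at `prometheus-operated:9090`, and that the remote-write receiver has been enabled (`enableRemoteWriteReceiver: true` under `prometheusSpec`). The fragment is meant to be merged into the configuration above, not used on its own.
+
+```yaml
+exporters:
+  prometheusremotewrite:
+    # Prometheus remote-write endpoint; only accepts writes when the
+    # receiver is enabled on the Prometheus side.
+    endpoint: http://prometheus-operated:9090/api/v1/write
+
+service:
+  pipelines:
+    metrics:
+      receivers: [otlp]
+      exporters: [otlphttp, prometheusremotewrite]
+```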
+
+### Visualization: Grafana
+
+**Grafana Operator:** Manages dashboards and datasources as CRDs
+
+```yaml
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDashboard
+metadata:
+  name: kubernetes-nodes
+spec:
+  instanceSelector:
+    matchLabels:
+      grafana.internal/instance: grafana
+  url: https://grafana.com/api/dashboards/15758/revisions/44/download
+```
+
+**Datasources:**
+
+| Type | Source | Purpose |
+|------|--------|---------|
+| Prometheus | prometheus-operated:9090 | Metrics |
+| ClickHouse | clickstack:8123 | Logs & Traces |
+| Alertmanager | alertmanager-operated:9093 | Alert status |
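+
+The Prometheus row above could be declared through the Grafana Operator roughly as follows. This is a sketch: it reuses the `grafana.internal/instance: grafana` selector from the dashboard example, and the resource name and `isDefault` setting are assumptions.
+
+```yaml
+apiVersion: grafana.integreatly.org/v1beta1
+kind: GrafanaDatasource
+metadata:
+  name: prometheus                           # hypothetical resource name
+spec:
+  instanceSelector:
+    matchLabels:
+      grafana.internal/instance: grafana
+  datasource:
+    name: Prometheus
+    type: prometheus
+    access: proxy                            # Grafana backend proxies the queries
+    url: http://prometheus-operated:9090
+    isDefault: true                          # assumption: Prometheus is the default source
+```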
+
+### Alerting Pipeline
+
+```
+Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
+                                      └─→ Email (future)
+```
+
+**Alert Categories:**
+- Infrastructure: Node down, disk full, OOM
+- Application: Error rate, latency SLO breach
+- Security: Gatekeeper violations, vulnerability findings
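+
+An infrastructure alert such as "node down" might be expressed as a PrometheusRule along these lines. The rule name, the `job="node-exporter"` label, the 5-minute window, and the severity value are illustrative assumptions, not the rules deployed in this cluster.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: node-alerts                          # hypothetical name
+spec:
+  groups:
+    - name: infrastructure
+      rules:
+        - alert: NodeDown
+          # Fires when the node exporter target has failed scrapes for 5 minutes.
+          expr: up{job="node-exporter"} == 0
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "Node exporter on {{ $labels.instance }} is unreachable"
+```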
+
+## Dashboards
+
+| Dashboard | Source | Purpose |
+|-----------|--------|---------|
+| Kubernetes Global | Grafana #15757 | Cluster overview |
+| Node Exporter | Grafana #1860 | Node metrics |
+| CNPG PostgreSQL | CNPG | Database health |
+| Flux | Flux Operator | GitOps status |
+| Cilium | Cilium | Network metrics |
+| Envoy Gateway | Envoy | Ingress metrics |
+
+## Resource Allocation
+
+| Component | CPU Request | Memory Limit |
+|-----------|-------------|--------------|
+| Prometheus | 100m | 2Gi |
+| OTEL Collector | 100m | 512Mi |
+| ClickStack | 500m | 2Gi |
+| Grafana | 100m | 256Mi |
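+
+For illustration, the Prometheus row could map onto kube-prometheus-stack values as sketched below. The table lists only a CPU request and a memory limit, so this fragment leaves CPU limits and memory requests unset; that reading of the table is an assumption.
+
+```yaml
+prometheus:
+  prometheusSpec:
+    resources:
+      requests:
+        cpu: 100m      # CPU request from the table
+      limits:
+        memory: 2Gi    # memory limit from the table
+```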
+
+## Future Enhancements
+
+1. **Continuous Profiling** - Pyroscope for Go/Python profiling
+2. **SLO Tracking** - Sloth for SLI/SLO automation
+3. **Synthetic Monitoring** - Gatus for endpoint probing
+4. **Cost Attribution** - OpenCost for resource cost tracking
+
+## References
+
+* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
+* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
+* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
+* [Grafana Operator](https://grafana.github.io/grafana-operator/)