docs: add ADRs 0025-0028 for infrastructure patterns
- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)
239
decisions/0025-observability-stack.md
Normal file
@@ -0,0 +1,239 @@
# Observability Stack Architecture

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab

## Context and Problem Statement

A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.

How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?

## Decision Drivers

* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled

## Considered Options

1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**

## Decision Outcome

Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**

Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.

### Positive Consequences

* Prometheus ecosystem is mature with extensive service monitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)

### Negative Consequences

* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain

## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                            Applications                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │            │
│  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │            │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
        │             │             │             │
        └─────────────┴──────┬──────┴─────────────┘
                             │ OTLP (gRPC/HTTP)
                             ▼
                 ┌────────────────────────┐
                 │     OpenTelemetry      │
                 │       Collector        │
                 │   (traces, metrics,    │
                 │        logs)           │
                 └───────────┬────────────┘
                             │
           ┌─────────────────┼────────────────┐
           │                 │                │
           ▼                 ▼                ▼
  ┌─────────────────┐  ┌───────────┐  ┌───────────────┐
  │   ClickStack    │  │Prometheus │  │    Grafana    │
  │  (ClickHouse)   │  │           │  │               │
  │  ┌───────────┐  │  │  Metrics  │  │  Dashboards   │
  │  │  Traces   │  │  │  Storage  │  │  Alerting     │
  │  ├───────────┤  │  │           │  │  Exploration  │
  │  │   Logs    │  │  └───────────┘  │               │
  │  └───────────┘  │                 └───────┬───────┘
  └─────────────────┘                         │
                                              │
                         ┌────────────────────┤
                         │                    │
                  ┌──────▼──────┐       ┌─────▼─────┐
                  │Alertmanager │       │   ntfy    │
                  │             │       │  (push)   │
                  └─────────────┘       └───────────┘
```

## Component Details

### Metrics: Prometheus + kube-prometheus-stack

**Deployment:** HelmRelease via Flux

```yaml
prometheus:
  prometheusSpec:
    retention: 14d            # time-based retention
    retentionSize: 50GB       # size-based retention cap
    storageSpec:              # chart values key for the Prometheus volume
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```
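
For orientation, the values above would normally sit under `spec.values` of a Flux HelmRelease. The following is a minimal sketch only: the `observability` namespace, the `prometheus-community` HelmRepository name, and the reconcile interval are assumptions, not taken from this ADR, and the `helm.toolkit.fluxcd.io/v2` API version assumes a recent Flux release.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: observability          # assumed namespace
spec:
  interval: 30m                     # assumed reconcile interval
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community  # assumed HelmRepository name
        namespace: flux-system
  values:
    prometheus:
      prometheusSpec:
        retention: 14d
        retentionSize: 50GB
```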

**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with a 50 GB size limit
- PromPP image for enhanced performance
- Alertmanager for routing alerts
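
To illustrate ServiceMonitor auto-discovery, a workload could expose its metrics with a resource like the sketch below. The app name, label, and port name are hypothetical. Note that kube-prometheus-stack by default only selects ServiceMonitors carrying the Helm release label unless `serviceMonitorSelectorNilUsesHelmValues` is set to `false`, so the `release` label here is an assumption about how this cluster is configured.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                     # hypothetical workload
  labels:
    release: kube-prometheus-stack      # assumed release label for selection
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics                     # named port on the Service
      path: /metrics
      interval: 30s
```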

### Logs & Traces: ClickStack

**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo

**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying

### Telemetry Collection: OpenTelemetry

**OpenTelemetry Operator:** Manages auto-instrumentation

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
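
Workloads opt in to auto-instrumentation through a pod annotation that the operator watches. A minimal sketch, assuming a hypothetical Python Deployment in the same namespace as the `Instrumentation` resource (the app name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-python-app              # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-python-app
  template:
    metadata:
      labels:
        app: example-python-app
      annotations:
        # "true" picks the Instrumentation resource in the pod's namespace
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/example-python-app:latest  # placeholder image
```

For Node.js workloads the corresponding annotation is `instrumentation.opentelemetry.io/inject-nodejs`.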

**OpenTelemetry Collector:** Central routing

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
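
Given the resource-efficiency driver and the Collector's 512Mi memory limit, one possible refinement is to add the `memory_limiter` and `batch` processors to each pipeline. This is a sketch, not part of the decided configuration: the thresholds are illustrative assumptions, and the fragment is meant to be merged with the receivers and exporters above.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400          # assumed value, kept below the 512Mi container limit
    spike_limit_mib: 100    # assumed headroom for short bursts
  batch:
    timeout: 5s             # flush at least every 5 seconds
    send_batch_size: 1024   # or once 1024 items are queued

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter runs first
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```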

### Visualization: Grafana

**Grafana Operator:** Manages dashboards and datasources as CRDs

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```

**Datasources:**
| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
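
Each row in the table can be declared as a `GrafanaDatasource` resource managed by the Grafana Operator. A minimal sketch for the Prometheus row, reusing the instance selector from the dashboard example; the resource name, `access` mode, and `isDefault` flag are assumptions:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus                      # assumed resource name
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy                       # Grafana backend proxies the queries
    url: http://prometheus-operated:9090
    isDefault: true                     # assumption: metrics as default source
```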

### Alerting Pipeline

```
Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                                └─→ Email (future)
```
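
The Alertmanager to ntfy hop could be configured through the chart's `alertmanager.config` values, roughly as sketched below. The webhook URL is hypothetical: Alertmanager posts its own JSON payload, so a small bridge service in front of ntfy is assumed here to translate it into notifications. The grouping and repeat intervals are illustrative.

```yaml
alertmanager:
  config:
    route:
      receiver: ntfy                    # default receiver for all alerts
      group_by: [alertname, namespace]
      group_wait: 30s
      repeat_interval: 4h
    receivers:
      - name: ntfy
        webhook_configs:
          # hypothetical bridge endpoint that forwards to ntfy
          - url: http://ntfy-bridge.observability.svc:8080/hook
            send_resolved: true
```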

**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
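
As an example of the infrastructure category, a "node down" alert might be expressed as a `PrometheusRule` like the sketch below. The rule name, the `node-exporter` job label, the 5-minute window, and the severity label are assumptions rather than the rules actually deployed.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts                     # hypothetical rule name
  labels:
    release: kube-prometheus-stack      # assumed label so Prometheus selects it
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          # fires when a node-exporter target has been unreachable for 5 minutes
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node exporter on {{ $labels.instance }} is unreachable"
```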

## Dashboards

| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |

## Resource Allocation

| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
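
In Helm values, the Prometheus row of this table would translate to something like the following sketch. Only the CPU request and memory limit come from the table; leaving the CPU limit and memory request unset is an assumption.

```yaml
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 100m       # CPU request from the table
      limits:
        memory: 2Gi     # memory limit from the table
```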

## Future Enhancements

1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking

## References

* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)