

# Observability Stack Architecture
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
## Context and Problem Statement
A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
## Decision Drivers
* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled
## Considered Options
1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
## Decision Outcome
Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
Prometheus handles metrics through its mature ecosystem; ClickStack (built on ClickHouse) provides a single storage backend for both logs and traces with strong query performance; and the OpenTelemetry Collector routes all telemetry between them.
### Positive Consequences
* Prometheus ecosystem is mature with extensive service monitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)
### Negative Consequences
* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain
## Architecture
```
┌────────────────────────────────────────────────────────┐
│                     Applications                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐   │
│  │ Go Apps  │ │  Python  │ │ Node.js  │ │   Java   │   │
│  │  (OTEL)  │ │  (OTEL)  │ │  (OTEL)  │ │  (OTEL)  │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘   │
└───────┼────────────┼────────────┼────────────┼─────────┘
        └────────────┴─────┬──────┴────────────┘
                           │ OTLP (gRPC/HTTP)
                           ▼
               ┌────────────────────────┐
               │     OpenTelemetry      │
               │       Collector        │
               │  (traces, metrics,     │
               │   logs)                │
               └───────────┬────────────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
 ┌─────────────────┐ ┌───────────┐ ┌───────────────┐
 │   ClickStack    │ │Prometheus │ │    Grafana    │
 │  (ClickHouse)   │ │           │ │               │
 │  ┌───────────┐  │ │  Metrics  │ │  Dashboards   │
 │  │  Traces   │  │ │  Storage  │ │  Alerting     │
 │  ├───────────┤  │ │           │ │  Exploration  │
 │  │   Logs    │  │ └─────┬─────┘ └───────────────┘
 │  └───────────┘  │       │
 └─────────────────┘ ┌─────┴──────┐
                     ▼            ▼
              ┌────────────┐ ┌───────────┐
              │Alertmanager│ │   ntfy    │
              │            │ │  (push)   │
              └────────────┘ └───────────┘
```
## Component Details
### Metrics: Prometheus + kube-prometheus-stack
**Deployment:** HelmRelease via Flux
```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```
**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- AlertManager for routing alerts
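As a sketch of the auto-discovery pattern: the operator watches for ServiceMonitor resources and scrapes any Service matching their selectors. The app name, label, and port name below are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # placeholder workload name
  labels:
    release: kube-prometheus-stack   # must match the chart's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app       # matches the Service's labels
  endpoints:
    - port: metrics          # named port on the Service exposing /metrics
      interval: 30s
```

With `serviceMonitorSelectorNilUsesHelmValues: false` (or a matching release label), Prometheus picks these up without any changes to its own config.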
### Logs & Traces: ClickStack
**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo
**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying
### Telemetry Collection: OpenTelemetry
**OpenTelemetry Operator:** Manages auto-instrumentation
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
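Auto-instrumentation is opt-in per workload: the operator injects the language SDK only into pods annotated to request it. A minimal sketch (the Deployment name and labels are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api          # placeholder
spec:
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
      annotations:
        # Tells the OTEL Operator to inject the Python auto-instrumentation
        # from the Instrumentation CR above into this pod.
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: api
          image: example-api:latest   # placeholder image
```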
**OpenTelemetry Collector:** Central routing
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
### Visualization: Grafana
**Grafana Operator:** Manages dashboards and datasources as CRDs
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```
**Datasources:**
| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
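Datasources can be managed the same way as dashboards, via the operator's GrafanaDatasource CRD. A sketch for the Prometheus entry above (the instance label is assumed to match the Grafana CR in this cluster):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
```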
### Alerting Pipeline
```
Prometheus Rules → Alertmanager ─→ ntfy → Discord/Mobile
                               └─→ Email (future)
```
**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
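Alert rules are themselves GitOps-managed via the PrometheusRule CRD, which kube-prometheus-stack loads automatically. A minimal sketch for the "node down" case (rule name, expression, and timing are illustrative, not the cluster's actual rule):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts   # placeholder
  labels:
    release: kube-prometheus-stack   # must match the chart's ruleSelector
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          # Fires when a node-exporter target has been unreachable for 5 minutes.
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
```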
## Dashboards
| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |
## Resource Allocation
| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
## Future Enhancements
1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking
## References
* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)