Observability Stack Architecture
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
Context and Problem Statement
A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
Decision Drivers
- Three pillars coverage - metrics, logs, and traces all addressed
- Unified visualization - single pane of glass for all telemetry
- Resource efficiency - don't overwhelm the cluster with observability overhead
- OpenTelemetry compatibility - future-proof instrumentation standard
- GitOps deployment - all configuration version-controlled
Considered Options
- Prometheus + ClickStack + OpenTelemetry Collector
- Prometheus + Loki + Tempo (PLT Stack)
- Datadog/New Relic (SaaS)
- ELK Stack (Elasticsearch, Logstash, Kibana)
Decision Outcome
Chosen option: Option 1 - Prometheus + ClickStack + OpenTelemetry Collector
Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides unified logs and traces storage with excellent performance, and OpenTelemetry Collector routes all telemetry data.
Positive Consequences
- Prometheus ecosystem is mature, with extensive ServiceMonitor support
- ClickHouse provides fast querying for logs and traces at scale
- OpenTelemetry is vendor-neutral and industry standard
- Grafana provides unified dashboards for all data sources
- Cost-effective (no SaaS fees)
Negative Consequences
- More complex than pure SaaS solutions
- ClickHouse requires storage management
- Multiple components to maintain
Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                            Applications                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │ Go Apps  │  │  Python  │  │ Node.js  │  │   Java   │            │
│  │  (OTEL)  │  │  (OTEL)  │  │  (OTEL)  │  │  (OTEL)  │            │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
        │             │             │             │
        └─────────────┴──────┬──────┴─────────────┘
                             │ OTLP (gRPC/HTTP)
                             ▼
                 ┌────────────────────────┐
                 │     OpenTelemetry      │
                 │       Collector        │
                 │   (traces, metrics,    │
                 │         logs)          │
                 └───────────┬────────────┘
                             │
            ┌────────────────┼───────────────┐
            │                │               │
            ▼                ▼               ▼
   ┌─────────────────┐ ┌───────────┐ ┌───────────────┐
   │   ClickStack    │ │Prometheus │ │    Grafana    │
   │  (ClickHouse)   │ │           │ │               │
   │  ┌───────────┐  │ │  Metrics  │ │  Dashboards   │
   │  │  Traces   │  │ │  Storage  │ │   Alerting    │
   │  ├───────────┤  │ │           │ │  Exploration  │
   │  │   Logs    │  │ └───────────┘ │               │
   │  └───────────┘  │               └───────────────┘
   └─────────────────┘                       │
                                             │
                        ┌────────────────────┤
                        │                    │
                  ┌─────▼──────┐       ┌─────▼─────┐
                  │Alertmanager│       │   ntfy    │
                  │            │       │  (push)   │
                  └────────────┘       └───────────┘
```
Component Details
Metrics: Prometheus + kube-prometheus-stack
Deployment: HelmRelease via Flux
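```yaml
# kube-prometheus-stack Helm values (Prometheus section)
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    # The chart expects the PVC template under storageSpec, with the
    # size given as a resource request.
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```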
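For context, a minimal sketch of how a Flux HelmRelease might wrap these values. The namespace, the interval, and the HelmRepository name `prometheus-community` are assumptions for illustration, not taken from this repository.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring              # assumed namespace
spec:
  interval: 1h                       # assumed reconcile interval
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community   # assumed HelmRepository name
        namespace: flux-system       # assumed
  values:
    prometheus:
      prometheusSpec:
        retention: 14d
        retentionSize: 50GB
```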
Key Features:
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- Alertmanager for routing alerts
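To illustrate the ServiceMonitor discovery mentioned above, here is a hedged sketch of a ServiceMonitor for a hypothetical workload. The names `example-app` and `example`, the port name `metrics`, and the scrape interval are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                  # hypothetical workload
  namespace: example                 # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics                  # must match a named port on the Service
      path: /metrics
      interval: 30s
```

By default kube-prometheus-stack only selects ServiceMonitors that carry its release label. Discovery across all workloads usually relies on setting `serviceMonitorSelectorNilUsesHelmValues: false` under `prometheus.prometheusSpec`; whether this cluster does so is not stated here.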
Logs & Traces: ClickStack
Why ClickStack over Loki/Tempo:
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo
Configuration:
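- The cluster-wide OTEL Collector receives all telemetry over OTLP
- It forwards that telemetry to ClickStack's own OTEL collector, which writes to ClickHouse
- Grafana queries the stored logs and traces through a ClickHouse datasource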
Telemetry Collection: OpenTelemetry
OpenTelemetry Operator: Manages auto-instrumentation
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
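Workloads opt in to auto-instrumentation through a pod annotation that the operator's admission webhook watches for. A minimal sketch for a hypothetical Python Deployment follows; the Deployment name and image are placeholders, and the value `"true"` assumes a single Instrumentation resource in the same namespace.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-python-app           # hypothetical workload
spec:
  selector:
    matchLabels:
      app: example-python-app
  template:
    metadata:
      labels:
        app: example-python-app
      annotations:
        # Ask the operator to inject the Python auto-instrumentation agent
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/example-python-app:latest   # placeholder image
```

Node.js workloads would use `instrumentation.opentelemetry.io/inject-nodejs` in the same way.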
OpenTelemetry Collector: Central routing
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
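The pipelines above send OTLP metrics to ClickStack, while the architecture diagram shows Prometheus as the metrics store. If OTLP metrics are meant to land in Prometheus, one possible sketch uses the `prometheusremotewrite` exporter. It assumes the Collector contrib distribution, a Prometheus with its remote-write receiver enabled (for example `enableRemoteWriteReceiver: true` under `prometheusSpec`), and an endpoint derived from the `prometheus-operated:9090` service listed in the datasources table.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
  prometheusremotewrite:
    # Assumed endpoint; requires the remote-write receiver on Prometheus
    endpoint: http://prometheus-operated:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```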
Visualization: Grafana
Grafana Operator: Manages dashboards and datasources as custom resources
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```
Datasources:
| Type | Source | Purpose |
|---|---|---|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
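As a sketch of how the first row of this table could be declared through the Grafana Operator, the following GrafanaDatasource reuses the instance selector from the dashboard example. The resource name and the `isDefault` setting are assumptions.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus                   # assumed resource name
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true                  # assumption: Prometheus as default datasource
```

The ClickHouse row would follow the same shape but also needs the ClickHouse datasource plugin installed on the Grafana instance.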
Alerting Pipeline
```
Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                        └─→ Email (future)
```
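One way the Alertmanager to ntfy hop could be wired is a webhook receiver declared as an AlertmanagerConfig. This is a sketch under assumptions: the bridge URL is hypothetical (Alertmanager webhook payloads typically need a small adapter in front of ntfy), and the namespace and grouping intervals are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: ntfy
  namespace: monitoring              # assumed namespace
spec:
  route:
    receiver: ntfy
    groupBy: ["alertname", "namespace"]
    groupWait: 30s
    repeatInterval: 4h
  receivers:
    - name: ntfy
      webhookConfigs:
        # Hypothetical adapter that translates Alertmanager webhooks to ntfy messages
        - url: http://ntfy-bridge.monitoring.svc:8080/hook
          sendResolved: true
```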
Alert Categories:
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
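To make the infrastructure category concrete, here is a hedged PrometheusRule sketch covering node-down and disk-full alerts. The rule names, thresholds, `for` durations, and the `job="node-exporter"` label are illustrative assumptions rather than the rules deployed in this cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts        # assumed name
  namespace: monitoring              # assumed namespace
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node exporter on {{ $labels.instance }} has been unreachable for 5 minutes"
        - alert: DiskAlmostFull
          expr: |
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% space left"
```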
Dashboards
| Dashboard | Source | Purpose |
|---|---|---|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |
Resource Allocation
| Component | CPU Request | Memory Limit |
|---|---|---|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
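As an example of how a row of this table maps to configuration, the Prometheus figures could be expressed in the kube-prometheus-stack values roughly as follows. This is a sketch; only the two numbers come from the table.

```yaml
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 100m        # CPU request from the table
      limits:
        memory: 2Gi      # memory limit from the table
```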
Future Enhancements
- Continuous Profiling - Pyroscope for Go/Python profiling
- SLO Tracking - Sloth for SLI/SLO automation
- Synthetic Monitoring - Gatus for endpoint probing
- Cost Attribution - OpenCost for resource cost tracking