homelab-design/decisions/0025-observability-stack.md

Observability Stack Architecture

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab

Context and Problem Statement

A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.

How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?

Decision Drivers

  • Three pillars coverage - metrics, logs, and traces all addressed
  • Unified visualization - single pane of glass for all telemetry
  • Resource efficiency - don't overwhelm the cluster with observability overhead
  • OpenTelemetry compatibility - future-proof instrumentation standard
  • GitOps deployment - all configuration version-controlled

Considered Options

  1. Prometheus + ClickStack + OpenTelemetry Collector
  2. Prometheus + Loki + Tempo (PLT Stack)
  3. Datadog/New Relic (SaaS)
  4. ELK Stack (Elasticsearch, Logstash, Kibana)

Decision Outcome

Chosen option: Option 1 - Prometheus + ClickStack + OpenTelemetry Collector

Prometheus handles metrics and brings a mature ecosystem; ClickStack (built on ClickHouse) stores logs and traces in a single backend with fast queries; the OpenTelemetry Collector receives and routes all telemetry.

Positive Consequences

  • The Prometheus ecosystem is mature, with broad ServiceMonitor support across charts and operators
  • ClickHouse provides fast querying for logs and traces at scale
  • OpenTelemetry is a vendor-neutral industry standard
  • Grafana provides unified dashboards for all data sources
  • Cost-effective (no SaaS fees)

Negative Consequences

  • More complex than pure SaaS solutions
  • ClickHouse requires storage management
  • Multiple components to maintain

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Applications                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │            │
│  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │            │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
        │             │             │             │
        └──────────────────┬────────────────────────┘
                           │ OTLP (gRPC/HTTP)
                           ▼
              ┌────────────────────────┐
              │  OpenTelemetry         │
              │  Collector             │
              │  (traces, metrics,     │
              │   logs)                │
              └───────────┬────────────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────┐
│   ClickStack    │ │Prometheus │ │   Grafana     │
│   (ClickHouse)  │ │           │ │               │
│  ┌───────────┐  │ │ Metrics   │ │  Dashboards   │
│  │  Traces   │  │ │ Storage   │ │  Alerting     │
│  ├───────────┤  │ │           │ │  Exploration  │
│  │   Logs    │  │ └───────────┘ │               │
│  └───────────┘  │               └───────────────┘
└─────────────────┘                      │
                                         │
                    ┌────────────────────┤
                    │                    │
              ┌─────▼──────┐       ┌─────▼─────┐
              │Alertmanager│       │   ntfy    │
              │            │       │ (push)    │
              └────────────┘       └───────────┘

Component Details

Metrics: Prometheus + kube-prometheus-stack

Deployment: HelmRelease via Flux

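prometheus:
  prometheusSpec:
    retention: 14d
    # Size cap equals the volume size; leaving headroom for the WAL is advisable
    retentionSize: 50GB
    # kube-prometheus-stack values use storageSpec (not storage)
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi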
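
For context, values like these would sit under `spec.values` of a Flux HelmRelease. The following is a minimal sketch only: the `observability` namespace, the `prometheus-community` HelmRepository, and the reconcile interval are assumptions, since the ADR does not name them.

```yaml
# Sketch of a Flux HelmRelease wrapping the chart values (names are assumed)
apiVersion: helm.toolkit.fluxcd.io/v2   # older Flux releases use v2beta2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: observability              # assumed namespace
spec:
  interval: 30m                         # assumed reconcile interval
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community      # assumed; must exist as a HelmRepository
        namespace: flux-system          # assumed
  values:
    prometheus:
      prometheusSpec:
        retention: 14d
        retentionSize: 50GB
```

Pinning `spec.chart.spec.version` would keep upgrades explicit in Git; it is left out here because the ADR does not state a chart version.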

Key Features:

  • ServiceMonitor auto-discovery for all workloads
  • 14-day retention, capped at 50GB on disk (whichever limit is reached first)
  • PromPP image for enhanced performance
  • Alertmanager for routing alerts
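
To illustrate the ServiceMonitor auto-discovery mentioned above, a workload could be registered as follows. This is a sketch under assumptions: the `example-app` name, its labels, and the `metrics` port name are hypothetical.

```yaml
# Sketch: ServiceMonitor for a hypothetical workload exposing /metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                 # hypothetical
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match the Service's labels
  endpoints:
    - port: metrics                 # name of the Service port, not a number
      path: /metrics
      interval: 30s
```

By default the chart only selects ServiceMonitors that carry its release label. Setting `prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues: false` is the usual way to have Prometheus pick up ServiceMonitors from all workloads.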

Logs & Traces: ClickStack

Why ClickStack over Loki/Tempo:

  • Single storage backend (ClickHouse) for both logs and traces
  • Excellent query performance on large datasets
  • Built-in correlation between logs and traces
  • Lower resource overhead than separate Loki + Tempo

Configuration:

  • The central OTEL Collector receives all telemetry over OTLP
  • It forwards that data to ClickStack's own OTEL collector, which writes to ClickHouse
  • Grafana queries ClickHouse through a datasource

Telemetry Collection: OpenTelemetry

OpenTelemetry Operator: Manages auto-instrumentation

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs

OpenTelemetry Collector: Central routing

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
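
The configuration above sends all three signals, including metrics, to ClickStack, while the architecture diagram shows the Collector feeding Prometheus as well. If OTLP metrics should land in Prometheus, one option is sketched below. It is not the configuration in use: it assumes the contrib distribution of the Collector (for `prometheusremotewrite`) and that Prometheus has its remote-write receiver enabled (`enableRemoteWriteReceiver: true` in `prometheusSpec`). The `memory_limiter` and `batch` processors and their values are likewise assumptions, added to bound the Collector's resource use.

```yaml
# Sketch: merges with the receivers section shown above (values are assumptions)
processors:
  memory_limiter:              # should run first in every pipeline
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
  prometheusremotewrite:       # contrib exporter; needs the remote-write receiver
    endpoint: http://prometheus-operated:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```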

Visualization: Grafana

Grafana Operator: Manages dashboards and datasources as CRDs

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download

Datasources:

| Type         | Source                     | Purpose       |
|--------------|----------------------------|---------------|
| Prometheus   | prometheus-operated:9090   | Metrics       |
| ClickHouse   | clickstack:8123            | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status  |
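
Since the Grafana Operator manages datasources as CRDs, the Prometheus entry listed above could be declared roughly as follows. This is a sketch: the resource name and `isDefault` are assumptions, and the instance selector reuses the label from the dashboard example.

```yaml
# Sketch: Prometheus datasource as a Grafana Operator resource
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus                   # assumed name
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy                    # Grafana's backend queries Prometheus
    url: http://prometheus-operated:9090
    isDefault: true                  # assumption
```

The ClickHouse and Alertmanager entries would follow the same shape with their own `type` and `url`; the ClickHouse datasource additionally requires its plugin to be installed in Grafana.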

Alerting Pipeline

Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                                      └─→ Email (future)
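
The Alertmanager → ntfy hop could be expressed as a webhook receiver. The sketch below is illustrative only: the bridge URL is hypothetical (it assumes a small service that translates Alertmanager's webhook payload into ntfy messages), and the grouping and repeat intervals are assumed values.

```yaml
# Sketch: Alertmanager routing to a hypothetical ntfy bridge
route:
  receiver: ntfy
  group_by: [alertname, namespace]
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: ntfy
      repeat_interval: 1h            # re-notify critical alerts more often
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy-bridge.observability.svc:8080/hook   # hypothetical endpoint
        send_resolved: true
```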

Alert Categories:

  • Infrastructure: Node down, disk full, OOM
  • Application: Error rate, latency SLO breach
  • Security: Gatekeeper violations, vulnerability findings
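
As an illustration of the infrastructure category, two rules might look like this. It is a sketch: the thresholds, durations, and severities are assumptions, and `job="node-exporter"` assumes the chart's default job label.

```yaml
# Sketch: example infrastructure alerts (thresholds are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts        # assumed name
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node exporter on {{ $labels.instance }} is unreachable"
        - alert: DiskAlmostFull
          expr: |
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Less than 10% free on {{ $labels.mountpoint }} ({{ $labels.instance }})"
```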

Dashboards

| Dashboard         | Source         | Purpose          |
|-------------------|----------------|------------------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter     | Grafana #1860  | Node metrics     |
| CNPG PostgreSQL   | CNPG           | Database health  |
| Flux              | Flux Operator  | GitOps status    |
| Cilium            | Cilium         | Network metrics  |
| Envoy Gateway     | Envoy          | Ingress metrics  |

Resource Allocation

| Component      | CPU Request | Memory Limit |
|----------------|-------------|--------------|
| Prometheus     | 100m        | 2Gi          |
| OTEL Collector | 100m        | 512Mi        |
| ClickStack     | 500m        | 2Gi          |
| Grafana        | 100m        | 256Mi        |
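
In chart values, the Prometheus row above would translate roughly to the following sketch. Only the two figures from the table are set; everything else is left to defaults.

```yaml
# Sketch: Prometheus row expressed as kube-prometheus-stack values
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 100m        # CPU request from the table
      limits:
        memory: 2Gi      # memory limit from the table
```

Because no memory request is given, Kubernetes defaults it to the limit, so the scheduler reserves the full 2Gi for Prometheus.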

Future Enhancements

  1. Continuous Profiling - Pyroscope for Go/Python profiling
  2. SLO Tracking - Sloth for SLI/SLO automation
  3. Synthetic Monitoring - Gatus for endpoint probing
  4. Cost Attribution - OpenCost for resource cost tracking

References