Billy D. b43c80153c docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)

2026-02-04 08:55:15 -05:00

8.8 KiB


Observability Stack Architecture

Status: accepted
Date: 2026-02-04
Deciders: Billy
Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab

Context and Problem Statement

A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.

How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?

Decision Drivers

Three pillars coverage - metrics, logs, and traces all addressed
Unified visualization - single pane of glass for all telemetry
Resource efficiency - don't overwhelm the cluster with observability overhead
OpenTelemetry compatibility - future-proof instrumentation standard
GitOps deployment - all configuration version-controlled

Considered Options

Prometheus + ClickStack + OpenTelemetry Collector
Prometheus + Loki + Tempo (PLT Stack)
Datadog/New Relic (SaaS)
ELK Stack (Elasticsearch, Logstash, Kibana)

Decision Outcome

Chosen option: Option 1 - Prometheus + ClickStack + OpenTelemetry Collector

Prometheus handles metrics and brings a mature ecosystem. ClickStack (ClickHouse-based) stores logs and traces in a single backend with fast queries. The OpenTelemetry Collector routes all telemetry data.

Positive Consequences

Prometheus ecosystem is mature with extensive service monitor support
ClickHouse provides fast querying for logs and traces at scale
OpenTelemetry is vendor-neutral and industry standard
Grafana provides unified dashboards for all data sources
Cost-effective (no SaaS fees)

Negative Consequences

More complex than pure SaaS solutions
ClickHouse requires storage management
Multiple components to maintain

Architecture

┌────────────────────────────────────────────────────────────────────┐
│                            Applications                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │ Go Apps  │  │ Python   │  │ Node.js  │  │ Java     │            │
│  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │  │ (OTEL)   │            │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
        │             │             │             │
        └─────────────┴────┬────────┴─────────────┘
                           │ OTLP (gRPC/HTTP)
                           ▼
              ┌────────────────────────┐
              │  OpenTelemetry         │
              │  Collector             │
              │  (traces, metrics,     │
              │   logs)                │
              └───────────┬────────────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────┐
│   ClickStack    │ │Prometheus │ │   Grafana     │
│   (ClickHouse)  │ │           │ │               │
│  ┌───────────┐  │ │ Metrics   │ │  Dashboards   │
│  │  Traces   │  │ │ Storage   │ │  Alerting     │
│  ├───────────┤  │ │           │ │  Exploration  │
│  │   Logs    │  │ └───────────┘ │               │
│  └───────────┘  │               └───────────────┘
└─────────────────┘                      │
                                         │
                    ┌────────────────────┤
                    │                    │
             ┌──────▼──────┐       ┌─────▼─────┐
             │Alertmanager │       │   ntfy    │
             │             │       │ (push)    │
             └─────────────┘       └───────────┘

Component Details

Metrics: Prometheus + kube-prometheus-stack

Deployment: HelmRelease via Flux

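prometheus:
  prometheusSpec:
    retention: 14d          # time-based retention
    retentionSize: 50GB     # size-based cap; whichever limit is reached first applies
    storageSpec:            # chart values key; rendered into the Prometheus CR's storage field
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi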
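
Because the chart is deployed as a Flux HelmRelease, these values would sit under spec.values. The following is a minimal sketch rather than the actual manifest: the monitoring namespace, the HelmRepository name prometheus-community, and the reconcile interval are assumptions, not taken from this ADR. Older Flux installations may need the v2beta2 API version instead of v2.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring              # assumed namespace
spec:
  interval: 30m                      # assumed reconcile interval
  chart:
    spec:
      chart: kube-prometheus-stack
      # version: pin a chart version here for reproducible upgrades
      sourceRef:
        kind: HelmRepository
        name: prometheus-community   # assumed HelmRepository name
        namespace: flux-system       # assumed location of the HelmRepository
  values:
    prometheus:
      prometheusSpec:
        retention: 14d
        retentionSize: 50GB
```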

Key Features:

ServiceMonitor auto-discovery for all workloads
14-day retention with 50GB limit
PromPP image for enhanced performance
Alertmanager for routing alerts
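
To illustrate the ServiceMonitor auto-discovery listed above, a workload could expose its metrics roughly as follows. This is a sketch: the workload name, namespace, labels, and port name are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                  # hypothetical workload
  namespace: example                 # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match the labels on the Service
  endpoints:
    - port: metrics                  # name of the Service port that serves metrics
      path: /metrics
      interval: 30s
```

Note that kube-prometheus-stack by default only selects ServiceMonitors carrying the Helm release label. Discovery across all workloads typically requires setting prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false; whether this cluster does so is not stated here.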

Logs & Traces: ClickStack

Why ClickStack over Loki/Tempo:

Single storage backend (ClickHouse) for both logs and traces
Excellent query performance on large datasets
Built-in correlation between logs and traces
Lower resource overhead than separate Loki + Tempo

Configuration:

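The cluster-wide OTEL Collector receives all telemetry from applications
It forwards that telemetry over OTLP/HTTP to ClickStack's own OTEL collector
Grafana queries the stored logs and traces through a ClickHouse datasource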

Telemetry Collection: OpenTelemetry

OpenTelemetry Operator: Manages auto-instrumentation

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs

OpenTelemetry Collector: Central routing

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
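
The pipelines above define no processors. Given the concern about observability overhead, one common hardening step is to add the memory_limiter and batch processors. The fragment below is a sketch to merge into the config above, not the deployed configuration; the limits are assumed values chosen to stay under the 512Mi memory limit listed under Resource Allocation.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400                   # assumed; keeps the Collector below a 512Mi container limit
    spike_limit_mib: 100             # assumed headroom for short bursts
  batch:
    timeout: 5s                      # flush at least every 5 seconds
    send_batch_size: 1024            # or once this many items are queued

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter runs first
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```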

Visualization: Grafana

Grafana Operator: Manages dashboards and datasources as CRDs

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download

Datasources:

Type	Source	Purpose
Prometheus	prometheus-operated:9090	Metrics
ClickHouse	clickstack:8123	Logs & Traces
Alertmanager	alertmanager-operated:9093	Alert status
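
Since the Grafana Operator manages datasources as CRDs, the Prometheus row of this table could be declared roughly as follows. This is a sketch: the resource name and the isDefault choice are assumptions, while the instance selector label reuses the one from the dashboard example above.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus                   # hypothetical resource name
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy                    # Grafana's backend proxies the queries
    url: http://prometheus-operated:9090
    isDefault: true                  # assumed
```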

Alerting Pipeline

Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                                      └─→ Email (future)
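
The Alertmanager to ntfy hop could be wired with a webhook receiver in the chart values. This ADR does not say how the Alertmanager payload is adapted for ntfy, so the URL below is a placeholder for whatever bridge or ntfy endpoint is used, and the grouping and timing values are assumptions.

```yaml
alertmanager:
  config:
    route:
      receiver: ntfy
      group_by: [alertname, namespace]
      group_wait: 30s                # assumed timings
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: ntfy
        webhook_configs:
          - url: http://ntfy-bridge.monitoring.svc:8080/alerts   # placeholder URL
            send_resolved: true
```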

Alert Categories:

Infrastructure: Node down, disk full, OOM
Application: Error rate, latency SLO breach
Security: Gatekeeper violations, vulnerability findings
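
As an illustration of the Infrastructure category, node-down and disk-full alerts might be expressed as a PrometheusRule like the one below. The rule names, thresholds, durations, and namespace are assumptions; the job label node-exporter is the kube-prometheus-stack default and may differ in this cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts        # hypothetical name
  namespace: monitoring              # assumed namespace
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node exporter on {{ $labels.instance }} is unreachable"
        - alert: DiskAlmostFull
          expr: |
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Less than 10% free on {{ $labels.mountpoint }} ({{ $labels.instance }})"
```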

Dashboards

Dashboard	Source	Purpose
Kubernetes Global	Grafana #15757	Cluster overview
Node Exporter	Grafana #1860	Node metrics
CNPG PostgreSQL	CNPG	Database health
Flux	Flux Operator	GitOps status
Cilium	Cilium	Network metrics
Envoy Gateway	Envoy	Ingress metrics

Resource Allocation

Component	CPU Request	Memory Limit
Prometheus	100m	2Gi
OTEL Collector	100m	512Mi
ClickStack	500m	2Gi
Grafana	100m	256Mi
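
In chart values, the Prometheus row of this table would translate to something like the following sketch. Only the CPU request and memory limit from the table are set; leaving out a CPU limit and a memory request is an assumption about intent, not something this ADR states.

```yaml
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 100m                    # CPU Request column
      limits:
        memory: 2Gi                  # Memory Limit column
```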

Future Enhancements

Continuous Profiling - Pyroscope for Go/Python profiling
SLO Tracking - Sloth for SLI/SLO automation
Synthetic Monitoring - Gatus for endpoint probing
Cost Attribution - OpenCost for resource cost tracking
