

# Observability Stack Architecture
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab
## Context and Problem Statement
A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.
How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?
## Decision Drivers
* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled
## Considered Options
1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**
## Decision Outcome
Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**
Prometheus handles metrics through its mature ecosystem; ClickStack (built on ClickHouse) provides a single storage backend for both logs and traces with strong query performance; and the OpenTelemetry Collector routes all telemetry between them.
### Positive Consequences
* Prometheus ecosystem is mature with extensive service monitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)
### Negative Consequences
* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain
## Architecture
```
┌────────────────────────────────────────────────────────┐
│                     Applications                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐   │
│  │ Go Apps  │ │  Python  │ │ Node.js  │ │   Java   │   │
│  │  (OTEL)  │ │  (OTEL)  │ │  (OTEL)  │ │  (OTEL)  │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘   │
└───────┼────────────┼────────────┼────────────┼─────────┘
        └────────────┴─────┬──────┴────────────┘
                           │ OTLP (gRPC/HTTP)
                           ▼
               ┌────────────────────────┐
               │     OpenTelemetry      │
               │       Collector        │
               │  (traces, metrics,     │
               │   logs)                │
               └───────────┬────────────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
 ┌─────────────────┐ ┌───────────┐ ┌───────────────┐
 │   ClickStack    │ │Prometheus │ │    Grafana    │
 │  (ClickHouse)   │ │           │ │               │
 │  ┌───────────┐  │ │  Metrics  │ │  Dashboards   │
 │  │  Traces   │  │ │  Storage  │ │  Alerting     │
 │  ├───────────┤  │ │           │ │  Exploration  │
 │  │   Logs    │  │ └─────┬─────┘ └───────────────┘
 │  └───────────┘  │       │
 └─────────────────┘ ┌─────┴──────┐
                     ▼            ▼
              ┌────────────┐ ┌───────────┐
              │Alertmanager│ │   ntfy    │
              │            │ │  (push)   │
              └────────────┘ └───────────┘
```
## Component Details
### Metrics: Prometheus + kube-prometheus-stack
**Deployment:** HelmRelease via Flux
```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```
**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- AlertManager for routing alerts
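As a sketch of the auto-discovery pattern: the operator watches for ServiceMonitor resources and scrapes any Service matching their selectors. The app name, label, and port name below are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # placeholder workload name
  labels:
    release: kube-prometheus-stack   # must match the chart's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app       # matches the Service's labels
  endpoints:
    - port: metrics          # named port on the Service exposing /metrics
      interval: 30s
```

With `serviceMonitorSelectorNilUsesHelmValues: false` (or a matching release label), Prometheus picks these up without any changes to its own config.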
### Logs & Traces: ClickStack
**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo
**Configuration:**
- OTEL Collector receives all telemetry
- Forwards to ClickStack's OTEL collector
- Grafana datasources for querying
### Telemetry Collection: OpenTelemetry
**OpenTelemetry Operator:** Manages auto-instrumentation
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```
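Auto-instrumentation is opt-in per workload: the operator injects the language SDK only into pods annotated to request it. A minimal sketch (the Deployment name and labels are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api          # placeholder
spec:
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
      annotations:
        # Tells the OTEL Operator to inject the Python auto-instrumentation
        # from the Instrumentation CR above into this pod.
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: api
          image: example-api:latest   # placeholder image
```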
**OpenTelemetry Collector:** Central routing
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```
### Visualization: Grafana
**Grafana Operator:** Manages dashboards and datasources as CRDs
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```
**Datasources:**
| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |
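Datasources can be managed the same way as dashboards, via the operator's GrafanaDatasource CRD. A sketch for the Prometheus entry above (the instance label is assumed to match the Grafana CR in this cluster):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
```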
### Alerting Pipeline
```
Prometheus Rules → Alertmanager ─→ ntfy → Discord/Mobile
                               └─→ Email (future)
```
**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings
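Alert rules are themselves GitOps-managed via the PrometheusRule CRD, which kube-prometheus-stack loads automatically. A minimal sketch for the "node down" case (rule name, expression, and timing are illustrative, not the cluster's actual rule):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts   # placeholder
  labels:
    release: kube-prometheus-stack   # must match the chart's ruleSelector
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          # Fires when a node-exporter target has been unreachable for 5 minutes.
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
```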
## Dashboards
| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |
## Resource Allocation
| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |
## Future Enhancements
1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking
## References
* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)