# Observability Stack Architecture

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish comprehensive observability for metrics, logs, and traces across the homelab

## Context and Problem Statement

A complex homelab with AI/ML workloads, multiple databases, and numerous services requires comprehensive observability to understand system behavior, debug issues, and optimize performance.

How do we build an observability stack that provides metrics, logs, and traces while remaining manageable for a single operator?

## Decision Drivers

* Three pillars coverage - metrics, logs, and traces all addressed
* Unified visualization - single pane of glass for all telemetry
* Resource efficiency - don't overwhelm the cluster with observability overhead
* OpenTelemetry compatibility - future-proof instrumentation standard
* GitOps deployment - all configuration version-controlled

## Considered Options

1. **Prometheus + ClickStack + OpenTelemetry Collector**
2. **Prometheus + Loki + Tempo (PLT Stack)**
3. **Datadog/New Relic (SaaS)**
4. **ELK Stack (Elasticsearch, Logstash, Kibana)**

## Decision Outcome

Chosen option: **Option 1 - Prometheus + ClickStack + OpenTelemetry Collector**

Prometheus handles metrics with its mature ecosystem, ClickStack (ClickHouse-based) provides a single storage backend for logs and traces with fast queries, and the OpenTelemetry Collector routes all telemetry data.

### Positive Consequences

* The Prometheus ecosystem is mature, with extensive ServiceMonitor support
* ClickHouse provides fast querying for logs and traces at scale
* OpenTelemetry is vendor-neutral and an industry standard
* Grafana provides unified dashboards for all data sources
* Cost-effective (no SaaS fees)

### Negative Consequences

* More complex than pure SaaS solutions
* ClickHouse requires storage management
* Multiple components to maintain

## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                            Applications                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │
│  │ Go Apps  │  │  Python  │  │ Node.js  │  │   Java   │            │
│  │  (OTEL)  │  │  (OTEL)  │  │  (OTEL)  │  │  (OTEL)  │            │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘            │
└───────┼─────────────┼─────────────┼─────────────┼──────────────────┘
        │             │             │             │
        └─────────────────┬───────────────────────┘
                          │ OTLP (gRPC/HTTP)
                          ▼
              ┌────────────────────────┐
              │     OpenTelemetry      │
              │       Collector        │
              │   (traces, metrics,    │
              │         logs)          │
              └───────────┬────────────┘
                          │
         ┌────────────────┼───────────────┐
         │                │               │
         ▼                ▼               ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────┐
│   ClickStack    │ │Prometheus │ │    Grafana    │
│  (ClickHouse)   │ │           │ │               │
│  ┌───────────┐  │ │  Metrics  │ │  Dashboards   │
│  │  Traces   │  │ │  Storage  │ │   Alerting    │
│  ├───────────┤  │ │           │ │  Exploration  │
│  │   Logs    │  │ └───────────┘ │               │
│  └───────────┘  │               └───────────────┘
└─────────────────┘                       │
                                          │
                     ┌────────────────────┤
                     │                    │
              ┌──────▼───────┐      ┌─────▼─────┐
              │ Alertmanager │      │   ntfy    │
              │              │      │  (push)   │
              └──────────────┘      └───────────┘
```

## Component Details

### Metrics: Prometheus + kube-prometheus-stack

**Deployment:** HelmRelease via Flux

```yaml
prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 50GB
    # kube-prometheus-stack exposes the PVC template under storageSpec
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 50Gi
```

**Key Features:**
- ServiceMonitor auto-discovery for all workloads
- 14-day retention with 50GB limit
- PromPP image for enhanced performance
- Alertmanager for routing alerts

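
ServiceMonitor auto-discovery can be illustrated with a minimal sketch. The application name `example-app`, the `monitoring` namespace, the `metrics` port name, and the `release: kube-prometheus-stack` label are assumptions for illustration; the label has to match whatever `serviceMonitorSelector` the Prometheus instance actually uses.

```yaml
# Hypothetical ServiceMonitor: names, namespaces, and labels are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed selector label
spec:
  namespaceSelector:
    matchNames:
      - example-app                  # namespace of the target Service (assumed)
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  endpoints:
    - port: metrics                  # named Service port that serves /metrics
      path: /metrics
      interval: 30s
```

Once a resource like this is committed, Flux applies it and Prometheus starts scraping the matching Service without any change to the Prometheus configuration itself.
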
### Logs & Traces: ClickStack

**Why ClickStack over Loki/Tempo:**
- Single storage backend (ClickHouse) for both logs and traces
- Excellent query performance on large datasets
- Built-in correlation between logs and traces
- Lower resource overhead than separate Loki + Tempo

**Configuration:**
- The cluster-wide OTEL Collector receives all telemetry
- It forwards that telemetry to ClickStack's own OTEL collector
- Grafana datasources query the stored data

### Telemetry Collection: OpenTelemetry

**OpenTelemetry Operator:** Manages auto-instrumentation

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
spec:
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs
```

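
Workloads opt in to auto-instrumentation through a pod annotation. The sketch below assumes a hypothetical Python Deployment named `example-python-app` running in the same namespace as the `Instrumentation` resource; the image reference is a placeholder.

```yaml
# Hypothetical Deployment; the annotation is the only part that matters here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-python-app
  template:
    metadata:
      labels:
        app: example-python-app
      annotations:
        # Ask the operator to inject the Python auto-instrumentation,
        # using the Instrumentation resource found in this namespace.
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
        - name: app
          image: registry.example.com/example-python-app:latest  # placeholder
```

For a Node.js workload the equivalent annotation is `instrumentation.opentelemetry.io/inject-nodejs`. The annotation belongs on the pod template, not on the Deployment's own metadata.
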
**OpenTelemetry Collector:** Central routing

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp:
    endpoint: http://clickstack-otel-collector:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```

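
The pipelines above have no processors. One possible addition, sketched here under the assumption that the collector should stay inside the 512Mi memory limit listed under Resource Allocation, is a `memory_limiter` followed by a `batch` processor. All numeric values are illustrative and are not taken from the deployed configuration.

```yaml
# Sketch only: limits and batch sizes are assumed values.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400          # assumed: headroom below the 512Mi container limit
    spike_limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 1024

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter runs first
      exporters: [otlphttp]
```

The same `processors` list could be added to the `metrics` and `logs` pipelines. Placing `memory_limiter` first lets the collector refuse data before it has been buffered by later stages.
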
### Visualization: Grafana

**Grafana Operator:** Manages dashboards and datasources as CRDs

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kubernetes-nodes
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana
  url: https://grafana.com/api/dashboards/15758/revisions/44/download
```

**Datasources:**

| Type | Source | Purpose |
|------|--------|---------|
| Prometheus | prometheus-operated:9090 | Metrics |
| ClickHouse | clickstack:8123 | Logs & Traces |
| Alertmanager | alertmanager-operated:9093 | Alert status |

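
A datasource from the table can be declared as a CRD in the same way as a dashboard. The following is a minimal sketch for the Prometheus entry; the resource name and `isDefault: true` are assumptions, and the URL may need a namespace suffix depending on where Grafana runs.

```yaml
# Hypothetical GrafanaDatasource for the Prometheus row of the table.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus
spec:
  instanceSelector:
    matchLabels:
      grafana.internal/instance: grafana   # same selector as the dashboard example
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true                        # assumption
```

The ClickHouse and Alertmanager datasources would follow the same shape with their own `type` and `url` values.
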
### Alerting Pipeline

```
Prometheus Rules → Alertmanager → ntfy → Discord/Mobile
                        └─→ Email (future)
```

**Alert Categories:**
- Infrastructure: Node down, disk full, OOM
- Application: Error rate, latency SLO breach
- Security: Gatekeeper violations, vulnerability findings

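
As an illustration of the infrastructure category, a "node down" rule could look like the sketch below. The rule name, the `release` label, the `node-exporter` job name, and the 5-minute window are assumptions and not the rules actually deployed.

```yaml
# Hypothetical PrometheusRule for the "Node down" alert.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  labels:
    release: kube-prometheus-stack   # assumed rule selector label
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: NodeDown
          # Fires when a node-exporter target has been unreachable for 5 minutes.
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node exporter on {{ $labels.instance }} is unreachable"
```

Alertmanager can then route on the `severity` label to decide which alerts are pushed to ntfy.
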
## Dashboards

| Dashboard | Source | Purpose |
|-----------|--------|---------|
| Kubernetes Global | Grafana #15757 | Cluster overview |
| Node Exporter | Grafana #1860 | Node metrics |
| CNPG PostgreSQL | CNPG | Database health |
| Flux | Flux Operator | GitOps status |
| Cilium | Cilium | Network metrics |
| Envoy Gateway | Envoy | Ingress metrics |

## Resource Allocation

| Component | CPU Request | Memory Limit |
|-----------|-------------|--------------|
| Prometheus | 100m | 2Gi |
| OTEL Collector | 100m | 512Mi |
| ClickStack | 500m | 2Gi |
| Grafana | 100m | 256Mi |

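
Expressed as Helm values, the Prometheus row might translate to the fragment below. This is a sketch assuming the kube-prometheus-stack `prometheusSpec.resources` field; the table gives only a CPU request and a memory limit, so no CPU limit or memory request is set here.

```yaml
# Sketch: the Prometheus row of the table as kube-prometheus-stack values.
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 2Gi
```
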
## Future Enhancements

1. **Continuous Profiling** - Pyroscope for Go/Python profiling
2. **SLO Tracking** - Sloth for SLI/SLO automation
3. **Synthetic Monitoring** - Gatus for endpoint probing
4. **Cost Attribution** - OpenCost for resource cost tracking

## References

* [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
* [ClickHouse for Observability](https://clickhouse.com/docs/en/use-cases/observability)
* [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
* [Grafana Operator](https://grafana.github.io/grafana-operator/)