docs: add ADR-0038/0039 and replace llm-workflows references with decomposed repos
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
- ADR-0038: Infrastructure metrics collection (smartctl, SNMP, blackbox, unpoller)
- ADR-0039: Alerting and notification pipeline (Alertmanager → ntfy → Discord)
- Replace llm-workflows GitHub links with Gitea daviestechlabs org repos
- Update AGENT-ONBOARDING.md: remove llm-workflows from file tree, add missing repos
- Update ADR-0006: fix multi-repo reference
- Update ADR-0009: fix broken llm-workflows link
- Update ADR-0024: mark ray-serve repo as created, update historical context
- Update README: fix ADR-0016 status, add 0038/0039 to table, update badges
143
decisions/0038-infrastructure-metrics-collection.md
Normal file
@@ -0,0 +1,143 @@
# Infrastructure Metrics Collection Strategy

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

## Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

## Decision Drivers

* Early warning of hardware failures (disks, UPS battery, network gear)
* Visibility into power consumption and UPS status
* Network device monitoring without vendor lock-in
* LAN host reachability tracking for non-Kubernetes services
* Keep all metrics in a single Prometheus instance for unified querying

## Considered Options

1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller
2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts
3. **SNMP polling for everything** - unified SNMP-based collection
4. **External monitoring SaaS** - Uptime Robot, Datadog, etc.

## Decision Outcome

Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves.

### Positive Consequences

* Each exporter is purpose-built and well-maintained by its community
* Native Prometheus integration via ServiceMonitor/ScrapeConfig
* No agents needed on monitored devices
* All metrics queryable in a single Prometheus instance and Grafana
* Dedicated alerting rules per domain (disk health, UPS, LAN)

### Negative Consequences

* Multiple small deployments to maintain (one per exporter type)
* Each exporter has its own configuration format
* SNMP exporter requires SNMPv3 credential management

## Components

### Disk Health: smartctl-exporter

Monitors SMART attributes on all cluster node disks to detect early signs of failure.

| | |
|---|---|
| **Chart** | `prometheus-smartctl-exporter` v0.16.0 |
| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

**Alert rules (6):**
- `SmartDeviceHighTemperature`: disk temp > 65°C
- `SmartDeviceTestFailed`: SMART self-test failure
- `SmartDeviceCriticalWarning`: NVMe critical warning bit set
- `SmartDeviceMediaErrors`: NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold`: NVMe spare capacity low
- `SmartDeviceInterfaceSlow`: link not running at max negotiated speed

**Dashboard:** Smartctl Exporter (#22604)

### UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.

| | |
|---|---|
| **Chart** | `prometheus-snmp-exporter` v9.11.0 |
| **Target** | `ups.lab.daviestechlabs.io` |
| **Auth** | SNMPv3 credentials from Vault via ExternalSecret |

**Alert rules:**
- `UPSOnBattery`: critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery`: critical if UPS diagnostic test reports failure

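The `UPSOnBattery` condition combines two signals: the UPS must be running on battery, and the remaining runtime must be at or below 20 minutes. The following is a minimal Python sketch of that logic only. The field names `on_battery` and `minutes_remaining` are hypothetical; the deployed rule is a PromQL expression over SNMP metrics whose names are not listed in this ADR.

```python
from dataclasses import dataclass


@dataclass
class UpsSample:
    """One UPS reading. Field names are illustrative, not real SNMP metric names."""
    on_battery: bool
    minutes_remaining: float


# Threshold taken from the rule description: 20 minutes of runtime.
RUNTIME_THRESHOLD_MIN = 20.0


def ups_on_battery_critical(sample: UpsSample) -> bool:
    """Return True when the UPS is on battery with 20 minutes or less remaining."""
    return sample.on_battery and sample.minutes_remaining <= RUNTIME_THRESHOLD_MIN
```

Under this reading, a UPS on mains power never triggers the alert, regardless of the reported runtime.
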
**Dashboard:** CyberPower UPS (#12340)

The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.

### LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.

| | |
|---|---|
| **Chart** | `prometheus-blackbox-exporter` v11.7.0 |
| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |

**Probe targets:**
| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |

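For intuition, a `tcp_connect` style check amounts to attempting a TCP handshake within a timeout. The sketch below is not the Blackbox exporter's implementation: the function name and the 5-second default are assumptions, and it does not model the IPv4 preference or the metrics the exporter emits.

```python
import socket


def tcp_connect_probe(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `tcp_connect_probe("expanse.internal", 2049)` would correspond to the NFS check in the table above.
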
**Alert rules:**
- `LanProbeFailed`: critical if any LAN probe fails for 15 minutes

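The "fails for 15 minutes" wording means a single failed probe does not fire the alert; the failure has to persist. The sketch below approximates that hold behaviour in Python, under the assumption that any successful probe resets the timer. The function name and sample format are hypothetical, and Prometheus's own evaluation of a rule's `for` clause differs in detail.

```python
from typing import Iterable, Optional, Tuple

FOR_SECONDS = 15 * 60  # hold duration from the rule description


def probe_alert_fires(
    samples: Iterable[Tuple[float, bool]],
    for_seconds: float = FOR_SECONDS,
) -> bool:
    """Decide whether the alert fires at the time of the last sample.

    samples: (timestamp_seconds, probe_succeeded) pairs in time order.
    Fires only if the probe has failed continuously for at least for_seconds.
    """
    failing_since: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, ok in samples:
        last_ts = ts
        if ok:
            failing_since = None   # any success resets the timer
        elif failing_since is None:
            failing_since = ts     # first failure of a new streak
    if failing_since is None or last_ts is None:
        return False
    return last_ts - failing_since >= for_seconds
```

With this model, a host that briefly recovers partway through an outage restarts the 15-minute window.
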
### Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.

| | |
|---|---|
| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` |
| **Controller** | `https://192.168.100.254` |
| **Scrape interval** | 2 minutes (matches UniFi API poll rate) |

**Dashboards (5):**
- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)

### External Node Scraping

Static scrape targets for hosts that expose their own metrics outside the cluster (the NAS and the JetKVM device):

| Target | Port | Exporter |
|--------|------|----------|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | not listed | JetKVM device metrics |

Configured via `additionalScrapeConfigs` in kube-prometheus-stack.

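As an illustration of what such static targets look like in Prometheus `scrape_config` form, the Python sketch below builds one entry per exporter from the rows above. The `external-` job naming scheme is an assumption, not the deployed configuration, and the JetKVM target is left out because its port is not listed.

```python
import json

# (host, port, exporter) rows from the table above. JetKVM is omitted
# because its port is not listed.
TARGETS = [
    ("candlekeep.lab.daviestechlabs.io", 9100, "node-exporter"),
    ("candlekeep.lab.daviestechlabs.io", 9633, "smartctl-exporter"),
]


def build_scrape_configs(targets):
    """Build one Prometheus scrape_config entry per exporter row."""
    return [
        {
            "job_name": f"external-{exporter}",  # hypothetical naming scheme
            "static_configs": [{"targets": [f"{host}:{port}"]}],
        }
        for host, port, exporter in targets
    ]


if __name__ == "__main__":
    # Print as JSON, which YAML parsers also accept.
    print(json.dumps(build_scrape_configs(TARGETS), indent=2))
```

Generating the list from a table like this keeps the documented targets and the scrape configuration from drifting apart, though hand-written YAML works equally well at this scale.
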
## Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md)