

# Infrastructure Metrics Collection Strategy
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry
## Context and Problem Statement
Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?
## Decision Drivers
* Early warning of hardware failures (disks, UPS battery, network gear)
* Visibility into power consumption and UPS status
* Network device monitoring without vendor lock-in
* LAN host reachability tracking for non-Kubernetes services
* Keep all metrics in a single Prometheus instance for unified querying
## Considered Options
1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller
2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts
3. **SNMP polling for everything** - unified SNMP-based collection
4. **External monitoring SaaS** - Uptime Robot, Datadog, etc.
## Decision Outcome
Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is purpose-built for its domain, integrates natively with Prometheus through Prometheus Operator ServiceMonitor or ScrapeConfig resources, and requires little or no configuration on the monitored targets themselves.
### Positive Consequences
* Each exporter is purpose-built and well-maintained by its community
* Native Prometheus integration via ServiceMonitor/ScrapeConfig
* No agents needed on most monitored devices (the off-cluster NAS, which runs its own node and smartctl exporters, is the exception)
* All metrics queryable in a single Prometheus instance and Grafana
* Dedicated alerting rules per domain (disk health, UPS, LAN)
### Negative Consequences
* Multiple small deployments to maintain (one per exporter type)
* Each exporter has its own configuration format
* SNMP exporter requires SNMPv3 credential management
## Components
### Disk Health: smartctl-exporter
Monitors SMART attributes on the disks of all amd64 cluster nodes to detect early signs of failure.
| Setting | Value |
|---|---|
| **Chart** | `prometheus-smartctl-exporter` v0.16.0 |
| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

**Alert rules (6):**
- `SmartDeviceHighTemperature` — disk temp > 65°C
- `SmartDeviceTestFailed` — SMART self-test failure
- `SmartDeviceCriticalWarning` — NVMe critical warning bit set
- `SmartDeviceMediaErrors` — NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold` — NVMe spare capacity low
- `SmartDeviceInterfaceSlow` — link running below the maximum speed the device supports

**Dashboard:** Smartctl Exporter (#22604)
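
As an illustration, two of the six rules could be expressed as a PrometheusRule along the following lines. This is a sketch under stated assumptions, not the deployed rule set: the resource name, namespace, `for` durations, and severity labels are hypothetical, and the metric and label names are those exposed by recent smartctl_exporter releases, so they should be checked against the running version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-exporter-rules     # hypothetical name
  namespace: observability          # assumed namespace
spec:
  groups:
    - name: smartctl
      rules:
        - alert: SmartDeviceHighTemperature
          # Threshold from this ADR: current disk temperature above 65°C
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 10m                  # assumed hold time
          labels:
            severity: warning       # assumed severity
        - alert: SmartDeviceInterfaceSlow
          # Fires when the current link speed is lower than the device maximum
          expr: |
            smartctl_device_interface_speed{speed_type="current"}
              < on (instance, device)
            smartctl_device_interface_speed{speed_type="max"}
          for: 10m                  # assumed hold time
          labels:
            severity: warning       # assumed severity
```

The `on (instance, device)` clause is needed because the two series differ in their `speed_type` label and would otherwise never match.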
### UPS Monitoring: SNMP Exporter
Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.
| Setting | Value |
|---|---|
| **Chart** | `prometheus-snmp-exporter` v9.11.0 |
| **Target** | `ups.lab.daviestechlabs.io` |
| **Auth** | SNMPv3 credentials from Vault via ExternalSecret |

**Alert rules:**
- `UPSOnBattery` — critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery` — critical if UPS diagnostic test reports failure

**Dashboard:** CyberPower UPS (#12340)

The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.
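
For orientation, scraping a device through the SNMP exporter generally follows the pattern below: Prometheus calls the exporter's `/snmp` endpoint and passes the real device as a `target` parameter. This is a minimal sketch in plain Prometheus scrape-config form, not the chart configuration used here. The job name, module name, auth name, and exporter Service address are all assumptions.

```yaml
# Sketch of a Prometheus scrape job that reaches the UPS through snmp-exporter
- job_name: snmp-ups                      # hypothetical job name
  metrics_path: /snmp
  params:
    module: [cyberpower]                  # hypothetical module defined in snmp.yml
    auth: [ups_v3]                        # hypothetical SNMPv3 auth entry in snmp.yml
  static_configs:
    - targets:
        - ups.lab.daviestechlabs.io       # UPS target from this ADR
  relabel_configs:
    # Pass the UPS hostname to the exporter as ?target=...
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the UPS hostname as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Send the actual scrape request to the exporter itself
    - target_label: __address__
      replacement: snmp-exporter.observability.svc:9116   # assumed Service address
```

The SNMPv3 credentials themselves stay out of the scrape job; they live in the exporter's configuration, which is where the Vault-backed secret would be mounted.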
### LAN Probing: Blackbox Exporter
Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.
| Setting | Value |
|---|---|
| **Chart** | `prometheus-blackbox-exporter` v11.7.0 |
| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |

**Probe targets:**
| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |

**Alert rules:**
- `LanProbeFailed` — critical if any LAN probe fails for 15 minutes
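
One way to wire this up with the Prometheus Operator is a `Probe` resource per module plus a rule on `probe_success`. The sketch below is illustrative only: the resource names, probe interval, severity annotation text, and blackbox Service address are assumptions, and only one ICMP target is shown.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-icmp                          # hypothetical name
spec:
  module: icmp                            # blackbox module named in this ADR
  interval: 1m                            # assumed probe interval
  prober:
    url: blackbox-exporter.observability.svc:9115   # assumed Service address
  targets:
    staticConfig:
      static:
        - candlekeep.lab.daviestechlabs.io
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lan-probe-rules                   # hypothetical name
spec:
  groups:
    - name: lan-probes
      rules:
        - alert: LanProbeFailed
          # probe_success is 0 whenever the blackbox probe did not succeed
          expr: probe_success == 0
          for: 15m                        # 15-minute window from this ADR
          labels:
            severity: critical
          annotations:
            summary: "LAN probe failing for {{ $labels.instance }}"
```

The 15-minute `for` clause means a host that reboots or drops a few pings does not page; only a sustained outage does.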
### Network Equipment: Unpoller
Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.
| Setting | Value |
|---|---|
| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` |
| **Controller** | `https://192.168.100.254` |
| **Scrape interval** | 2 minutes (matches UniFi API poll rate) |

**Dashboards (5):**
- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)
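
The 2-minute scrape interval could be set on a ServiceMonitor roughly as follows. This is a sketch: the selector label, port name, and timeout are assumptions about how the Unpoller Service is defined, not values taken from the deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: unpoller                          # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: unpoller    # assumed Service label
  endpoints:
    - port: metrics                       # assumed port name on the Service
      interval: 2m                        # scrape interval from this ADR
      scrapeTimeout: 30s                  # assumed; must not exceed the interval
```

Scraping more often than the controller is polled would only return repeated samples, which is why the interval is matched to the poll rate.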
### External Node Scraping
Static scrape targets for hosts outside the cluster: the NAS, which runs its own exporters, and the JetKVM device:
| Target | Port | Exporter |
|--------|------|----------|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics |

Configured via `additionalScrapeConfigs` in kube-prometheus-stack.
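
In Helm values, the two NAS targets might look like the sketch below. The job names are hypothetical, and the JetKVM target is left out because its port is not specified in the table above.

```yaml
# kube-prometheus-stack values (sketch)
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node-exporter      # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9100
      - job_name: candlekeep-smartctl-exporter  # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9633
```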
## Metrics Coverage Summary
| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |
## Links
* Refines [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md)