# ADR-0038: Infrastructure Metrics Collection Strategy
- Status: accepted
- Date: 2026-02-09
- Deciders: Billy
- Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry
## Context and Problem Statement
Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.
How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?
## Decision Drivers
- Early warning of hardware failures (disks, UPS battery, network gear)
- Visibility into power consumption and UPS status
- Network device monitoring without vendor lock-in
- LAN host reachability tracking for non-Kubernetes services
- Keep all metrics in a single Prometheus instance for unified querying
## Considered Options
- Purpose-built Prometheus exporters per domain - smartctl, SNMP, blackbox, unpoller
- Agent-based monitoring (Telegraf/Datadog agent) - deploy agents to all hosts
- SNMP polling for everything - unified SNMP-based collection
- External monitoring SaaS - Uptime Robot, Datadog, etc.
## Decision Outcome
Chosen option: Option 1 - Purpose-built Prometheus exporters per domain, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves.
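A minimal sketch of how one of these exporters plugs into Prometheus via a ServiceMonitor (names, labels, and the scrape interval here are illustrative assumptions, not the actual manifests in this repo):

```yaml
# Illustrative ServiceMonitor: points kube-prometheus-stack at an
# exporter's metrics Service. Name, namespace, and labels are assumed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: smartctl-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: smartctl-exporter
  endpoints:
    - port: metrics
      interval: 60s
```

Because the operator discovers ServiceMonitors automatically, adding a new exporter domain is just another small manifest; no changes are needed on the monitored targets.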
### Positive Consequences
- Each exporter is purpose-built and well-maintained by its community
- Native Prometheus integration via ServiceMonitor/ScrapeConfig
- No agents needed on monitored devices
- All metrics queryable in a single Prometheus instance and Grafana
- Dedicated alerting rules per domain (disk health, UPS, LAN)
### Negative Consequences
- Multiple small deployments to maintain (one per exporter type)
- Each exporter has its own configuration format
- SNMP exporter requires SNMPv3 credential management
## Components
### Disk Health: smartctl-exporter
Monitors SMART attributes on all cluster node disks to detect early signs of failure.
| Setting | Value |
|---|---|
| Chart | prometheus-smartctl-exporter v0.16.0 |
| Scope | All amd64 nodes (excludes Raspberry Pi / ARM workers) |
Alert rules (6):
- `SmartDeviceHighTemperature`: disk temp > 65°C
- `SmartDeviceTestFailed`: SMART self-test failure
- `SmartDeviceCriticalWarning`: NVMe critical warning bit set
- `SmartDeviceMediaErrors`: NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold`: NVMe spare capacity low
- `SmartDeviceInterfaceSlow`: link not running at max negotiated speed
Dashboard: Smartctl Exporter (#22604)
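As an example, the high-temperature rule could be expressed as a PrometheusRule using the exporter's published `smartctl_device_temperature` metric. This is a hedged sketch; the exact expression, `for` duration, and severity in the repo may differ:

```yaml
# Illustrative PrometheusRule for the disk-temperature alert.
# Threshold matches the rule list above; the 5m hold is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-exporter
  namespace: monitoring
spec:
  groups:
    - name: smartctl
      rules:
        - alert: SmartDeviceHighTemperature
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: >-
              Disk {{ $labels.device }} on {{ $labels.instance }} is above 65°C
```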
### UPS Monitoring: SNMP Exporter
Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.
| Setting | Value |
|---|---|
| Chart | prometheus-snmp-exporter v9.11.0 |
| Target | ups.lab.daviestechlabs.io |
| Auth | SNMPv3 credentials from Vault via ExternalSecret |
Alert rules:
- `UPSOnBattery`: critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery`: critical if UPS diagnostic test reports failure
Dashboard: CyberPower UPS (#12340)
The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.
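The SNMP exporter works as a relay: Prometheus scrapes the exporter, passing the real device as a parameter. A sketch of that scrape job (the module and auth names are assumptions; the relabel pattern is the standard snmp_exporter idiom):

```yaml
# Illustrative scrape config for the SNMP exporter relay pattern.
- job_name: snmp-ups
  static_configs:
    - targets:
        - ups.lab.daviestechlabs.io    # the UPS itself
  metrics_path: /snmp
  params:
    module: [cyberpower]               # assumed module name
    auth: [ups_v3]                     # assumed SNMPv3 auth block
  relabel_configs:
    # Rewrite the target: the UPS address becomes a query parameter,
    # and the scrape itself goes to the exporter's Service.
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: prometheus-snmp-exporter:9116   # assumed Service name
```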
### LAN Probing: Blackbox Exporter
Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.
| Setting | Value |
|---|---|
| Chart | prometheus-blackbox-exporter v11.7.0 |
| Modules | http_2xx, icmp, tcp_connect (all IPv4-preferred) |
Probe targets:
| Type | Targets |
|---|---|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | expanse.internal:2049 (NFS service) |
Alert rules:
- `LanProbeFailed`: critical if any LAN probe fails for 15 minutes
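The LAN probes above can be wired up with the Prometheus Operator's Probe resource, which has Prometheus scrape the blackbox exporter once per target. A sketch (Service name, namespace, and the two sample targets are drawn from the table above; everything else is an assumption):

```yaml
# Illustrative Probe resource: ICMP-probe LAN hosts through the
# blackbox exporter. Alerting keys off probe_success == 0.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-icmp
  namespace: monitoring
spec:
  module: icmp
  prober:
    url: prometheus-blackbox-exporter:9115   # assumed Service name
  targets:
    staticConfig:
      static:
        - candlekeep
        - drizzt
```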
### Network Equipment: Unpoller
Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.
| Setting | Value |
|---|---|
| Image | ghcr.io/unpoller/unpoller:v2.33.0 |
| Controller | https://192.168.100.254 |
| Scrape interval | 2 minutes (matches UniFi API poll rate) |
Dashboards (5):
- UniFi PDU (#23027)
- Insights (#11315)
- Network Sites (#11311)
- UAP (#11314)
- USW (#11312)
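Unpoller is configured through `UP_*` environment variables. A hedged sketch of the relevant Deployment snippet, using the image and controller address above (the Secret name and credential sourcing are assumptions):

```yaml
# Illustrative unpoller container spec: controller access via env vars.
containers:
  - name: unpoller
    image: ghcr.io/unpoller/unpoller:v2.33.0
    env:
      - name: UP_UNIFI_DEFAULT_URL
        value: "https://192.168.100.254"
      - name: UP_UNIFI_DEFAULT_USER
        valueFrom:
          secretKeyRef:
            name: unpoller-credentials   # assumed Secret name
            key: username
      - name: UP_UNIFI_DEFAULT_PASS
        valueFrom:
          secretKeyRef:
            name: unpoller-credentials
            key: password
    ports:
      - name: metrics
        containerPort: 9130              # unpoller's Prometheus port
```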
### External Node Scraping
Static scrape targets for hosts running their own exporters outside the cluster (the NAS and the JetKVM):
| Target | Port | Exporter |
|---|---|---|
| candlekeep.lab.daviestechlabs.io | 9100 | node-exporter |
| candlekeep.lab.daviestechlabs.io | 9633 | smartctl-exporter |
| jetkvm.lab.daviestechlabs.io | — | JetKVM device metrics |
Configured via `additionalScrapeConfigs` in kube-prometheus-stack.
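A hedged sketch of what that Helm values fragment could look like, using the candlekeep targets from the table above (job names are assumptions; the JetKVM target is omitted here because its port isn't listed):

```yaml
# Illustrative kube-prometheus-stack values fragment for static
# off-cluster scrape targets.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9100"]
      - job_name: candlekeep-smartctl
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9633"]
```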
## Metrics Coverage Summary
| Domain | Exporter | Key Signals |
|---|---|---|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |