Infrastructure Metrics Collection Strategy

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

Decision Drivers

  • Early warning of hardware failures (disks, UPS battery, network gear)
  • Visibility into power consumption and UPS status
  • Network device monitoring without vendor lock-in
  • LAN host reachability tracking for non-Kubernetes services
  • Keep all metrics in a single Prometheus instance for unified querying

Considered Options

  1. Purpose-built Prometheus exporters per domain - smartctl, SNMP, blackbox, unpoller
  2. Agent-based monitoring (Telegraf/Datadog agent) - deploy agents to all hosts
  3. SNMP polling for everything - unified SNMP-based collection
  4. External monitoring SaaS - Uptime Robot, Datadog, etc.

Decision Outcome

Chosen option: Option 1 - Purpose-built Prometheus exporters per domain, because each exporter is best-in-class for its domain, integrates with Prometheus through ServiceMonitor or ScrapeConfig resources, and requires no agent on the monitored devices.

Positive Consequences

  • Each exporter is purpose-built and well-maintained by its community
  • Native Prometheus integration via ServiceMonitor/ScrapeConfig
  • No agents needed on monitored devices
  • All metrics queryable in a single Prometheus instance and Grafana
  • Dedicated alerting rules per domain (disk health, UPS, LAN)

Negative Consequences

  • Multiple small deployments to maintain (one per exporter type)
  • Each exporter has its own configuration format
  • SNMP exporter requires SNMPv3 credential management

Components

Disk Health: smartctl-exporter

Monitors SMART attributes on the disks of every amd64 cluster node to detect early signs of failure.

  • Chart: prometheus-smartctl-exporter v0.16.0
  • Scope: All amd64 nodes (excludes Raspberry Pi / ARM workers)

Alert rules (6):

  • SmartDeviceHighTemperature — disk temp > 65°C
  • SmartDeviceTestFailed — SMART self-test failure
  • SmartDeviceCriticalWarning — NVMe critical warning bit set
  • SmartDeviceMediaErrors — NVMe media/integrity errors
  • SmartDeviceAvailableSpareUnderThreshold — NVMe spare capacity low
  • SmartDeviceInterfaceSlow — link not running at max negotiated speed
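
The six conditions above can be sketched as plain predicates. This is a minimal illustration in Python, not the actual Prometheus rules: the `DiskReading` fields are hypothetical stand-ins for the exporter's metrics, and the exact comparisons (for example, treating any non-zero media error count as a hit) are assumptions based only on the rule descriptions.

```python
from dataclasses import dataclass

# Threshold taken from the SmartDeviceHighTemperature description.
TEMPERATURE_LIMIT_C = 65


@dataclass(frozen=True)
class DiskReading:
    """One snapshot of a disk. Field names are illustrative, not metric names."""
    temperature_c: float
    self_test_passed: bool
    critical_warning: int        # NVMe critical warning bitfield, 0 = clear
    media_errors: int
    available_spare_pct: float
    spare_threshold_pct: float
    link_speed_current: float
    link_speed_max: float


def triggered_alerts(disk: DiskReading) -> list[str]:
    """Return the names of the alert conditions this reading satisfies."""
    checks = {
        "SmartDeviceHighTemperature": disk.temperature_c > TEMPERATURE_LIMIT_C,
        "SmartDeviceTestFailed": not disk.self_test_passed,
        "SmartDeviceCriticalWarning": disk.critical_warning != 0,
        # Assumption: any recorded media error counts as a hit.
        "SmartDeviceMediaErrors": disk.media_errors > 0,
        "SmartDeviceAvailableSpareUnderThreshold":
            disk.available_spare_pct < disk.spare_threshold_pct,
        "SmartDeviceInterfaceSlow":
            disk.link_speed_current < disk.link_speed_max,
    }
    return [name for name, hit in checks.items() if hit]
```

A reading at exactly 65°C does not trigger the temperature condition, since the rule is described as strictly greater than 65°C.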

Dashboard: Smartctl Exporter (#22604)

UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.

  • Chart: prometheus-snmp-exporter v9.11.0
  • Target: ups.lab.daviestechlabs.io
  • Auth: SNMPv3 credentials from Vault via ExternalSecret

Alert rules:

  • UPSOnBattery — critical if ≤ 20 min battery remaining while on battery power
  • UPSReplaceBattery — critical if UPS diagnostic test reports failure
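
The UPSOnBattery condition combines two signals: the UPS must be running on battery, and the remaining runtime must be at or below 20 minutes. A minimal Python sketch of that logic follows; the field names `on_battery` and `runtime_minutes` are hypothetical and do not correspond to real SNMP metric names, and the real rule is evaluated by Prometheus rather than by code like this.

```python
from dataclasses import dataclass

# Threshold from the alert description: 20 minutes of runtime remaining.
RUNTIME_THRESHOLD_MINUTES = 20


@dataclass(frozen=True)
class UpsSample:
    """One reading of UPS state. Field names are illustrative only."""
    on_battery: bool
    runtime_minutes: float


def ups_on_battery_critical(sample: UpsSample) -> bool:
    """True when the UPS is on battery and runtime is <= the threshold."""
    return sample.on_battery and sample.runtime_minutes <= RUNTIME_THRESHOLD_MINUTES
```

Requiring both signals keeps the alert quiet when the reported runtime is low but the UPS is still on mains power.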

Dashboard: CyberPower UPS (#12340)

The UPS load metric (upsHighPrecOutputLoad) is also consumed by Kromgo to display cluster power usage as an SVG badge.

LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.

  • Chart: prometheus-blackbox-exporter v11.7.0
  • Modules: http_2xx, icmp, tcp_connect (all IPv4-preferred)

Probe targets:

| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | expanse.internal:2049 (NFS service) |
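
Blackbox exporter probes are driven by requests to its `/probe` endpoint, with the host passed as `target` and the probe type as `module`. The sketch below builds those URLs for a few of the targets in the table above. The exporter address `BLACKBOX_URL` is an assumed in-cluster service name, not taken from this ADR, and only a subset of hosts is listed for brevity.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster address of the blackbox exporter (assumption).
# 9115 is the exporter's default listen port.
BLACKBOX_URL = "http://blackbox-exporter.monitoring.svc:9115"

# Subset of targets taken from the probe table.
ICMP_TARGETS = ["candlekeep", "bruenor"]
TCP_TARGETS = ["expanse.internal:2049"]


def probe_url(target: str, module: str, base: str = BLACKBOX_URL) -> str:
    """Build the /probe URL requested for one target and module."""
    query = urlencode({"target": target, "module": module})
    return f"{base}/probe?{query}"


def all_probe_urls() -> list[str]:
    """ICMP probes first, then TCP connect probes."""
    urls = [probe_url(t, "icmp") for t in ICMP_TARGETS]
    urls += [probe_url(t, "tcp_connect") for t in TCP_TARGETS]
    return urls
```

In a Prometheus setup these URLs are normally produced by relabeling rules or a Probe resource rather than built by hand; the sketch only shows the request shape.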

Alert rules:

  • LanProbeFailed — critical if any LAN probe fails for 15 minutes
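
The "for 15 minutes" qualifier means a single failed probe does not fire the alert; the failure has to persist. A simplified Python sketch of that behaviour is shown below. It is an approximation of how a Prometheus `for` clause behaves, not a reimplementation: real rules are evaluated at fixed intervals and pass through a pending state, and the sample format here is hypothetical.

```python
# Duration from the LanProbeFailed description.
FOR_SECONDS = 15 * 60


def lan_probe_alert_firing(samples: list[tuple[int, bool]]) -> bool:
    """Decide whether the alert would be firing at the newest sample.

    samples: (unix_seconds, probe_succeeded) pairs, oldest first.
    Returns True only if the trailing run of failures spans at least
    FOR_SECONDS; any success inside that window resets the run.
    """
    if not samples:
        return False
    last_ts = samples[-1][0]
    failing_since = None
    # Walk backwards until the most recent successful probe.
    for ts, ok in reversed(samples):
        if ok:
            break
        failing_since = ts
    if failing_since is None:
        return False
    return last_ts - failing_since >= FOR_SECONDS
```

A host that drops one probe and recovers on the next never reaches the 15 minute mark, which is the point of the delay.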

Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.

  • Image: ghcr.io/unpoller/unpoller:v2.33.0
  • Controller: https://192.168.100.254
  • Scrape interval: 2 minutes (matches UniFi API poll rate)

Dashboards (5):

  • UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)

External Node Scraping

Static scrape targets for hosts outside the cluster that expose their own metrics endpoints (the NAS and the JetKVM device):

| Target | Port | Exporter |
|--------|------|----------|
| candlekeep.lab.daviestechlabs.io | 9100 | node-exporter |
| candlekeep.lab.daviestechlabs.io | 9633 | smartctl-exporter |
| jetkvm.lab.daviestechlabs.io | | JetKVM device metrics |

Configured via additionalScrapeConfigs in kube-prometheus-stack.
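
As a rough sketch of what such static scrape jobs look like, the Python below builds one job per exporter from the table above and prints the result as JSON, which is also valid YAML. The job names are hypothetical, and the JetKVM row is left out because its port is not listed here. The actual values file for kube-prometheus-stack may be organised differently.

```python
import json

# Host:port pairs from the target table. JetKVM is omitted because the
# source does not give its port.
EXTERNAL_TARGETS = {
    "node-exporter": ["candlekeep.lab.daviestechlabs.io:9100"],
    "smartctl-exporter": ["candlekeep.lab.daviestechlabs.io:9633"],
}


def build_scrape_configs(targets: dict[str, list[str]]) -> list[dict]:
    """Return one static scrape job per exporter (job names are made up)."""
    return [
        {"job_name": f"external-{name}", "static_configs": [{"targets": hosts}]}
        for name, hosts in targets.items()
    ]


if __name__ == "__main__":
    # The printed list has the shape Prometheus expects under scrape_configs.
    print(json.dumps(build_scrape_configs(EXTERNAL_TARGETS), indent=2))
```

Keeping one job per exporter means the `job` label distinguishes node metrics from disk metrics for the same host.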

Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |