Infrastructure Metrics Collection Strategy

Status: accepted
Date: 2026-02-09
Deciders: Billy
Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

Decision Drivers

Early warning of hardware failures (disks, UPS battery, network gear)
Visibility into power consumption and UPS status
Network device monitoring without vendor lock-in
LAN host reachability tracking for non-Kubernetes services
Keep all metrics in a single Prometheus instance for unified querying

Considered Options

1. Purpose-built Prometheus exporters per domain - smartctl, SNMP, blackbox, unpoller
2. Agent-based monitoring (Telegraf/Datadog agent) - deploy agents to all hosts
3. SNMP polling for everything - unified SNMP-based collection
4. External monitoring SaaS - Uptime Robot, Datadog, etc.

Decision Outcome

Chosen option: Option 1 - Purpose-built Prometheus exporters per domain, because each exporter is best-in-class for its domain, integrates natively with Prometheus through ServiceMonitor or ScrapeConfig resources, and requires no agent on the monitored network devices and UPS. Only the NAS runs its own exporters, which Prometheus scrapes as static targets.

Positive Consequences

Each exporter is purpose-built and well-maintained by its community
Native Prometheus integration via ServiceMonitor/ScrapeConfig
No agents needed on the UPS, UniFi gear, or probed LAN hosts; only the NAS runs its own exporters
All metrics queryable in a single Prometheus instance and Grafana
Dedicated alerting rules per domain (disk health, UPS, LAN)

Negative Consequences

Multiple small deployments to maintain (one per exporter type)
Each exporter has its own configuration format
SNMP exporter requires SNMPv3 credential management

Components

Disk Health: smartctl-exporter

Monitors SMART attributes on all cluster node disks to detect early signs of failure.


| Property | Value |
|---|---|
| Chart | `prometheus-smartctl-exporter` v0.16.0 |
| Scope | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

Alert rules (6):

SmartDeviceHighTemperature — disk temp > 65°C
SmartDeviceTestFailed — SMART self-test failure
SmartDeviceCriticalWarning — NVMe critical warning bit set
SmartDeviceMediaErrors — NVMe media/integrity errors
SmartDeviceAvailableSpareUnderThreshold — NVMe spare capacity low
SmartDeviceInterfaceSlow — link not running at max negotiated speed
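
As an illustration of how one of these rules might be expressed, the sketch below defines SmartDeviceHighTemperature as a Prometheus Operator PrometheusRule. Only the alert name and the 65°C threshold come from the list above. The resource name, namespace, `for` duration, and severity are assumptions, and the expression assumes the exporter publishes `smartctl_device_temperature` with a `temperature_type` label, which should be checked against the deployed exporter version.

```yaml
# Sketch only: metadata, "for", and severity are assumptions, not the deployed rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-exporter-rules      # hypothetical name
  namespace: observability           # hypothetical namespace
spec:
  groups:
    - name: smartctl.rules
      rules:
        - alert: SmartDeviceHighTemperature
          # Fires when the current drive temperature exceeds 65 degrees Celsius.
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 15m                   # assumed hold time to avoid flapping
          labels:
            severity: warning        # assumed severity
          annotations:
            summary: "Disk {{ $labels.device }} on {{ $labels.instance }} is above 65°C"
```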

Dashboard: Smartctl Exporter (#22604)

UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.


| Property | Value |
|---|---|
| Chart | `prometheus-snmp-exporter` v9.11.0 |
| Target | `ups.lab.daviestechlabs.io` |
| Auth | SNMPv3 credentials from Vault via ExternalSecret |

Alert rules:

UPSOnBattery — critical if ≤ 20 min battery remaining while on battery power
UPSReplaceBattery — critical if UPS diagnostic test reports failure

Dashboard: CyberPower UPS (#12340)

The UPS load metric (upsHighPrecOutputLoad) is also consumed by Kromgo to display cluster power usage as an SVG badge.

LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.


| Property | Value |
|---|---|
| Chart | `prometheus-blackbox-exporter` v11.7.0 |
| Modules | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |

Probe targets:

| Type | Targets |
|---|---|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |
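
One way to wire targets like these to the blackbox exporter is the Prometheus Operator `Probe` resource. The following is a minimal sketch, not the deployed configuration. The resource names, the prober service address, and the interval are assumptions. Host names are copied as written in the table and may need to be fully qualified in practice, and only two of the ICMP hosts are shown.

```yaml
# Sketch only: names, prober URL, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-icmp                     # hypothetical name
spec:
  jobName: lan-icmp
  module: icmp                       # blackbox module listed above
  interval: 1m                       # assumed
  prober:
    url: blackbox-exporter.observability.svc.cluster.local:9115   # assumed service address
  targets:
    staticConfig:
      static:
        - candlekeep
        - jetkvm
---
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-tcp                      # hypothetical name
spec:
  jobName: lan-tcp
  module: tcp_connect                # blackbox module listed above
  interval: 1m                       # assumed
  prober:
    url: blackbox-exporter.observability.svc.cluster.local:9115   # assumed service address
  targets:
    staticConfig:
      static:
        - expanse.internal:2049      # NFS service from the table
```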

Alert rules:

LanProbeFailed — critical if any LAN probe fails for 15 minutes
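
A rule with this behaviour could look roughly like the sketch below. It relies on `probe_success`, the gauge the blackbox exporter sets to 1 or 0 for each probe. The 15-minute window and critical severity come from the description above. The resource name and annotation text are assumptions.

```yaml
# Sketch only: resource name and annotation wording are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-lan-rules           # hypothetical name
spec:
  groups:
    - name: lan-probes.rules
      rules:
        - alert: LanProbeFailed
          # probe_success is 0 whenever the blackbox probe for a target fails.
          expr: probe_success == 0
          for: 15m                   # failure must persist for 15 minutes
          labels:
            severity: critical
          annotations:
            summary: "LAN probe for {{ $labels.instance }} has failed for 15 minutes"
```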

Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.


| Property | Value |
|---|---|
| Image | `ghcr.io/unpoller/unpoller:v2.33.0` |
| Controller | `https://192.168.100.254` |
| Scrape interval | 2 minutes (matches UniFi API poll rate) |
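
The 2-minute scrape interval would typically be set on the resource that tells Prometheus to scrape Unpoller. The sketch below assumes a ServiceMonitor is used. The resource name, selector label, and port name are hypothetical; only the interval is taken from the table above.

```yaml
# Sketch only: name, selector label, and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: unpoller                     # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: unpoller   # assumed Service label
  endpoints:
    - port: metrics                  # assumed name of the Service port
      interval: 2m                   # matches the UniFi API poll rate
      scrapeTimeout: 30s             # assumed
```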

Dashboards (5):

UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)

External Node Scraping

Static scrape targets for off-cluster hosts that expose their own metrics endpoints: the NAS (candlekeep), which runs node-exporter and smartctl-exporter, and the JetKVM device:

| Target | Port | Exporter |
|---|---|---|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics |

Configured via additionalScrapeConfigs in kube-prometheus-stack.
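
For reference, a Helm values fragment along these lines would add the two NAS targets. This is a sketch under assumptions: the job names are hypothetical, and the JetKVM target is left out because its port is not given in the table above.

```yaml
# Sketch only: job names are assumptions; hosts and ports come from the table above.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node-exporter       # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9100
      - job_name: candlekeep-smartctl-exporter   # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9633
```

Keeping one job per exporter keeps the `job` label meaningful in queries, at the cost of a slightly longer values file.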

Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|---|---|---|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |
