homelab-design/decisions/0038-infrastructure-metrics-collection.md

# Infrastructure Metrics Collection Strategy

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

## Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

## Decision Drivers

* Early warning of hardware failures (disks, UPS battery, network gear)
* Visibility into power consumption and UPS status
* Network device monitoring without vendor lock-in
* LAN host reachability tracking for non-Kubernetes services
* Keep all metrics in a single Prometheus instance for unified querying

## Considered Options

1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller
2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts
3. **SNMP polling for everything** - unified SNMP-based collection
4. **External monitoring SaaS** - Uptime Robot, Datadog, etc.

## Decision Outcome

Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is purpose-built for its domain, they integrate natively with Prometheus Operator ServiceMonitors, and they require no agent software on the monitored devices themselves. The UPS is polled over SNMP, UniFi gear through its controller, and LAN hosts are probed from inside the cluster.

### Positive Consequences

* Each exporter is purpose-built and well-maintained by its community
* Native Prometheus integration via ServiceMonitor/ScrapeConfig
* No agents needed on monitored devices (the NAS is the exception: it runs its own node-exporter and smartctl-exporter)
* All metrics queryable in a single Prometheus instance and Grafana
* Dedicated alerting rules per domain (disk health, UPS, LAN)

### Negative Consequences

* Multiple small deployments to maintain (one per exporter type)
* Each exporter has its own configuration format
* SNMP exporter requires SNMPv3 credential management

## Components

### Disk Health: smartctl-exporter

Monitors SMART attributes on the disks of every amd64 cluster node to detect early signs of failure.

| | |
|---|---|
| **Chart** | `prometheus-smartctl-exporter` v0.16.0 |
| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

**Alert rules (6):**
- `SmartDeviceHighTemperature` — disk temp > 65°C
- `SmartDeviceTestFailed` — SMART self-test failure
- `SmartDeviceCriticalWarning` — NVMe critical warning bit set
- `SmartDeviceMediaErrors` — NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold` — NVMe spare capacity low
- `SmartDeviceInterfaceSlow` — link not running at max negotiated speed

**Dashboard:** Smartctl Exporter (#22604)
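
As a rough illustration of how two of the rules above might be expressed, the sketch below uses a Prometheus Operator `PrometheusRule`. The metric names follow what smartctl_exporter is understood to expose. The resource name, `for` durations, and severities are assumptions, not the deployed values; only the 65°C threshold comes from this ADR.

```yaml
# Illustrative sketch only; not the rules deployed in this cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-exporter-rules        # hypothetical name
spec:
  groups:
    - name: smartctl
      rules:
        - alert: SmartDeviceHighTemperature
          # 65 is the threshold stated above; the selector assumes the
          # exporter's temperature_type label.
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 5m                      # assumed
          labels:
            severity: warning          # assumed
        - alert: SmartDeviceAvailableSpareUnderThreshold
          # Fires when NVMe spare capacity drops below the drive's own threshold.
          expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
          for: 15m                     # assumed
          labels:
            severity: warning          # assumed
```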

### UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.

| | |
|---|---|
| **Chart** | `prometheus-snmp-exporter` v9.11.0 |
| **Target** | `ups.lab.daviestechlabs.io` |
| **Auth** | SNMPv3 credentials from Vault via ExternalSecret |
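
One common way to route a single SNMP target through the exporter is the relabeling pattern from the snmp_exporter documentation, sketched here as a Prometheus scrape job. The job name, the `module` and `auth` names, and the exporter service address are hypothetical; only the target hostname comes from this ADR.

```yaml
# Illustrative scrape job; names marked "hypothetical" are not from this ADR.
- job_name: snmp-ups                   # hypothetical
  metrics_path: /snmp
  params:
    module: [cyberpower]               # hypothetical module name in snmp.yml
    auth: [ups_v3]                     # hypothetical SNMPv3 auth entry in snmp.yml
  static_configs:
    - targets:
        - ups.lab.daviestechlabs.io
  relabel_configs:
    # Pass the UPS hostname to the exporter as the ?target= parameter.
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the UPS hostname as the instance label on the resulting series.
    - source_labels: [__param_target]
      target_label: instance
    # Send the actual HTTP scrape to the exporter, not to the UPS.
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc:9116   # hypothetical service address
```

With this pattern the SNMPv3 credentials live only in the exporter's configuration, which fits the Vault-backed ExternalSecret approach noted in the table.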

**Alert rules:**
- `UPSOnBattery` — critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery` — critical if UPS diagnostic test reports failure

**Dashboard:** CyberPower UPS (#12340)

The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.

### LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.

| | |
|---|---|
| **Chart** | `prometheus-blackbox-exporter` v11.7.0 |
| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |
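
A minimal sketch of how the three modules could be declared in the blackbox exporter configuration, with IPv4 preferred in each. The timeouts are assumed values, not taken from this ADR.

```yaml
# Illustrative blackbox exporter module definitions.
modules:
  http_2xx:
    prober: http
    timeout: 5s                        # assumed
    http:
      preferred_ip_protocol: ip4
  icmp:
    prober: icmp
    timeout: 5s                        # assumed
    icmp:
      preferred_ip_protocol: ip4
  tcp_connect:
    prober: tcp
    timeout: 5s                        # assumed
    tcp:
      preferred_ip_protocol: ip4
```

ICMP probing generally needs raw-socket privileges in the exporter container (for example the `NET_RAW` capability), which is worth checking if ICMP probes fail while TCP probes succeed.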

**Probe targets:**

| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |

**Alert rules:**
- `LanProbeFailed` — critical if any LAN probe fails for 15 minutes
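
The rule above could look roughly like the following Prometheus rule group. `probe_success` is the standard blackbox exporter result metric; the group name is hypothetical, and the deployed rule may filter on specific jobs or targets.

```yaml
# Illustrative rule group; not necessarily the deployed rule.
groups:
  - name: lan-probes                   # hypothetical
    rules:
      - alert: LanProbeFailed
        # probe_success is 0 whenever a blackbox probe fails.
        expr: probe_success == 0
        # Require 15 minutes of continuous failure before firing.
        for: 15m
        labels:
          severity: critical
```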

### Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.

| | |
|---|---|
| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` |
| **Controller** | `https://192.168.100.254` |
| **Scrape interval** | 2 minutes (matches UniFi API poll rate) |
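
The 2-minute interval could be set on a `ServiceMonitor` endpoint, as in this sketch. The resource name, label selector, port name, and timeout are assumptions; only the interval comes from the table above.

```yaml
# Illustrative ServiceMonitor; names and labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: unpoller                       # hypothetical
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: unpoller # hypothetical label
  endpoints:
    - port: metrics                    # hypothetical port name on the Service
      interval: 2m                     # matches the UniFi API poll rate
      scrapeTimeout: 30s               # assumed
```

A 2-minute interval stays below Prometheus's default 5-minute lookback window, so instant queries and dashboards should not see gaps between scrapes.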

**Dashboards (5):**
- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)

### External Node Scraping

Static scrape targets for hosts outside the cluster that expose their own metrics endpoints: the NAS (`candlekeep`), which runs its own exporters, and the JetKVM device.

| Target | Port | Exporter |
|--------|------|----------|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics |

Configured via `additionalScrapeConfigs` in kube-prometheus-stack.
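
A sketch of how the two NAS targets might appear in the kube-prometheus-stack Helm values. The job names are hypothetical. The JetKVM target is left out because its port is not specified in the table above.

```yaml
# Illustrative kube-prometheus-stack values fragment.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node      # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9100
      - job_name: candlekeep-smartctl  # hypothetical job name
        static_configs:
          - targets:
              - candlekeep.lab.daviestechlabs.io:9633
```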

## Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |

## Links

* Refines [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md)