Alerting and Notification Pipeline
- Status: accepted
- Date: 2026-02-09
- Deciders: Billy
- Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab
Context and Problem Statement
A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.
How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?
Decision Drivers
- Critical alerts must reach the operator within seconds (mobile push + Discord)
- Alert fatigue must be minimized — suppress known-noisy alerts declaratively
- The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
- Alert routing must be GitOps-managed and version-controlled
- Uptime monitoring needs a public-facing status page
Considered Options
- Alertmanager → ntfy → ntfy-discord bridge with Silence Operator and Gatus
- Alertmanager → Discord webhook directly with manual silences
- Alertmanager → Grafana OnCall for incident management
- External SaaS (PagerDuty, Opsgenie)
Decision Outcome
Chosen option: Option 1 - Alertmanager → ntfy → ntfy-discord bridge with declarative silence management via Silence Operator and Gatus for uptime monitoring.
ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.
Positive Consequences
- Self-hosted pipeline with no SaaS alerting dependency (Discord is only a delivery target)
- ntfy provides mobile push without app-specific integrations
- Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics
- Silence rules are version-controlled Kubernetes resources
- Gatus provides a public status page independent of the alerting pipeline
Negative Consequences
- Custom bridge service (ntfy-discord) to maintain
- ntfy is a single point of failure for notifications (persistent storage keeps the message cache across restarts but does not cover downtime)
- No built-in on-call rotation or escalation (acceptable for single operator)
Architecture
┌──────────────────────────────────────────────────────────────┐
│                        ALERT SOURCES                         │
│                                                              │
│ PrometheusRules  Gatus Endpoint       Custom Webhooks        │
│ (metric alerts)     Monitors            (CI, etc.)           │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                             ntfy                             │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ← Alertmanager webhooks              │
│    gatus                ← Gatus endpoint failures            │
│    gitea-ci             ← CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                     ntfy-discord bridge                      │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to:   Discord webhooks (per-topic channels)        │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
Component Details
Alertmanager Routing
Configured via AlertmanagerConfig in kube-prometheus-stack:
| Severity | ntfy Priority | Tags | Behavior |
|---|---|---|---|
| critical | urgent | rotating_light, alert | Immediate push + Discord |
| warning | high | warning | Push + Discord |
| All others | — | — | Routed to null-receiver (dropped) |
The webhook sends to http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts with Alertmanager template expansion for human-readable messages.
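The routing table above can be read as a small lookup from alert severity to ntfy publish headers. The sketch below is illustrative only and is not the deployed AlertmanagerConfig: the function name `ntfy_headers_for` and the choice of `alertname` as the title are assumptions. `Title`, `Priority`, and `Tags` are header names accepted by ntfy's publish API.

```python
from typing import Optional

# Severity -> (ntfy priority, ntfy tags), mirroring the routing table.
SEVERITY_ROUTES = {
    "critical": ("urgent", ["rotating_light", "alert"]),
    "warning": ("high", ["warning"]),
}


def ntfy_headers_for(labels: dict) -> Optional[dict]:
    """Return ntfy publish headers for an alert, or None if it is dropped."""
    route = SEVERITY_ROUTES.get(labels.get("severity", ""))
    if route is None:
        return None  # everything else goes to the null receiver
    priority, tags = route
    return {
        "Title": labels.get("alertname", "alert"),  # hypothetical title format
        "Priority": priority,
        "Tags": ",".join(tags),
    }
```

Under these assumptions, an alert labeled `severity: critical` maps to priority `urgent`, while an alert with any other severity (or none) returns `None` and produces no notification.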
Custom Alert Rules
Beyond standard kube-prometheus-stack rules, custom PrometheusRules cover:
| Rule | Source | Severity |
|---|---|---|
| DockerhubRateLimitRisk | kube-prometheus-stack | - |
| OomKilled | kube-prometheus-stack | - |
| ZfsUnexpectedPoolState | kube-prometheus-stack | - |
| UPSOnBattery | SNMP exporter | critical |
| UPSReplaceBattery | SNMP exporter | critical |
| LanProbeFailed | Blackbox exporter | critical |
| SmartDevice* (6 rules) | smartctl-exporter | warning/critical |
| GatusEndpointDown | Gatus | critical |
| GatusEndpointExposed | Gatus | critical |
Noise Management: Silence Operator
The Silence Operator manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.
Active silences:
| Silence | Alert Suppressed | Reason |
|---|---|---|
| longhorn-node-storage-diskspace-warning | NodeDiskHighUtilization | Longhorn storage devices are intentionally high-utilization |
| node-root-diskspace-warning | NodeDiskHighUtilization | Root partition usage is expected |
| nas-memory-high-utilization | NodeMemoryHighUtilization | NAS (candlekeep) runs memory-intensive workloads by design |
| keda-hpa-maxed-out | KubeHpaMaxedOut | KEDA-managed HPAs scaling to max is normal behavior |
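For intuition, an Alertmanager silence suppresses an alert only when every matcher in the silence matches the alert's labels. The sketch below models that rule for the four matcher operators (`=`, `!=`, `=~`, `!~`), with regex matchers fully anchored as in Alertmanager. The `Matcher` class and `is_silenced` function are hypothetical names; the real matchers live in the Silence CRs and are not shown in this ADR.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    name: str
    op: str      # one of "=", "!=", "=~", "!~"
    value: str

    def matches(self, labels: dict) -> bool:
        actual = labels.get(self.name, "")  # a missing label behaves as ""
        if self.op == "=":
            return actual == self.value
        if self.op == "!=":
            return actual != self.value
        hit = re.fullmatch(self.value, actual) is not None  # anchored regex
        if self.op == "=~":
            return hit
        if self.op == "!~":
            return not hit
        raise ValueError(f"unknown operator: {self.op}")


def is_silenced(labels: dict, matchers: list) -> bool:
    """True only if the silence has matchers and all of them match."""
    return bool(matchers) and all(m.matches(labels) for m in matchers)
```

Narrow matchers matter here: a silence that matches only on `alertname` would also hide the same alert on hosts where it is not expected, so a second matcher on a host or workload label keeps the suppression scoped.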
Uptime Monitoring: Gatus
Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.
| Property | Value |
|---|---|
| Image | ghcr.io/twin/gatus:v5.34.0 |
| Status page | status.daviestechlabs.io (public) |
| Admin | gatus.daviestechlabs.io (public) |
Auto-discovery: A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.
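To make the auto-discovery idea concrete, the sketch below turns a list of discovered hostnames into Gatus endpoint entries. It does not reproduce the sidecar: the function name, the `group` value, the one-minute interval, and the health condition are all assumptions. The keys (`name`, `group`, `url`, `interval`, `conditions`) are standard Gatus endpoint fields.

```python
def gatus_endpoints(hostnames, group="auto-discovered", interval="1m"):
    """Build Gatus endpoint dicts for hostnames, de-duplicated and sorted."""
    endpoints = []
    for host in sorted(set(hostnames)):
        endpoints.append({
            "name": host.split(".")[0],          # assumed naming: first DNS label
            "group": group,                       # assumed group name
            "url": f"https://{host}",
            "interval": interval,
            "conditions": ["[STATUS] == 200"],    # assumed health condition
        })
    return endpoints
```

Sorting and de-duplicating keeps the generated configuration stable between runs, which avoids needless reloads when the set of routes has not changed.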
Manual endpoints:
- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: git.daviestechlabs.io
- Container registry: registry.lab.daviestechlabs.io
Alerting: Gatus sends failures to the gatus ntfy topic, which flows through the same ntfy → Discord pipeline.
PrometheusRules from Gatus metrics:
- GatusEndpointDown: external/service endpoint failure for 5 min → critical
- GatusEndpointExposed: internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
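Both rules hold for five minutes before firing. As a rough model of how a Prometheus `for:` clause behaves, the sketch below walks a series of rule evaluations and reports when the alert would move from pending to firing. The function name and the 300-second default are illustrative assumptions; the actual rule expressions are not shown in this ADR.

```python
from typing import Iterable, Optional, Tuple


def firing_since(evals: Iterable[Tuple[float, bool]],
                 hold_seconds: float = 300.0) -> Optional[float]:
    """Return the timestamp at which the alert first fires, or None.

    `evals` is a time-ordered sequence of (timestamp, condition_is_true).
    The alert becomes pending when the condition first holds and fires once
    it has held continuously for `hold_seconds`. Any false evaluation
    resets the pending state.
    """
    pending_since = None
    for ts, active in evals:
        if not active:
            pending_since = None
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= hold_seconds:
            return ts
    return None
```

The reset on a false evaluation is what keeps a briefly flapping endpoint from paging: a single successful probe restarts the five-minute clock.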
ntfy
| Property | Value |
|---|---|
| Image | binwiederhier/ntfy:v2.16.0 |
| URL | ntfy.daviestechlabs.io (public, Authentik SSO) |
| Storage | 5 Gi PVC (SQLite cache) |
Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app.
ntfy-discord Bridge
| Property | Value |
|---|---|
| Image | registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1 |
| Source | Custom Go service (in-repo: ntfy-discord/) |
Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.
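As a sketch of the per-message work such a bridge performs, the function below converts one ntfy message (the JSON event a subscriber receives) into a Discord webhook payload. It is written in Python for brevity and is not the bridge's Go source; the webhook URLs, the color choices, and the function name are hypothetical. The ntfy fields (`event`, `topic`, `title`, `message`, `priority`, `tags`) and the Discord `embeds` shape follow the public APIs.

```python
import json
from typing import Optional

# Hypothetical mapping of ntfy topic -> Discord webhook URL (placeholders).
WEBHOOKS = {
    "alertmanager-alerts": "https://discord.example/webhooks/alerts",
    "gatus": "https://discord.example/webhooks/gatus",
    "gitea-ci": "https://discord.example/webhooks/ci",
}

# Hypothetical embed colors keyed by ntfy priority (1=min .. 5=urgent).
COLORS = {5: 0xE74C3C, 4: 0xE67E22, 3: 0x3498DB, 2: 0x95A5A6, 1: 0x95A5A6}


def to_discord(raw: str) -> Optional[tuple]:
    """Map one ntfy JSON line to (webhook_url, payload), or None if skipped."""
    msg = json.loads(raw)
    if msg.get("event") != "message":   # skip "open" and "keepalive" events
        return None
    url = WEBHOOKS.get(msg.get("topic", ""))
    if url is None:                      # topic has no Discord channel
        return None
    priority = msg.get("priority", 3)    # treat a missing priority as default
    embed = {
        "title": (msg.get("title") or msg["topic"])[:256],  # Discord title limit
        "description": msg.get("message", "")[:4096],       # description limit
        "color": COLORS.get(priority, COLORS[3]),
    }
    tags = msg.get("tags") or []
    if tags:                             # only add a footer when tags exist
        embed["footer"] = {"text": ", ".join(tags)}
    return url, {"embeds": [embed]}
```

Keeping the conversion a pure function, separate from the HTTP subscribe and post steps, makes the topic-to-channel mapping easy to test without network access.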
Notification Flow Example
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message