homelab-design/decisions/0039-alerting-notification-pipeline.md

Alerting and Notification Pipeline

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

Decision Drivers

  • Critical alerts must reach the operator within seconds (mobile push + Discord)
  • Alert fatigue must be minimized — suppress known-noisy alerts declaratively
  • The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
  • Alert routing must be GitOps-managed and version-controlled
  • Uptime monitoring needs a public-facing status page

Considered Options

  1. Alertmanager → ntfy → ntfy-discord bridge with Silence Operator and Gatus
  2. Alertmanager → Discord webhook directly with manual silences
  3. Alertmanager → Grafana OnCall for incident management
  4. External SaaS (PagerDuty, Opsgenie)

Decision Outcome

Chosen option: Option 1 - Alertmanager → ntfy → ntfy-discord bridge with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

Positive Consequences

  • Fully self-hosted pipeline with no dependency on an alerting SaaS (Discord is a delivery target only)
  • ntfy provides mobile push without app-specific integrations
  • Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics
  • Silence rules are version-controlled Kubernetes resources
  • Gatus provides a public status page independent of the alerting pipeline

Negative Consequences

  • Custom bridge service (ntfy-discord) to maintain
  • ntfy is a single point of failure for notifications (persistent storage preserves cached messages across restarts, but does not cover downtime)
  • No built-in on-call rotation or escalation (acceptable for single operator)

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    ALERT SOURCES                              │
│                                                              │
│  PrometheusRules    Gatus Endpoint     Custom Webhooks       │
│  (metric alerts)    Monitors           (CI, etc.)            │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                         ntfy                                  │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ←  Alertmanager webhooks              │
│    gatus                ←  Gatus endpoint failures            │
│    gitea-ci             ←  CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    ntfy-discord bridge                         │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to: Discord webhooks (per-topic channels)          │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘

Component Details

Alertmanager Routing

Configured via AlertmanagerConfig in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|---|---|---|---|
| critical | urgent | rotating_light, alert | Immediate push + Discord |
| warning | high | warning | Push + Discord |
| All others | | | Routed to null-receiver (dropped) |

The webhook receiver posts to http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts, using Alertmanager template expansion to produce human-readable messages.
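The severity mapping in the table above can be sketched in Go as a small lookup that yields ntfy publish headers. This is only an illustration of the mapping, not the AlertmanagerConfig itself: the function names are hypothetical, and it assumes the priority and tags are passed as ntfy's `Priority`, `Tags`, and `Title` publish headers. The request is built but never sent.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// ntfyHeaders maps an alert severity to ntfy publish headers, following the
// routing table. The second result is false for severities that are routed
// to the null receiver and therefore dropped.
func ntfyHeaders(severity string) (map[string]string, bool) {
	switch severity {
	case "critical":
		return map[string]string{"Priority": "urgent", "Tags": "rotating_light,alert"}, true
	case "warning":
		return map[string]string{"Priority": "high", "Tags": "warning"}, true
	default:
		return nil, false
	}
}

// newPublishRequest builds (but does not send) an ntfy publish request for
// the given topic. It returns an error for severities that are dropped.
func newPublishRequest(baseURL, topic, severity, title, body string) (*http.Request, error) {
	headers, ok := ntfyHeaders(severity)
	if !ok {
		return nil, fmt.Errorf("severity %q is routed to the null receiver", severity)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/"+topic, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Title", title)
	for name, value := range headers {
		req.Header.Set(name, value)
	}
	return req, nil
}

func main() {
	req, err := newPublishRequest(
		"http://ntfy-svc.observability.svc.cluster.local",
		"alertmanager-alerts",
		"critical",
		"UPSOnBattery",                // alert name used as the title (assumption)
		"UPS is running on battery.", // hypothetical message text
	)
	if err != nil {
		fmt.Println("dropped:", err)
		return
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Priority"), req.Header.Get("Tags"))
}
```

Keeping the mapping in one function makes the "all others are dropped" rule explicit: any severity not listed simply has no headers and produces no request.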

Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom PrometheusRules cover:

| Rule | Source | Severity |
|---|---|---|
| DockerhubRateLimitRisk | kube-prometheus-stack | |
| OomKilled | kube-prometheus-stack | |
| ZfsUnexpectedPoolState | kube-prometheus-stack | |
| UPSOnBattery | SNMP exporter | critical |
| UPSReplaceBattery | SNMP exporter | critical |
| LanProbeFailed | Blackbox exporter | critical |
| SmartDevice* (6 rules) | smartctl-exporter | warning/critical |
| GatusEndpointDown | Gatus | critical |
| GatusEndpointExposed | Gatus | critical |
Noise Management: Silence Operator

The Silence Operator manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

Active silences:

| Silence | Alert Suppressed | Reason |
|---|---|---|
| longhorn-node-storage-diskspace-warning | NodeDiskHighUtilization | Longhorn storage devices are intentionally high-utilization |
| node-root-diskspace-warning | NodeDiskHighUtilization | Root partition usage is expected |
| nas-memory-high-utilization | NodeMemoryHighUtilization | NAS (candlekeep) runs memory-intensive workloads by design |
| keda-hpa-maxed-out | KubeHpaMaxedOut | KEDA-managed HPAs scaling to max is normal behavior |
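To illustrate how a silence such as those above suppresses an alert, here is a minimal Go sketch of Alertmanager-style matcher evaluation: a silence applies only when every one of its matchers accepts the alert's labels. The `Matcher` type is a simplified stand-in (negative matchers are omitted), and the label names and values in `main` are hypothetical, since the ADR does not list the actual matchers.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a simplified silence matcher: a label name compared against a
// literal value or a regular expression.
type Matcher struct {
	Name    string
	Value   string
	IsRegex bool
}

// matches reports whether this matcher accepts the alert's label set.
// Regex values are fully anchored, as label matchers are in Alertmanager.
func (m Matcher) matches(labels map[string]string) bool {
	got := labels[m.Name]
	if !m.IsRegex {
		return got == m.Value
	}
	re, err := regexp.Compile("^(?:" + m.Value + ")$")
	if err != nil {
		return false // an invalid pattern never matches in this sketch
	}
	return re.MatchString(got)
}

// silenced reports whether all matchers accept the alert (logical AND).
// An empty matcher list silences nothing.
func silenced(labels map[string]string, matchers []Matcher) bool {
	if len(matchers) == 0 {
		return false
	}
	for _, m := range matchers {
		if !m.matches(labels) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical matchers for a disk-utilization silence.
	silence := []Matcher{
		{Name: "alertname", Value: "NodeDiskHighUtilization"},
		{Name: "mountpoint", Value: "/var/lib/longhorn.*", IsRegex: true},
	}
	alert := map[string]string{
		"alertname":  "NodeDiskHighUtilization",
		"mountpoint": "/var/lib/longhorn/disk1",
	}
	fmt.Println("silenced:", silenced(alert, silence))
}
```

Because matchers are ANDed, two silences can target the same alert name (as the two NodeDiskHighUtilization entries do) while differing in the additional labels they require.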

Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.

  • Image: ghcr.io/twin/gatus:v5.34.0
  • Status page: status.daviestechlabs.io (public)
  • Admin: gatus.daviestechlabs.io (public)

Auto-discovery: A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.
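The ADR does not describe how the sidecar works internally. As a rough sketch of the idea only, the Go snippet below turns a list of discovered hostnames into endpoint entries shaped like Gatus endpoint configuration (name, group, URL, interval, conditions). The `Endpoint` type, the `auto-discovered` group, the one-minute interval, and the `[STATUS] == 200` condition are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint holds a subset of the fields of a Gatus endpoint entry.
type Endpoint struct {
	Name       string
	Group      string
	URL        string
	Interval   string
	Conditions []string
}

// endpointsFor produces one endpoint per discovered hostname. Hostnames are
// de-duplicated and sorted so that regenerated output stays stable.
func endpointsFor(hostnames []string) []Endpoint {
	seen := make(map[string]bool)
	var unique []string
	for _, h := range hostnames {
		if h != "" && !seen[h] {
			seen[h] = true
			unique = append(unique, h)
		}
	}
	sort.Strings(unique)

	out := make([]Endpoint, 0, len(unique))
	for _, h := range unique {
		out = append(out, Endpoint{
			Name:       h,
			Group:      "auto-discovered", // hypothetical group name
			URL:        "https://" + h,
			Interval:   "1m",                         // assumed check interval
			Conditions: []string{"[STATUS] == 200"}, // assumed health condition
		})
	}
	return out
}

func main() {
	hosts := []string{"ntfy.daviestechlabs.io", "git.daviestechlabs.io", "ntfy.daviestechlabs.io"}
	for _, e := range endpointsFor(hosts) {
		fmt.Println(e.Name, e.URL, e.Conditions)
	}
}
```

Sorting and de-duplicating keeps the generated configuration deterministic, which matters if the output is compared against the previous run before reloading Gatus.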

Manual endpoints:

  • Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
  • Gitea: git.daviestechlabs.io
  • Container registry: registry.lab.daviestechlabs.io

Alerting: Gatus sends failures to the gatus ntfy topic, which flows through the same ntfy → Discord pipeline.

PrometheusRules from Gatus metrics:

  • GatusEndpointDown — external/service endpoint failure for 5 min → critical
  • GatusEndpointExposed — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)

ntfy

  • Image: binwiederhier/ntfy:v2.16.0
  • URL: ntfy.daviestechlabs.io (public, Authentik SSO)
  • Storage: 5 Gi PVC (SQLite cache)

ntfy serves as the central notification hub. It is protected by Authentik forward-auth via Envoy Gateway, receives webhooks from Alertmanager and Gatus, and delivers push notifications to the ntfy mobile app.

ntfy-discord Bridge

  • Image: registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1
  • Source: Custom Go service (in-repo: ntfy-discord/)

The bridge subscribes to ntfy topics and forwards each notification to a Discord webhook, with one Discord channel/webhook per topic. It exposes Prometheus metrics, scraped via a PodMonitor.
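The core of such a bridge can be sketched in a few lines of Go. This is not the actual ntfy-discord source: it is a minimal illustration assuming the bridge reads ntfy's JSON stream (one JSON event per line) and posts a Discord webhook body of the form `{"content": "..."}`. The `route` function, the type names, and the webhook URL are hypothetical, and the sketch performs no network I/O.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ntfyEvent holds the fields of an ntfy JSON stream event used by this sketch.
type ntfyEvent struct {
	Event   string `json:"event"` // "open", "keepalive", or "message"
	Topic   string `json:"topic"`
	Title   string `json:"title"`
	Message string `json:"message"`
}

// discordPayload is a minimal Discord webhook body: plain message content.
type discordPayload struct {
	Content string `json:"content"`
}

// route parses one line of the ntfy stream and returns the Discord webhook
// URL and JSON body to post. ok is false for non-message events and for
// topics that have no webhook configured.
func route(line []byte, webhooks map[string]string) (string, []byte, bool, error) {
	var ev ntfyEvent
	if err := json.Unmarshal(line, &ev); err != nil {
		return "", nil, false, err
	}
	if ev.Event != "message" {
		return "", nil, false, nil // skip open and keepalive events
	}
	url, found := webhooks[ev.Topic]
	if !found {
		return "", nil, false, nil
	}
	content := ev.Message
	if ev.Title != "" {
		content = "**" + ev.Title + "**\n" + ev.Message
	}
	body, err := json.Marshal(discordPayload{Content: content})
	if err != nil {
		return "", nil, false, err
	}
	return url, body, true, nil
}

func main() {
	// Hypothetical per-topic webhook mapping.
	webhooks := map[string]string{"gatus": "https://discord.example/api/webhooks/gatus"}
	line := []byte(`{"event":"message","topic":"gatus","title":"Endpoint down","message":"git.daviestechlabs.io failed its check"}`)

	url, body, ok, err := route(line, webhooks)
	fmt.Println(url, string(body), ok, err)
}
```

In a running service, the lines would typically come from a `bufio.Scanner` over the response body of a long-lived GET on the ntfy JSON endpoint (ntfy accepts a comma-separated topic list in the path), and the returned body would be sent with an HTTP POST using `Content-Type: application/json`. Keeping `route` free of I/O makes the topic-to-channel mapping easy to unit test.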

Notification Flow Example

1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message