# Alerting and Notification Pipeline

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

## Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

## Decision Drivers

* Critical alerts must reach the operator within seconds (mobile push + Discord)
* Alert fatigue must be minimized: suppress known-noisy alerts declaratively
* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
* Alert routing must be GitOps-managed and version-controlled
* Uptime monitoring needs a public-facing status page

## Considered Options

1. **Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus
2. **Alertmanager → Discord webhook directly** with manual silences
3. **Alertmanager → Grafana OnCall** for incident management
4. **External SaaS (PagerDuty, Opsgenie)**

## Decision Outcome

Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

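To illustrate the decoupling described above, here is a minimal Go sketch of what a producer has to do to publish to a ntfy topic: a single HTTP POST with the message as the body and ntfy's `Title`, `Priority`, and `Tags` headers. The `Notification` struct and `newPublishRequest` helper are hypothetical names introduced for this example; they are not part of ntfy, Alertmanager, or the ntfy-discord bridge, and the alert text is made up. The sketch only builds and prints the request rather than sending it, so it runs outside the cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Notification is a hypothetical type for this sketch; it is not part of ntfy.
type Notification struct {
	Title    string
	Message  string
	Priority string   // ntfy priority name, e.g. "urgent" or "high"
	Tags     []string // ntfy tags, e.g. "rotating_light"
}

// newPublishRequest builds the HTTP POST that would publish n to a ntfy topic.
// The message is sent as the request body; metadata travels in headers.
func newPublishRequest(baseURL, topic string, n Notification) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/" + topic
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(n.Message))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Title", n.Title)
	req.Header.Set("Priority", n.Priority)
	if len(n.Tags) > 0 {
		req.Header.Set("Tags", strings.Join(n.Tags, ","))
	}
	return req, nil
}

func main() {
	req, err := newPublishRequest(
		"http://ntfy-svc.observability.svc.cluster.local",
		"alertmanager-alerts",
		Notification{
			Title:    "UPSOnBattery",
			Message:  "UPS is running on battery", // illustrative text only
			Priority: "urgent",
			Tags:     []string{"rotating_light"},
		},
	)
	if err != nil {
		panic(err)
	}
	// Print the request instead of sending it, so the sketch has no side effects.
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("Priority:", req.Header.Get("Priority"))
	fmt.Println("Tags:", req.Header.Get("Tags"))
}
```

Passing the request to `http.DefaultClient.Do` would actually publish it. Because consumers (the mobile app, the Discord bridge) only subscribe to topics, a new producer of this shape needs no knowledge of where notifications end up.
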
### Positive Consequences

* Fully self-hosted, no external dependencies
* ntfy provides mobile push without app-specific integrations
* Decoupled architecture: adding new notification targets only requires subscribing to ntfy topics
* Silence rules are version-controlled Kubernetes resources
* Gatus provides a public status page independent of the alerting pipeline

### Negative Consequences

* Custom bridge service (ntfy-discord) to maintain
* ntfy is a single point of failure for notifications (mitigated by persistent storage)
* No built-in on-call rotation or escalation (acceptable for single operator)

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        ALERT SOURCES                         │
│                                                              │
│  PrometheusRules  Gatus Endpoint      Custom Webhooks        │
│  (metric alerts)  Monitors            (CI, etc.)             │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                             ntfy                             │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ← Alertmanager webhooks              │
│    gatus                ← Gatus endpoint failures            │
│    gitea-ci             ← CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                     ntfy-discord bridge                      │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to: Discord webhooks (per-topic channels)          │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
```

## Component Details

### Alertmanager Routing

Configured via `AlertmanagerConfig` in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|----------|---------------|------|----------|
| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord |
| `warning` | high | `warning` | Push + Discord |
| All others | - | - | Routed to `null-receiver` (dropped) |

The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages.

### Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover:

| Rule | Source | Severity |
|------|--------|----------|
| `DockerhubRateLimitRisk` | kube-prometheus-stack | - |
| `OomKilled` | kube-prometheus-stack | - |
| `ZfsUnexpectedPoolState` | kube-prometheus-stack | - |
| `UPSOnBattery` | SNMP exporter | critical |
| `UPSReplaceBattery` | SNMP exporter | critical |
| `LanProbeFailed` | Blackbox exporter | critical |
| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical |
| `GatusEndpointDown` | Gatus | critical |
| `GatusEndpointExposed` | Gatus | critical |

### Noise Management: Silence Operator

The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

**Active silences:**

| Silence | Alert Suppressed | Reason |
|---------|------------------|--------|
| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization |
| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected |
| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design |
| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior |

### Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.

| | |
|---|---|
| **Image** | `ghcr.io/twin/gatus:v5.34.0` |
| **Status page** | `status.daviestechlabs.io` (public) |
| **Admin** | `gatus.daviestechlabs.io` (public) |

**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.

**Manual endpoints:**

- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: `git.daviestechlabs.io`
- Container registry: `registry.lab.daviestechlabs.io`

**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline.

**PrometheusRules from Gatus metrics:**

- `GatusEndpointDown`: external/service endpoint failure for 5 min → critical
- `GatusEndpointExposed`: internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)

### ntfy

| | |
|---|---|
| **Image** | `binwiederhier/ntfy:v2.16.0` |
| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) |
| **Storage** | 5 Gi PVC (SQLite cache) |

Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app.

### ntfy-discord Bridge

| | |
|---|---|
| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` |
| **Source** | Custom Go service (in-repo: `ntfy-discord/`) |

Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.

## Notification Flow Example

```
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
```

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md)
* Related to [ADR-0021](0021-notification-architecture.md)
* Related to [ADR-0022](0022-ntfy-discord-bridge.md)