Alerting and Notification Pipeline

Status: accepted
Date: 2026-02-09
Deciders: Billy
Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

Decision Drivers

Critical alerts must reach the operator within seconds (mobile push + Discord)
Alert fatigue must be minimized — suppress known-noisy alerts declaratively
The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
Alert routing must be GitOps-managed and version-controlled
Uptime monitoring needs a public-facing status page

Considered Options

Alertmanager → ntfy → ntfy-discord bridge with Silence Operator and Gatus
Alertmanager → Discord webhook directly with manual silences
Alertmanager → Grafana OnCall for incident management
External SaaS (PagerDuty, Opsgenie)

Decision Outcome

Chosen option: Option 1 - Alertmanager → ntfy → ntfy-discord bridge with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

Positive Consequences

Fully self-hosted pipeline with no SaaS alerting dependency (Discord is only a delivery target)
ntfy provides mobile push without app-specific integrations
Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics
Silence rules are version-controlled Kubernetes resources
Gatus provides a public status page independent of the alerting pipeline

Negative Consequences

Custom bridge service (ntfy-discord) to maintain
ntfy is a single point of failure for notifications (persistent storage preserves cached messages across restarts but does not provide redundancy)
No built-in on-call rotation or escalation (acceptable for single operator)

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    ALERT SOURCES                              │
│                                                              │
│  PrometheusRules    Gatus Endpoint     Custom Webhooks       │
│  (metric alerts)    Monitors           (CI, etc.)            │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                         ntfy                                  │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ←  Alertmanager webhooks              │
│    gatus                ←  Gatus endpoint failures            │
│    gitea-ci             ←  CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    ntfy-discord bridge                         │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to: Discord webhooks (per-topic channels)          │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘

Component Details

Alertmanager Routing

Configured via AlertmanagerConfig in kube-prometheus-stack:

Severity	ntfy Priority	Tags	Behavior
`critical`	urgent	`rotating_light`, `alert`	Immediate push + Discord
`warning`	high	`warning`	Push + Discord
All others	—	—	Routed to `null-receiver` (dropped)

The webhook receiver posts to http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts, using Alertmanager template expansion to produce human-readable messages.
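The routing table above can be sketched as a small lookup. This is a minimal illustration in Python, not the actual AlertmanagerConfig: the names `route` and `build_ntfy_request` are hypothetical, and the use of ntfy's `Title`, `Priority`, and `Tags` publish headers is an assumption about how the templated webhook output could be shaped.

```python
import urllib.request

# Service URL taken from this ADR.
NTFY_URL = "http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts"

# Severity -> (ntfy priority, tags), mirroring the routing table.
ROUTES = {
    "critical": ("urgent", ["rotating_light", "alert"]),
    "warning": ("high", ["warning"]),
}


def route(severity):
    """Return (priority, tags) for a severity, or None for the null receiver."""
    return ROUTES.get(severity)


def build_ntfy_request(alertname, severity, summary):
    """Build (but do not send) an ntfy publish request; None means dropped."""
    routed = route(severity)
    if routed is None:
        return None
    priority, tags = routed
    return urllib.request.Request(
        NTFY_URL,
        data=summary.encode("utf-8"),
        method="POST",
        headers={
            "Title": f"[{severity.upper()}] {alertname}",  # assumed title format
            "Priority": priority,
            "Tags": ",".join(tags),
        },
    )
```

Building the request without sending it keeps the sketch side-effect free; passing the result to `urllib.request.urlopen` would perform the actual publish.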

Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom PrometheusRules cover:

Rule	Source	Severity
`DockerhubRateLimitRisk`	kube-prometheus-stack	—
`OomKilled`	kube-prometheus-stack	—
`ZfsUnexpectedPoolState`	kube-prometheus-stack	—
`UPSOnBattery`	SNMP exporter	critical
`UPSReplaceBattery`	SNMP exporter	critical
`LanProbeFailed`	Blackbox exporter	critical
`SmartDevice*` (6 rules)	smartctl-exporter	warning/critical
`GatusEndpointDown`	Gatus	critical
`GatusEndpointExposed`	Gatus	critical

Noise Management: Silence Operator

The Silence Operator manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

Active silences:

Silence	Alert Suppressed	Reason
`longhorn-node-storage-diskspace-warning`	`NodeDiskHighUtilization`	Longhorn storage devices are intentionally high-utilization
`node-root-diskspace-warning`	`NodeDiskHighUtilization`	Root partition usage is expected
`nas-memory-high-utilization`	`NodeMemoryHighUtilization`	NAS (candlekeep) runs memory-intensive workloads by design
`keda-hpa-maxed-out`	`KubeHpaMaxedOut`	KEDA-managed HPAs scaling to max is normal behavior
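How a silence suppresses an alert can be sketched as follows. This is a simplified Python model of Alertmanager matcher semantics (a silence applies when all of its matchers match the alert's labels; regex matchers are fully anchored), not the Silence Operator's code. The `alertname` values come from the table above, while the `instance` matcher and its value are hypothetical, added only to show a multi-matcher silence.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    name: str
    value: str
    is_regex: bool = False

    def matches(self, labels):
        # A missing label is treated as the empty string.
        actual = labels.get(self.name, "")
        if self.is_regex:
            return re.fullmatch(self.value, actual) is not None
        return actual == self.value


# Silence name -> matchers. The instance matcher is an illustrative assumption.
SILENCES = {
    "keda-hpa-maxed-out": [Matcher("alertname", "KubeHpaMaxedOut")],
    "nas-memory-high-utilization": [
        Matcher("alertname", "NodeMemoryHighUtilization"),
        Matcher("instance", "candlekeep.*", is_regex=True),
    ],
}


def is_silenced(labels, silences=SILENCES):
    """Return the name of the first silence whose matchers all match, else None."""
    for name, matchers in silences.items():
        if all(m.matches(labels) for m in matchers):
            return name
    return None
```

Scoping a silence with more than one matcher, as in the second entry, keeps the same alert active for hosts where high utilization is not expected.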

Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.


Setting	Value
Image	`ghcr.io/twin/gatus:v5.34.0`
Status page	`status.daviestechlabs.io` (public)
Admin	`gatus.daviestechlabs.io` (public)

Auto-discovery: A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.
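The auto-discovery step can be pictured as a mapping from discovered hostnames to Gatus endpoint entries. The sketch below is a hedged Python illustration, not the sidecar's implementation: `endpoints_from_hostnames` is a hypothetical helper, and the 1-minute interval and `[STATUS] == 200` condition are assumed defaults. The field names (`name`, `url`, `interval`, `conditions`) follow Gatus's endpoint configuration.

```python
def endpoints_from_hostnames(hostnames, interval="1m"):
    """Turn discovered hostnames into Gatus-style endpoint dictionaries.

    Duplicates are removed and output is sorted so regenerated config is stable.
    """
    endpoints = []
    for host in sorted(set(hostnames)):
        endpoints.append(
            {
                "name": host,
                "url": f"https://{host}",
                "interval": interval,
                # Assumed health condition; real checks may differ per service.
                "conditions": ["[STATUS] == 200"],
            }
        )
    return endpoints
```

Sorting and de-duplicating the output means repeated discovery runs over the same cluster state produce identical configuration.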

Manual endpoints:

Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
Gitea: git.daviestechlabs.io
Container registry: registry.lab.daviestechlabs.io

Alerting: Gatus sends failures to the gatus ntfy topic, which flows through the same ntfy → Discord pipeline.

PrometheusRules from Gatus metrics:

GatusEndpointDown — external/service endpoint failure for 5 min → critical
GatusEndpointExposed — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
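The "for 5 min" qualifier on both rules means the condition must hold continuously before the alert fires. A rough Python model of that pending behavior is shown below; `fires` is a hypothetical helper and the sample format is an assumption, so this approximates the semantics of a Prometheus `for:` clause rather than reproducing its evaluator.

```python
def fires(samples, for_seconds=300):
    """Decide whether an alert is firing at the time of the last sample.

    samples: list of (timestamp_seconds, condition_is_true) in time order.
    The alert fires only if the condition has been true at every sample
    for at least for_seconds up to the last one.
    """
    pending_since = None
    for ts, failing in samples:
        if failing:
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any recovery resets the timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_seconds
```

A brief recovery resets the timer, which is what keeps short blips from paging the operator.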

ntfy


Setting	Value
Image	`binwiederhier/ntfy:v2.16.0`
URL	`ntfy.daviestechlabs.io` (public, Authentik SSO)
Storage	5 Gi PVC (SQLite cache)

ntfy serves as the central notification hub and is protected by Authentik forward-auth via Envoy Gateway. It receives webhooks from Alertmanager and Gatus and delivers push notifications to the ntfy mobile app.

ntfy-discord Bridge


Setting	Value
Image	`registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1`
Source	Custom Go service (in-repo: `ntfy-discord/`)

Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.
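The topic-to-webhook forwarding can be illustrated with a short sketch. The real bridge is a Go service whose code is not shown in this ADR; the Python below is only a model of the idea. The webhook URLs are placeholders, the color choices and `to_discord` helper are hypothetical, and the message fields (`event`, `topic`, `title`, `message`, `priority`) follow ntfy's JSON stream format.

```python
import json

# Placeholder webhook URLs; real ones would come from a secret.
WEBHOOKS = {
    "alertmanager-alerts": "https://discord.example/webhooks/alerts",
    "gatus": "https://discord.example/webhooks/status",
    "gitea-ci": "https://discord.example/webhooks/ci",
}

# ntfy priority (1-5) -> Discord embed color (illustrative choice).
PRIORITY_COLORS = {5: 0xE74C3C, 4: 0xE67E22, 3: 0x3498DB}
DEFAULT_COLOR = 0x95A5A6


def to_discord(raw_line):
    """Convert one line of ntfy's JSON stream into (webhook_url, payload).

    Returns None for non-message events and for topics with no webhook.
    """
    msg = json.loads(raw_line)
    if msg.get("event") != "message":
        return None  # skip "open" and "keepalive" events
    url = WEBHOOKS.get(msg.get("topic"))
    if url is None:
        return None
    priority = msg.get("priority", 3)  # ntfy omits the default priority
    embed = {
        "title": msg.get("title") or msg["topic"],
        "description": msg.get("message", ""),
        "color": PRIORITY_COLORS.get(priority, DEFAULT_COLOR),
    }
    return url, {"embeds": [embed]}
```

Keeping the conversion a pure function separates it from the network code, so the mapping can be checked without contacting ntfy or Discord.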

Notification Flow Example

1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
