docs: add ADR-0038/0039 and replace llm-workflows references with decomposed repos
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
- ADR-0038: Infrastructure metrics collection (smartctl, SNMP, blackbox, unpoller)
- ADR-0039: Alerting and notification pipeline (Alertmanager → ntfy → Discord)
- Replace llm-workflows GitHub links with Gitea daviestechlabs org repos
- Update AGENT-ONBOARDING.md: remove llm-workflows from file tree, add missing repos
- Update ADR-0006: fix multi-repo reference
- Update ADR-0009: fix broken llm-workflows link
- Update ADR-0024: mark ray-serve repo as created, update historical context
- Update README: fix ADR-0016 status, add 0038/0039 to table, update badges
197
decisions/0039-alerting-notification-pipeline.md
Normal file
@@ -0,0 +1,197 @@
# Alerting and Notification Pipeline

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

## Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

## Decision Drivers

* Critical alerts must reach the operator within seconds (mobile push + Discord)
* Alert fatigue must be minimized: suppress known-noisy alerts declaratively
* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
* Alert routing must be GitOps-managed and version-controlled
* Uptime monitoring needs a public-facing status page

## Considered Options

1. **Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus
2. **Alertmanager → Discord webhook directly** with manual silences
3. **Alertmanager → Grafana OnCall** for incident management
4. **External SaaS (PagerDuty, Opsgenie)**

## Decision Outcome

Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

### Positive Consequences

* Fully self-hosted, no external dependencies
* ntfy provides mobile push without app-specific integrations
* Decoupled architecture: adding new notification targets only requires subscribing to ntfy topics
* Silence rules are version-controlled Kubernetes resources
* Gatus provides a public status page independent of the alerting pipeline

### Negative Consequences

* Custom bridge service (ntfy-discord) to maintain
* ntfy is a single point of failure for notifications (mitigated by persistent storage)
* No built-in on-call rotation or escalation (acceptable for single operator)

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        ALERT SOURCES                         │
│                                                              │
│ PrometheusRules  Gatus Endpoint       Custom Webhooks        │
│ (metric alerts)     Monitors            (CI, etc.)           │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                             ntfy                             │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ← Alertmanager webhooks              │
│    gatus                ← Gatus endpoint failures            │
│    gitea-ci             ← CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                     ntfy-discord bridge                      │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to: Discord webhooks (per-topic channels)          │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
```

## Component Details

### Alertmanager Routing

Configured via `AlertmanagerConfig` in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|----------|---------------|------|----------|
| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord |
| `warning` | high | `warning` | Push + Discord |
| All others | — | — | Routed to `null-receiver` (dropped) |

The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages.
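
The routing table above can be read as a small lookup. The following Python sketch is only an illustration of that decision logic, not the actual `AlertmanagerConfig`; the names `ROUTES` and `route_alert` are hypothetical.

```python
from typing import Optional

# Severity -> ntfy priority and tags, mirroring the routing table above.
ROUTES = {
    "critical": {"priority": "urgent", "tags": ["rotating_light", "alert"]},
    "warning": {"priority": "high", "tags": ["warning"]},
}


def route_alert(labels: dict) -> Optional[dict]:
    """Return the ntfy topic, priority, and tags for an alert.

    Returns None when the severity has no route, which corresponds to
    the null-receiver (the alert is dropped).
    """
    route = ROUTES.get(labels.get("severity", ""))
    if route is None:
        return None
    return {"topic": "alertmanager-alerts", **route}
```

For example, `route_alert({"severity": "warning"})` yields the `high` priority entry, while an alert labeled `severity: info` returns `None`.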

### Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover:

| Rule | Source | Severity |
|------|--------|----------|
| `DockerhubRateLimitRisk` | kube-prometheus-stack | — |
| `OomKilled` | kube-prometheus-stack | — |
| `ZfsUnexpectedPoolState` | kube-prometheus-stack | — |
| `UPSOnBattery` | SNMP exporter | critical |
| `UPSReplaceBattery` | SNMP exporter | critical |
| `LanProbeFailed` | Blackbox exporter | critical |
| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical |
| `GatusEndpointDown` | Gatus | critical |
| `GatusEndpointExposed` | Gatus | critical |

### Noise Management: Silence Operator

The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

**Active silences:**

| Silence | Alert Suppressed | Reason |
|---------|------------------|--------|
| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization |
| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected |
| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design |
| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior |
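
A silence suppresses an alert when every one of its matchers matches the alert's labels. The Python sketch below illustrates that rule under stated assumptions: the `Matcher` class and `is_silenced` function are hypothetical, and the `instance` pattern for the NAS silence is a guess for illustration, since the actual matchers are not listed in this ADR.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    name: str
    value: str
    is_regex: bool = False

    def matches(self, labels: dict) -> bool:
        actual = labels.get(self.name, "")
        if self.is_regex:
            # Regex matchers are treated as fully anchored.
            return re.fullmatch(self.value, actual) is not None
        return actual == self.value


# Silence name -> matchers. Only the alertname values come from the table above;
# the instance pattern is an assumption.
SILENCES = {
    "keda-hpa-maxed-out": [Matcher("alertname", "KubeHpaMaxedOut")],
    "nas-memory-high-utilization": [
        Matcher("alertname", "NodeMemoryHighUtilization"),
        Matcher("instance", "candlekeep.*", is_regex=True),
    ],
}


def is_silenced(labels: dict, silences: dict) -> bool:
    """True if any silence has a non-empty matcher list that fully matches."""
    return any(
        bool(matchers) and all(m.matches(labels) for m in matchers)
        for matchers in silences.values()
    )
```

Scoping a silence with more than one matcher is what lets two silences target the same alert name (as with `NodeDiskHighUtilization` above) without suppressing it everywhere.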

### Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.

| | |
|---|---|
| **Image** | `ghcr.io/twin/gatus:v5.34.0` |
| **Status page** | `status.daviestechlabs.io` (public) |
| **Admin** | `gatus.daviestechlabs.io` (public) |

**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.

**Manual endpoints:**
- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: `git.daviestechlabs.io`
- Container registry: `registry.lab.daviestechlabs.io`

**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline.

**PrometheusRules from Gatus metrics:**
- `GatusEndpointDown`: external/service endpoint failure for 5 min → critical
- `GatusEndpointExposed`: internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
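
Both rules above only fire after the condition has held for 5 minutes. The Python sketch below illustrates that hold-duration behavior in general terms (a condition is "pending" until it has been continuously true for the hold period); it is not the rule definition itself, and the `PendingAlert` class is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Optional


class PendingAlert:
    """Tracks one alert condition across periodic evaluations."""

    def __init__(self, hold: timedelta):
        self.hold = hold
        self.active_since: Optional[datetime] = None

    def evaluate(self, condition: bool, now: datetime) -> str:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold:
            return "firing"
        return "pending"
```

A brief endpoint blip that recovers before the 5 minutes elapse therefore never reaches the `firing` state, which keeps transient failures out of the notification path.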

### ntfy

| | |
|---|---|
| **Image** | `binwiederhier/ntfy:v2.16.0` |
| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) |
| **Storage** | 5 Gi PVC (SQLite cache) |

Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app.
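
Any producer can publish to a topic with a plain HTTP POST, using ntfy's `Title`, `Priority`, and `Tags` headers. The Python sketch below builds such a request against the in-cluster service URL mentioned earlier; the function name `build_publish_request` is hypothetical, and the sketch assumes an ASCII title and no authentication on the in-cluster path.

```python
import urllib.request

# In-cluster ntfy service URL, as used by the Alertmanager webhook.
NTFY_BASE = "http://ntfy-svc.observability.svc.cluster.local"


def build_publish_request(topic, message, title, priority="default", tags=()):
    """Build (but do not send) an ntfy publish request for a topic."""
    headers = {"Title": title, "Priority": priority}
    if tags:
        # ntfy expects tags as a comma-separated list.
        headers["Tags"] = ",".join(tags)
    return urllib.request.Request(
        url=f"{NTFY_BASE}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=5)` would deliver the message; a new producer needs nothing beyond the topic name.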

### ntfy-discord Bridge

| | |
|---|---|
| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` |
| **Source** | Custom Go service (in-repo: `ntfy-discord/`) |

Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.
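
The real bridge is a Go service whose internals are not shown in this ADR. As a rough illustration of the topic-to-webhook mapping only, the Python sketch below converts one line of ntfy's JSON stream into a Discord webhook payload. The webhook URLs are placeholders, and `WEBHOOKS` and `to_discord` are hypothetical names.

```python
import json
from typing import Optional, Tuple

# Placeholder topic -> Discord webhook URL map; real URLs would be secrets.
WEBHOOKS = {
    "alertmanager-alerts": "https://discord.invalid/webhooks/alerts",
    "gatus": "https://discord.invalid/webhooks/gatus",
    "gitea-ci": "https://discord.invalid/webhooks/ci",
}

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects longer message content


def to_discord(line: str) -> Optional[Tuple[str, dict]]:
    """Map one ntfy JSON stream line to (webhook_url, payload), or None."""
    event = json.loads(line)
    if event.get("event") != "message":
        return None  # skip "open" and "keepalive" stream events
    webhook = WEBHOOKS.get(event.get("topic"))
    if webhook is None:
        return None  # topic has no Discord channel configured
    title = event.get("title") or event["topic"]
    content = f"**{title}**\n{event.get('message', '')}"
    return webhook, {"content": content[:DISCORD_CONTENT_LIMIT]}
```

The returned payload would then be POSTed as JSON to the webhook URL. Keeping the conversion separate from the network calls makes it easy to test without ntfy or Discord.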

## Notification Flow Example

```
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
```

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md)
* Related to [ADR-0021](0021-notification-architecture.md)
* Related to [ADR-0022](0022-ntfy-discord-bridge.md)