# Alerting and Notification Pipeline

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

## Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

## Decision Drivers

* Critical alerts must reach the operator within seconds (mobile push + Discord)
* Alert fatigue must be minimized by suppressing known-noisy alerts declaratively
* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
* Alert routing must be GitOps-managed and version-controlled
* Uptime monitoring needs a public-facing status page

## Considered Options

1. **Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus
2. **Alertmanager → Discord webhook directly** with manual silences
3. **Alertmanager → Grafana OnCall** for incident management
4. **External SaaS (PagerDuty, Opsgenie)**

## Decision Outcome

Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

### Positive Consequences

* Fully self-hosted alerting path with no SaaS paging dependency; Discord is only a delivery target
* ntfy provides mobile push without app-specific integrations
* Decoupled architecture: adding a notification target requires only a new subscription to the relevant ntfy topics
* Silence rules are version-controlled Kubernetes resources
* Gatus provides a public status page independent of the alerting pipeline

### Negative Consequences

* Custom bridge service (ntfy-discord) to maintain
* ntfy is a single point of failure for notifications (partially mitigated by persistent storage, which preserves cached messages across restarts)
* No built-in on-call rotation or escalation (acceptable for single operator)

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    ALERT SOURCES                              │
│                                                              │
│  PrometheusRules    Gatus Endpoint     Custom Webhooks       │
│  (metric alerts)    Monitors           (CI, etc.)            │
│        │                │                    │               │
└────────┼────────────────┼────────────────────┼───────────────┘
         │                │                    │
         ▼                │                    │
┌─────────────────┐       │                    │
│  Alertmanager   │       │                    │
│                 │       │                    │
│  Routes by      │       │                    │
│  severity:      │       │                    │
│  critical→urgent│       │                    │
│  warning→high   │       │                    │
│  default→null   │       │                    │
└────────┬────────┘       │                    │
         │                │                    │
         │   ┌────────────┘                    │
         ▼   ▼                                 ▼
┌──────────────────────────────────────────────────────────────┐
│                         ntfy                                  │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ←  Alertmanager webhooks              │
│    gatus                ←  Gatus endpoint failures            │
│    gitea-ci             ←  CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    ntfy-discord bridge                         │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to: Discord webhooks (per-topic channels)          │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
```

## Component Details

### Alertmanager Routing

Configured via `AlertmanagerConfig` in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|----------|---------------|------|----------|
| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord |
| `warning` | high | `warning` | Push + Discord |
| All others | — | — | Routed to `null-receiver` (dropped) |

The webhook receiver posts to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts`, using Alertmanager template expansion to produce human-readable messages.
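
The routing table above can be sketched as a small lookup. This is a minimal illustration rather than the actual `AlertmanagerConfig`: the function name `ntfy_headers_for` is hypothetical, and it assumes the priority and tags reach ntfy as its `Title`, `Priority`, and `Tags` publish headers.

```python
from typing import Optional

# Severity -> (ntfy priority, ntfy tags), mirroring the routing table.
# Any severity not listed here falls through to the null receiver.
ROUTES = {
    "critical": ("urgent", ["rotating_light", "alert"]),
    "warning": ("high", ["warning"]),
}


def ntfy_headers_for(labels: dict) -> Optional[dict]:
    """Return ntfy publish headers for an alert, or None if it is dropped."""
    route = ROUTES.get(labels.get("severity", ""))
    if route is None:
        return None  # equivalent of routing to null-receiver
    priority, tags = route
    return {
        "Title": labels.get("alertname", "alert"),
        "Priority": priority,
        "Tags": ",".join(tags),
    }
```

An alert without a `severity` label is dropped in this sketch, which matches the "All others" row of the table.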

### Custom Alert Rules

Beyond the standard kube-prometheus-stack rules, custom `PrometheusRule` resources cover:

| Rule | Source | Severity |
|------|--------|----------|
| `DockerhubRateLimitRisk` | kube-prometheus-stack | — |
| `OomKilled` | kube-prometheus-stack | — |
| `ZfsUnexpectedPoolState` | kube-prometheus-stack | — |
| `UPSOnBattery` | SNMP exporter | critical |
| `UPSReplaceBattery` | SNMP exporter | critical |
| `LanProbeFailed` | Blackbox exporter | critical |
| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical |
| `GatusEndpointDown` | Gatus | critical |
| `GatusEndpointExposed` | Gatus | critical |

### Noise Management: Silence Operator

The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

**Active silences:**

| Silence | Alert Suppressed | Reason |
|---------|------------------|--------|
| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization |
| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected |
| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design |
| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior |
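
How a silence decides whether it applies to an alert can be sketched as follows. The matcher fields (`name`, `value`, `isRegex`, `isEqual`) follow the Alertmanager silence API, but the concrete matchers in `root_disk` are hypothetical, since the actual label selectors of the silences above are not listed here. Python's `re` module stands in for Alertmanager's RE2 engine, so only simple patterns behave identically.

```python
import re


def matcher_matches(matcher: dict, labels: dict) -> bool:
    """Evaluate one silence matcher against an alert's labels."""
    actual = labels.get(matcher["name"], "")  # a missing label compares as ""
    if matcher.get("isRegex", False):
        # Alertmanager anchors regex matchers, hence fullmatch.
        hit = re.fullmatch(matcher["value"], actual) is not None
    else:
        hit = actual == matcher["value"]
    # isEqual=False inverts the matcher (the != and !~ forms).
    return hit if matcher.get("isEqual", True) else not hit


def is_silenced(matchers: list, labels: dict) -> bool:
    """A silence applies only when every one of its matchers matches."""
    return all(matcher_matches(m, labels) for m in matchers)


# Hypothetical matchers for the root-partition silence; the real CR may differ.
root_disk = [
    {"name": "alertname", "value": "NodeDiskHighUtilization"},
    {"name": "mountpoint", "value": "/"},
]
```

Because all matchers must match, two silences can target the same alert name (as the first two rows do) and still suppress different instances of it.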

### Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page. Its checks run independently of Prometheus and Alertmanager, although its failure notifications share the ntfy hub.

| | |
|---|---|
| **Image** | `ghcr.io/twin/gatus:v5.34.0` |
| **Status page** | `status.daviestechlabs.io` (public) |
| **Admin** | `gatus.daviestechlabs.io` (public) |

**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.
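
As a rough sketch of what the auto-discovery step produces, the function below turns discovered routes into Gatus endpoint entries. The input shape (`name` plus `hostnames`) is a simplified, assumed view of an HTTPRoute, and the `[STATUS] == 200` condition and one-minute interval are illustrative defaults, not the sidecar's actual output.

```python
def gatus_endpoint(name: str, hostname: str, interval: str = "1m") -> dict:
    """Build one Gatus endpoint entry for an HTTPS hostname."""
    return {
        "name": name,
        "url": f"https://{hostname}",
        "interval": interval,
        "conditions": ["[STATUS] == 200"],  # assumed health condition
        "alerts": [{"type": "ntfy"}],  # failures go to the ntfy provider
    }


def endpoints_from_routes(routes: list) -> list:
    """Expand discovered routes into one endpoint per hostname."""
    endpoints = []
    for route in routes:
        for hostname in route["hostnames"]:
            endpoints.append(gatus_endpoint(route["name"], hostname))
    # Sort for stable output so regenerated config does not churn.
    return sorted(endpoints, key=lambda e: (e["name"], e["url"]))
```

A route with no hostnames yields no endpoints, so services that are not exposed are simply skipped.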

**Manual endpoints:**
- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: `git.daviestechlabs.io`
- Container registry: `registry.lab.daviestechlabs.io`

**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline.

**PrometheusRules from Gatus metrics:**
- `GatusEndpointDown` — external/service endpoint failure for 5 min → critical
- `GatusEndpointExposed` — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
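
The "for 5 min" clause in both rules means the condition must hold continuously before the alert fires. The sketch below approximates that pending-to-firing behavior; `firing_since` is a hypothetical helper, and the sample format is an assumption made for illustration.

```python
from typing import Optional

FOR_SECONDS = 5 * 60  # the "for 5 min" clause


def firing_since(samples: list, for_seconds: int = FOR_SECONDS) -> Optional[int]:
    """Return the timestamp at which the alert starts firing, or None.

    samples is a time-ordered list of (unix_timestamp, condition_is_true)
    pairs, one per rule evaluation.
    """
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None  # condition cleared: the pending timer resets
            continue
        if pending_since is None:
            pending_since = ts  # alert enters the pending state
        if ts - pending_since >= for_seconds:
            return ts  # held long enough: pending -> firing
    return None
```

A single healthy evaluation resets the timer, which is why a briefly flapping endpoint does not page the operator.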

### ntfy

| | |
|---|---|
| **Image** | `binwiederhier/ntfy:v2.16.0` |
| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) |
| **Storage** | 5 Gi PVC (SQLite cache) |

ntfy serves as the central notification hub. It is protected by Authentik forward-auth via Envoy Gateway, receives webhooks from Alertmanager and Gatus, and delivers push notifications to the ntfy mobile app.

### ntfy-discord Bridge

| | |
|---|---|
| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` |
| **Source** | Custom Go service (in-repo: `ntfy-discord/`) |

The bridge subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to its own Discord channel webhook, and the service exposes Prometheus metrics that are scraped via a PodMonitor.
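
The per-topic forwarding can be sketched as a pure mapping from an ntfy stream event to a Discord webhook payload. The real bridge is a Go service; this Python version is only an illustration, and the webhook URLs, the color choices, and the `to_discord` name are all hypothetical. The event fields (`event`, `topic`, `title`, `message`, `priority`) follow ntfy's JSON message format, and the truncation limits are Discord's embed limits.

```python
from typing import Optional

# Hypothetical topic -> Discord webhook URL mapping; real URLs live in secrets.
WEBHOOKS = {
    "alertmanager-alerts": "https://discord.example/webhooks/alerts",
    "gatus": "https://discord.example/webhooks/uptime",
    "gitea-ci": "https://discord.example/webhooks/ci",
}

# Embed colors by ntfy priority (1 = min ... 5 = max/urgent); arbitrary choices.
COLORS = {5: 0xE74C3C, 4: 0xE67E22, 3: 0x3498DB}


def to_discord(event: dict) -> Optional[tuple]:
    """Map one ntfy stream event to (webhook_url, payload), or None to skip."""
    if event.get("event") != "message":
        return None  # ignore open/keepalive events on the subscription stream
    url = WEBHOOKS.get(event.get("topic", ""))
    if url is None:
        return None  # topic has no configured Discord channel
    priority = event.get("priority", 3)  # treat a missing priority as default
    embed = {
        "title": event.get("title", event["topic"])[:256],
        "description": event.get("message", "")[:4096],
        "color": COLORS.get(priority, 0x95A5A6),
    }
    return url, {"embeds": [embed]}
```

Keeping the mapping free of network calls makes it easy to unit-test; the HTTP POST to the returned URL would sit in a thin wrapper around it.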

## Notification Flow Example

```
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
```

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md)
* Related to [ADR-0021](0021-notification-architecture.md)
* Related to [ADR-0022](0022-ntfy-discord-bridge.md)