docs: add ADR-0038/0039 and replace llm-workflows references with decomposed repos
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
- ADR-0038: Infrastructure metrics collection (smartctl, SNMP, blackbox, unpoller)
- ADR-0039: Alerting and notification pipeline (Alertmanager → ntfy → Discord)
- Replace llm-workflows GitHub links with Gitea daviestechlabs org repos
- Update AGENT-ONBOARDING.md: remove llm-workflows from file tree, add missing repos
- Update ADR-0006: fix multi-repo reference
- Update ADR-0009: fix broken llm-workflows link
- Update ADR-0024: mark ray-serve repo as created, update historical context
- Update README: fix ADR-0016 status, add 0038/0039 to table, update badges
@@ -26,10 +26,15 @@ You are working on a **homelab Kubernetes cluster** running:
 | `chat-handler` | Text chat with RAG pipeline |
 | `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
 | `kuberay-images` | GPU-specific Ray worker Docker images |
+| `pipeline-bridge` | Bridge between pipelines and services |
+| `stt-module` | Speech-to-text service |
+| `tts-module` | Text-to-speech service |
+| `ray-serve` | Ray Serve inference services |
 | `argo` | Argo Workflows (training, batch inference) |
 | `kubeflow` | Kubeflow Pipeline definitions |
 | `mlflow` | MLflow integration utilities |
 | `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
+| `ntfy-discord` | ntfy → Discord notification bridge |

 ## 🏗️ System Architecture (30-Second Version)
@@ -73,8 +78,7 @@ kubernetes/apps/
 │   ├── kubeflow/        # Pipelines, Training Operator
 │   ├── milvus/          # Vector database
 │   ├── nats/            # Message bus
-│   ├── vllm/            # LLM inference
-│   └── llm-workflows/   # GitRepo sync to llm-workflows
+│   └── vllm/            # LLM inference
 ├── analytics/           # 📊 Spark, Flink, ClickHouse
 ├── observability/       # 📈 Grafana, Alloy, OpenTelemetry
 └── security/            # 🔒 Vault, Authentik, Falco
README.md (25 lines changed)
@@ -8,7 +8,7 @@
 [License badge](LICENSE)

 <!-- ADR-BADGES-START -->
-(ADR count badge images)
+(ADR count badge images, updated)
 <!-- ADR-BADGES-END -->

 ## 📖 Quick Navigation
@@ -128,6 +128,8 @@ homelab-design/
 | 0035 | [ARM64 Raspberry Pi Worker Node Strategy](decisions/0035-arm64-worker-strategy.md) | ✅ accepted | 2026-02-05 |
 | 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | ✅ accepted | 2026-02-05 |
 | 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | ✅ accepted | 2026-02-05 |
+| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | ✅ accepted | 2026-02-09 |
+| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | ✅ accepted | 2026-02-09 |
 <!-- ADR-TABLE-END -->

 ## 🔗 Related Repositories
@@ -135,9 +137,28 @@ homelab-design/
 | Repository | Purpose |
 |------------|---------|
 | [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
-| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
 | [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |

+### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)
+
+The former monolithic `llm-workflows` repo has been archived and decomposed into:
+
+| Repository | Purpose |
+|------------|---------|
+| `handler-base` | Shared Python library for NATS handlers |
+| `chat-handler` | Text chat with RAG pipeline |
+| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
+| `pipeline-bridge` | Bridge between pipelines and services |
+| `stt-module` | Speech-to-text service |
+| `tts-module` | Text-to-speech service |
+| `ray-serve` | Ray Serve inference services |
+| `kuberay-images` | GPU-specific Ray worker Docker images |
+| `argo` | Argo Workflows (training, batch inference) |
+| `kubeflow` | Kubeflow Pipeline definitions |
+| `mlflow` | MLflow integration utilities |
+| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
+| `ntfy-discord` | ntfy → Discord notification bridge |
+
 ## 📝 Contributing

 1. For architecture changes, create an ADR in `decisions/`
@@ -35,7 +35,7 @@ Chosen option: "Flux CD", because it provides a mature GitOps implementation wit
 * Git is single source of truth
 * Automatic drift detection and correction
 * Native SOPS/Age secret encryption
-* Multi-repository support (homelab-k8s2 + llm-workflows)
+* Multi-repository support (homelab-k8s2 + Gitea daviestechlabs repos)
 * Helm and Kustomize native support
 * Webhook-free sync (pull-based)

@@ -79,8 +79,10 @@ spec:
   # Public repos don't need secretRef
 ```

-Note: The monolithic `llm-workflows` repo has been decomposed into separate repos
-in the daviestechlabs Gitea organization. See AGENT-ONBOARDING.md for the full list.
+Note: The monolithic `llm-workflows` repo has been archived and decomposed into
+focused repos in the daviestechlabs Gitea organization (e.g. `chat-handler`,
+`voice-assistant`, `handler-base`, `ray-serve`, etc.). See AGENT-ONBOARDING.md
+for the full list.

 ### SOPS Integration

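As a sketch, a Flux `GitRepository` pointing at one of the decomposed Gitea repos might look like the following (repo name, namespace, and sync interval here are illustrative, not taken from the actual manifests):

```yaml
# Illustrative sketch — repo name, namespace, and interval are assumptions.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: chat-handler
  namespace: flux-system
spec:
  interval: 5m
  url: https://git.daviestechlabs.io/daviestechlabs/chat-handler
  ref:
    branch: main
  # Public repos don't need secretRef
```

Each decomposed repo would get its own `GitRepository` (and matching `Kustomization`), which is what makes Flux's multi-repository support listed above useful here.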
@@ -121,4 +121,4 @@ Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
 * [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
 * [Argo Workflows](https://argoproj.github.io/workflows/)
 * [Argo Events](https://argoproj.github.io/events/)
-* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)
+* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea)
@@ -35,23 +35,24 @@
 | `spark-analytics-jobs` | Spark batch analytics |
 | `flink-analytics-jobs` | Flink streaming analytics |

-### Remaining Ray Component
+### Ray Component Repositories

-The `ray-serve` code still needs a dedicated repository for Ray Serve model inference services.
+Both Ray repositories now exist as standalone repos in the Gitea `daviestechlabs` organization:

-| Component | Current Location | Purpose |
-|-----------|------------------|---------|
-| kuberay-images | `kuberay-images/` (standalone) | Docker images for Ray workers (NVIDIA, AMD, Intel) |
-| ray-serve | `llm-workflows/ray-serve/` | Ray Serve inference services |
-| llm-workflows | `llm-workflows/` | Pipelines, handlers, STT/TTS, embeddings |
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| kuberay-images | `kuberay-images/` (standalone repo) | Docker images for Ray workers (NVIDIA, AMD, Intel) |
+| ray-serve | `ray-serve/` (standalone repo) | Ray Serve inference services |

-### Problems with Current Structure
+### Problems with Monolithic Structure (Historical)

-1. **Tight Coupling**: ray-serve changes require llm-workflows repo access
-2. **CI/CD Complexity**: Building ray-serve images triggers unrelated workflow steps
-3. **Version Management**: Can't independently version ray-serve deployments
-4. **Team Access**: Contributors to ray-serve need access to entire llm-workflows repo
-5. **Build Times**: Changes to unrelated code can trigger ray-serve rebuilds
+These were the problems with the original monolithic `llm-workflows` structure (now resolved):
+
+1. **Tight Coupling**: ray-serve changes required llm-workflows repo access
+2. **CI/CD Complexity**: Building ray-serve images triggered unrelated workflow steps
+3. **Version Management**: Couldn't independently version ray-serve deployments
+4. **Team Access**: Contributors to ray-serve needed access to entire llm-workflows repo
+5. **Build Times**: Changes to unrelated code could trigger ray-serve rebuilds

 ## Decision

@@ -160,9 +161,9 @@ ray-serve/ # PyPI package - application code
 1. ✅ `kuberay-images` already exists as standalone repo
 2. ✅ `llm-workflows` archived - all components extracted to dedicated repos
-3. [ ] Create `ray-serve` repo on Gitea
-4. [ ] Move `.gitea/workflows/publish-ray-serve.yaml` to new repo
-5. [ ] Set up pyproject.toml for PyPI publishing
+3. ✅ `ray-serve` repo created on Gitea (`git.daviestechlabs.io/daviestechlabs/ray-serve`)
+4. ✅ CI workflows moved to new repo
+5. ✅ pyproject.toml configured for PyPI publishing
 6. [ ] Update RayService manifests to `pip install ray-serve==X.Y.Z`
 7. [ ] Verify Ray cluster pulls package correctly at runtime

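For the remaining RayService step, a trimmed sketch of what the manifest change could look like (the import path and application name are hypothetical; `X.Y.Z` stays a placeholder until a version is pinned):

```yaml
# Illustrative RayService fragment — import_path and app name are assumptions.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-inference
spec:
  serveConfigV2: |
    applications:
      - name: inference
        import_path: ray_serve.app:deployment  # hypothetical module path
        runtime_env:
          pip:
            - ray-serve==X.Y.Z  # pinned version from the published PyPI package
```

Pinning the package version in `runtime_env` is what gives the independent versioning that the monolithic layout couldn't provide.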
decisions/0038-infrastructure-metrics-collection.md (new file, 143 lines)
@@ -0,0 +1,143 @@
# Infrastructure Metrics Collection Strategy

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

## Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

## Decision Drivers

* Early warning of hardware failures (disks, UPS battery, network gear)
* Visibility into power consumption and UPS status
* Network device monitoring without vendor lock-in
* LAN host reachability tracking for non-Kubernetes services
* Keep all metrics in a single Prometheus instance for unified querying

## Considered Options

1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller
2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts
3. **SNMP polling for everything** - unified SNMP-based collection
4. **External monitoring SaaS** - Uptime Robot, Datadog, etc.

## Decision Outcome

Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves.

### Positive Consequences

* Each exporter is purpose-built and well-maintained by its community
* Native Prometheus integration via ServiceMonitor/ScrapeConfig
* No agents needed on monitored devices
* All metrics queryable in a single Prometheus instance and Grafana
* Dedicated alerting rules per domain (disk health, UPS, LAN)

### Negative Consequences

* Multiple small deployments to maintain (one per exporter type)
* Each exporter has its own configuration format
* SNMP exporter requires SNMPv3 credential management

## Components

### Disk Health: smartctl-exporter

Monitors SMART attributes on all cluster node disks to detect early signs of failure.

| | |
|---|---|
| **Chart** | `prometheus-smartctl-exporter` v0.16.0 |
| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

**Alert rules (6):**
- `SmartDeviceHighTemperature` — disk temp > 65°C
- `SmartDeviceTestFailed` — SMART self-test failure
- `SmartDeviceCriticalWarning` — NVMe critical warning bit set
- `SmartDeviceMediaErrors` — NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold` — NVMe spare capacity low
- `SmartDeviceInterfaceSlow` — link not running at max negotiated speed

**Dashboard:** Smartctl Exporter (#22604)
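For illustration, one of these alerts could be expressed as a `PrometheusRule` roughly like this (the expression and namespace are assumptions, not the deployed rule):

```yaml
# Illustrative sketch — the deployed expression and labels may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-alerts
  namespace: observability
spec:
  groups:
    - name: smartctl
      rules:
        - alert: SmartDeviceHighTemperature
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Disk {{ $labels.device }} on {{ $labels.instance }} is above 65°C"
```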

### UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.

| | |
|---|---|
| **Chart** | `prometheus-snmp-exporter` v9.11.0 |
| **Target** | `ups.lab.daviestechlabs.io` |
| **Auth** | SNMPv3 credentials from Vault via ExternalSecret |

**Alert rules:**
- `UPSOnBattery` — critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery` — critical if UPS diagnostic test reports failure

**Dashboard:** CyberPower UPS (#12340)

The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.
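A minimal sketch of how Prometheus might poll the UPS through the SNMP exporter using a `ScrapeConfig` (the module name, namespace, and exporter service address are assumptions depending on the generated snmp.yml):

```yaml
# Illustrative sketch — module name and service address are assumptions.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: snmp-ups
  namespace: observability
spec:
  metricsPath: /snmp
  params:
    module: [cyberpower]
  staticConfigs:
    - targets: ["ups.lab.daviestechlabs.io"]
  relabelings:
    # Standard SNMP exporter pattern: the device becomes a query parameter,
    # and Prometheus actually scrapes the exporter service.
    - sourceLabels: [__address__]
      targetLabel: __param_target
    - targetLabel: __address__
      replacement: prometheus-snmp-exporter.observability.svc:9116
```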

### LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.

| | |
|---|---|
| **Chart** | `prometheus-blackbox-exporter` v11.7.0 |
| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |

**Probe targets:**

| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |

**Alert rules:**
- `LanProbeFailed` — critical if any LAN probe fails for 15 minutes
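An ICMP probe set like the one above can be declared with the Prometheus Operator `Probe` CRD; a sketch with an abbreviated host list (namespace, prober address, and interval are assumptions):

```yaml
# Illustrative sketch — host list abbreviated; namespace and prober URL are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-icmp
  namespace: observability
spec:
  module: icmp
  prober:
    url: prometheus-blackbox-exporter.observability.svc:9115
  interval: 60s
  targets:
    staticConfig:
      static:
        - candlekeep
        - drizzt
```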

### Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.

| | |
|---|---|
| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` |
| **Controller** | `https://192.168.100.254` |
| **Scrape interval** | 2 minutes (matches UniFi API poll rate) |

**Dashboards (5):**
- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)
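Unpoller is configured through environment variables; a trimmed container sketch (the Secret name is hypothetical, with credentials in practice coming from Vault via ExternalSecret):

```yaml
# Illustrative container fragment — Secret name is a hypothetical placeholder.
containers:
  - name: unpoller
    image: ghcr.io/unpoller/unpoller:v2.33.0
    env:
      - name: UP_UNIFI_DEFAULT_URL
        value: "https://192.168.100.254"
      - name: UP_UNIFI_DEFAULT_USER
        valueFrom:
          secretKeyRef:
            name: unpoller-credentials  # hypothetical Secret name
            key: username
```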

### External Node Scraping

Static scrape targets for off-cluster hosts running their own exporters (the candlekeep NAS and the JetKVM):

| Target | Port | Exporter |
|--------|------|----------|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics |

Configured via `additionalScrapeConfigs` in kube-prometheus-stack.
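The static targets above can be wired in roughly like this under the kube-prometheus-stack Helm values (job names here are illustrative):

```yaml
# Illustrative values fragment — job names are assumptions.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9100"]
      - job_name: candlekeep-smartctl
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9633"]
```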

## Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md)
decisions/0039-alerting-notification-pipeline.md (new file, 197 lines)
@@ -0,0 +1,197 @@
# Alerting and Notification Pipeline

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

## Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

## Decision Drivers

* Critical alerts must reach the operator within seconds (mobile push + Discord)
* Alert fatigue must be minimized — suppress known-noisy alerts declaratively
* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
* Alert routing must be GitOps-managed and version-controlled
* Uptime monitoring needs a public-facing status page

## Considered Options

1. **Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus
2. **Alertmanager → Discord webhook directly** with manual silences
3. **Alertmanager → Grafana OnCall** for incident management
4. **External SaaS (PagerDuty, Opsgenie)**

## Decision Outcome

Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

### Positive Consequences

* Fully self-hosted, no external dependencies
* ntfy provides mobile push without app-specific integrations
* Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics
* Silence rules are version-controlled Kubernetes resources
* Gatus provides a public status page independent of the alerting pipeline

### Negative Consequences

* Custom bridge service (ntfy-discord) to maintain
* ntfy is a single point of failure for notifications (mitigated by persistent storage)
* No built-in on-call rotation or escalation (acceptable for single operator)

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        ALERT SOURCES                         │
│                                                              │
│   PrometheusRules     Gatus Endpoint      Custom Webhooks    │
│   (metric alerts)     Monitors            (CI, etc.)         │
│        │                   │                   │             │
└────────┼───────────────────┼───────────────────┼─────────────┘
         │                   │                   │
         ▼                   │                   │
┌─────────────────┐          │                   │
│  Alertmanager   │          │                   │
│                 │          │                   │
│  Routes by      │          │                   │
│  severity:      │          │                   │
│  critical→urgent│          │                   │
│  warning→high   │          │                   │
│  default→null   │          │                   │
└────────┬────────┘          │                   │
         │                   │                   │
         │    ┌──────────────┘                   │
         ▼    ▼                                  ▼
┌──────────────────────────────────────────────────────────────┐
│                            ntfy                              │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ← Alertmanager webhooks              │
│    gatus                ← Gatus endpoint failures            │
│    gitea-ci             ← CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                     ntfy-discord bridge                      │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to:   Discord webhooks (per-topic channels)        │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
```

## Component Details

### Alertmanager Routing

Configured via `AlertmanagerConfig` in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|----------|---------------|------|----------|
| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord |
| `warning` | high | `warning` | Push + Discord |
| All others | — | — | Routed to `null-receiver` (dropped) |

The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages.
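A trimmed sketch of the routing described above as an `AlertmanagerConfig` (receiver names and namespace are illustrative, not the deployed resource):

```yaml
# Illustrative sketch — receiver names and namespace are assumptions.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: ntfy-routing
  namespace: observability
spec:
  route:
    receiver: null-receiver  # default: drop anything unmatched
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: ntfy-critical
      - matchers:
          - name: severity
            value: warning
        receiver: ntfy-warning
  receivers:
    - name: ntfy-critical
      webhookConfigs:
        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
    - name: ntfy-warning
      webhookConfigs:
        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
    - name: null-receiver
```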

### Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover:

| Rule | Source | Severity |
|------|--------|----------|
| `DockerhubRateLimitRisk` | kube-prometheus-stack | — |
| `OomKilled` | kube-prometheus-stack | — |
| `ZfsUnexpectedPoolState` | kube-prometheus-stack | — |
| `UPSOnBattery` | SNMP exporter | critical |
| `UPSReplaceBattery` | SNMP exporter | critical |
| `LanProbeFailed` | Blackbox exporter | critical |
| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical |
| `GatusEndpointDown` | Gatus | critical |
| `GatusEndpointExposed` | Gatus | critical |

### Noise Management: Silence Operator

The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

**Active silences:**

| Silence | Alert Suppressed | Reason |
|---------|------------------|--------|
| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization |
| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected |
| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design |
| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior |
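Each silence is a small custom resource; a sketch of one from the table (the matcher shape follows the silence-operator CRD, but the exact matchers used in the cluster are assumptions):

```yaml
# Illustrative sketch — actual matchers in the deployed CR may differ.
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: keda-hpa-maxed-out
spec:
  matchers:
    - name: alertname
      value: KubeHpaMaxedOut
      isRegex: false
```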

### Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.

| | |
|---|---|
| **Image** | `ghcr.io/twin/gatus:v5.34.0` |
| **Status page** | `status.daviestechlabs.io` (public) |
| **Admin** | `gatus.daviestechlabs.io` (public) |

**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.

**Manual endpoints:**
- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: `git.daviestechlabs.io`
- Container registry: `registry.lab.daviestechlabs.io`

**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline.

**PrometheusRules from Gatus metrics:**
- `GatusEndpointDown` — external/service endpoint failure for 5 min → critical
- `GatusEndpointExposed` — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
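A manual endpoint in the Gatus config looks roughly like this (interval, group, and conditions are illustrative):

```yaml
# Illustrative Gatus endpoint — interval, group, and conditions are assumptions.
endpoints:
  - name: gitea
    group: services
    url: "https://git.daviestechlabs.io"
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: ntfy
```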

### ntfy

| | |
|---|---|
| **Image** | `binwiederhier/ntfy:v2.16.0` |
| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) |
| **Storage** | 5 Gi PVC (SQLite cache) |

Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app.

### ntfy-discord Bridge

| | |
|---|---|
| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` |
| **Source** | Custom Go service (in-repo: `ntfy-discord/`) |

Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.
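A plausible shape for the bridge's topic-to-webhook mapping, purely as a hypothetical sketch (the actual service's config format is not shown in this commit, and every field name and webhook placeholder below is invented for illustration):

```yaml
# Hypothetical config sketch — field names and placeholders are invented.
ntfyURL: http://ntfy-svc.observability.svc.cluster.local
topics:
  - name: alertmanager-alerts
    discordWebhook: https://discord.com/api/webhooks/<alerts-channel>  # placeholder
  - name: gatus
    discordWebhook: https://discord.com/api/webhooks/<status-channel>  # placeholder
  - name: gitea-ci
    discordWebhook: https://discord.com/api/webhooks/<ci-channel>      # placeholder
```

The per-topic mapping is what lets each alert class land in its own Discord channel.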

## Notification Flow Example

```
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
```

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md)
* Related to [ADR-0021](0021-notification-architecture.md)
* Related to [ADR-0022](0022-ntfy-discord-bridge.md)