From 8e3e2043c3f53530dacdc638ef301110287312ec Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Mon, 9 Feb 2026 18:10:56 -0500 Subject: [PATCH] docs: add ADR-0038/0039 and replace llm-workflows references with decomposed repos MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - ADR-0038: Infrastructure metrics collection (smartctl, SNMP, blackbox, unpoller) - ADR-0039: Alerting and notification pipeline (Alertmanager → ntfy → Discord) - Replace llm-workflows GitHub links with Gitea daviestechlabs org repos - Update AGENT-ONBOARDING.md: remove llm-workflows from file tree, add missing repos - Update ADR-0006: fix multi-repo reference - Update ADR-0009: fix broken llm-workflows link - Update ADR-0024: mark ray-serve repo as created, update historical context - Update README: fix ADR-0016 status, add 0038/0039 to table, update badges --- AGENT-ONBOARDING.md | 8 +- README.md | 25 ++- decisions/0006-gitops-with-flux.md | 8 +- decisions/0009-dual-workflow-engines.md | 2 +- decisions/0024-ray-repository-structure.md | 33 +-- .../0038-infrastructure-metrics-collection.md | 143 +++++++++++++ .../0039-alerting-notification-pipeline.md | 197 ++++++++++++++++++ 7 files changed, 392 insertions(+), 24 deletions(-) create mode 100644 decisions/0038-infrastructure-metrics-collection.md create mode 100644 decisions/0039-alerting-notification-pipeline.md diff --git a/AGENT-ONBOARDING.md b/AGENT-ONBOARDING.md index 32d0f82..25a71ce 100644 --- a/AGENT-ONBOARDING.md +++ b/AGENT-ONBOARDING.md @@ -26,10 +26,15 @@ You are working on a **homelab Kubernetes cluster** running: | `chat-handler` | Text chat with RAG pipeline | | `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) | | `kuberay-images` | GPU-specific Ray worker Docker images | +| `pipeline-bridge` | Bridge between pipelines and services | +| `stt-module` | Speech-to-text service | +| `tts-module` | Text-to-speech service | +| `ray-serve` | Ray Serve inference services | | `argo` | 
Argo Workflows (training, batch inference) | | `kubeflow` | Kubeflow Pipeline definitions | | `mlflow` | MLflow integration utilities | | `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) | +| `ntfy-discord` | ntfy → Discord notification bridge | ## 🏗️ System Architecture (30-Second Version) @@ -73,8 +78,7 @@ kubernetes/apps/ │ ├── kubeflow/ # Pipelines, Training Operator │ ├── milvus/ # Vector database │ ├── nats/ # Message bus -│ ├── vllm/ # LLM inference -│ └── llm-workflows/ # GitRepo sync to llm-workflows +│ └── vllm/ # LLM inference ├── analytics/ # 📊 Spark, Flink, ClickHouse ├── observability/ # 📈 Grafana, Alloy, OpenTelemetry └── security/ # 🔒 Vault, Authentik, Falco diff --git a/README.md b/README.md index baa536e..1264f2d 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ [![License](https://img.shields.io/badge/License-MIT-green)](LICENSE) -![ADR Count](https://img.shields.io/badge/ADRs-37_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-36-brightgreen) ![Proposed](https://img.shields.io/badge/proposed-0-yellow) +![ADR Count](https://img.shields.io/badge/ADRs-39_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-39-brightgreen) ## 📖 Quick Navigation @@ -128,6 +128,8 @@ homelab-design/ | 0035 | [ARM64 Raspberry Pi Worker Node Strategy](decisions/0035-arm64-worker-strategy.md) | ✅ accepted | 2026-02-05 | | 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | ✅ accepted | 2026-02-05 | | 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | ✅ accepted | 2026-02-05 | +| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | ✅ accepted | 2026-02-09 | +| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | ✅ accepted | 2026-02-09 | ## 🔗 Related Repositories @@ -135,9 +137,28 @@ homelab-design/ | Repository | Purpose | 
|------------|---------| | [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps | -| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows | | [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend | +### AI/ML Repos (git.daviestechlabs.io/daviestechlabs) + +The former monolithic `llm-workflows` repo has been archived and decomposed into: + +| Repository | Purpose | +|------------|--------| +| `handler-base` | Shared Python library for NATS handlers | +| `chat-handler` | Text chat with RAG pipeline | +| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) | +| `pipeline-bridge` | Bridge between pipelines and services | +| `stt-module` | Speech-to-text service | +| `tts-module` | Text-to-speech service | +| `ray-serve` | Ray Serve inference services | +| `kuberay-images` | GPU-specific Ray worker Docker images | +| `argo` | Argo Workflows (training, batch inference) | +| `kubeflow` | Kubeflow Pipeline definitions | +| `mlflow` | MLflow integration utilities | +| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) | +| `ntfy-discord` | ntfy → Discord notification bridge | + ## 📝 Contributing 1. 
For architecture changes, create an ADR in `decisions/` diff --git a/decisions/0006-gitops-with-flux.md b/decisions/0006-gitops-with-flux.md index 839bebd..8b839ab 100644 --- a/decisions/0006-gitops-with-flux.md +++ b/decisions/0006-gitops-with-flux.md @@ -35,7 +35,7 @@ Chosen option: "Flux CD", because it provides a mature GitOps implementation wit * Git is single source of truth * Automatic drift detection and correction * Native SOPS/Age secret encryption -* Multi-repository support (homelab-k8s2 + llm-workflows) +* Multi-repository support (homelab-k8s2 + Gitea daviestechlabs repos) * Helm and Kustomize native support * Webhook-free sync (pull-based) @@ -79,8 +79,10 @@ spec: # Public repos don't need secretRef ``` -Note: The monolithic `llm-workflows` repo has been decomposed into separate repos -in the daviestechlabs Gitea organization. See AGENT-ONBOARDING.md for the full list. +Note: The monolithic `llm-workflows` repo has been archived and decomposed into +focused repos in the daviestechlabs Gitea organization (e.g. `chat-handler`, +`voice-assistant`, `handler-base`, `ray-serve`). See AGENT-ONBOARDING.md +for the full list. 
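+
+As a sketch (repo name and sync interval here are illustrative, not the actual
+Flux configuration), a GitRepository for one of the decomposed repos would look like:
+
+```yaml
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+  name: chat-handler
+  namespace: flux-system
+spec:
+  interval: 5m
+  url: https://git.daviestechlabs.io/daviestechlabs/chat-handler
+  ref:
+    branch: main
+```
+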
### SOPS Integration diff --git a/decisions/0009-dual-workflow-engines.md b/decisions/0009-dual-workflow-engines.md index 94be24b..4f09f8b 100644 --- a/decisions/0009-dual-workflow-engines.md +++ b/decisions/0009-dual-workflow-engines.md @@ -121,4 +121,4 @@ Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline * [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/) * [Argo Workflows](https://argoproj.github.io/workflows/) * [Argo Events](https://argoproj.github.io/events/) -* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml) +* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea) diff --git a/decisions/0024-ray-repository-structure.md b/decisions/0024-ray-repository-structure.md index 0d8a014..16ada5b 100644 --- a/decisions/0024-ray-repository-structure.md +++ b/decisions/0024-ray-repository-structure.md @@ -35,23 +35,24 @@ | `spark-analytics-jobs` | Spark batch analytics | | `flink-analytics-jobs` | Flink streaming analytics | -### Remaining Ray Component +### Ray Component Repositories -The `ray-serve` code still needs a dedicated repository for Ray Serve model inference services. +Both Ray repositories now exist as standalone repos in the Gitea `daviestechlabs` organization: -| Component | Current Location | Purpose | -|-----------|------------------|---------| -| kuberay-images | `kuberay-images/` (standalone) | Docker images for Ray workers (NVIDIA, AMD, Intel) | -| ray-serve | `llm-workflows/ray-serve/` | Ray Serve inference services | -| llm-workflows | `llm-workflows/` | Pipelines, handlers, STT/TTS, embeddings | +| Component | Location | Purpose | +|-----------|----------|---------| +| kuberay-images | `kuberay-images/` (standalone repo) | Docker images for Ray workers (NVIDIA, AMD, Intel) | +| ray-serve | `ray-serve/` (standalone repo) | Ray Serve inference services | -### Problems with Current Structure +### Problems with Monolithic Structure (Historical) -1. 
**Tight Coupling**: ray-serve changes require llm-workflows repo access -2. **CI/CD Complexity**: Building ray-serve images triggers unrelated workflow steps -3. **Version Management**: Can't independently version ray-serve deployments -4. **Team Access**: Contributors to ray-serve need access to entire llm-workflows repo -5. **Build Times**: Changes to unrelated code can trigger ray-serve rebuilds +These were the problems with the original monolithic `llm-workflows` structure (now resolved): + +1. **Tight Coupling**: ray-serve changes required llm-workflows repo access +2. **CI/CD Complexity**: Building ray-serve images triggered unrelated workflow steps +3. **Version Management**: Couldn't independently version ray-serve deployments +4. **Team Access**: Contributors to ray-serve needed access to entire llm-workflows repo +5. **Build Times**: Changes to unrelated code could trigger ray-serve rebuilds ## Decision @@ -160,9 +161,9 @@ ray-serve/ # PyPI package - application code 1. ✅ `kuberay-images` already exists as standalone repo 2. ✅ `llm-workflows` archived - all components extracted to dedicated repos -3. [ ] Create `ray-serve` repo on Gitea -4. [ ] Move `.gitea/workflows/publish-ray-serve.yaml` to new repo -5. [ ] Set up pyproject.toml for PyPI publishing +3. ✅ `ray-serve` repo created on Gitea (`git.daviestechlabs.io/daviestechlabs/ray-serve`) +4. ✅ CI workflows moved to new repo +5. ✅ pyproject.toml configured for PyPI publishing 6. [ ] Update RayService manifests to `pip install ray-serve==X.Y.Z` 7. 
[ ] Verify Ray cluster pulls package correctly at runtime diff --git a/decisions/0038-infrastructure-metrics-collection.md b/decisions/0038-infrastructure-metrics-collection.md new file mode 100644 index 0000000..6b12925 --- /dev/null +++ b/decisions/0038-infrastructure-metrics-collection.md @@ -0,0 +1,143 @@ +# Infrastructure Metrics Collection Strategy + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry + +## Context and Problem Statement + +Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability. + +How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster? + +## Decision Drivers + +* Early warning of hardware failures (disks, UPS battery, network gear) +* Visibility into power consumption and UPS status +* Network device monitoring without vendor lock-in +* LAN host reachability tracking for non-Kubernetes services +* Keep all metrics in a single Prometheus instance for unified querying + +## Considered Options + +1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller +2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts +3. **SNMP polling for everything** - unified SNMP-based collection +4. **External monitoring SaaS** - Uptime Robot, Datadog, etc. + +## Decision Outcome + +Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves. 
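+
+Each exporter plugs into Prometheus the same way. As a minimal sketch (name,
+namespace, and labels are illustrative, not the actual manifests), a
+ServiceMonitor for one exporter looks like:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: smartctl-exporter
+  namespace: observability
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: smartctl-exporter
+  endpoints:
+    - port: metrics
+      interval: 60s
+```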
+ +### Positive Consequences + +* Each exporter is purpose-built and well-maintained by its community +* Native Prometheus integration via ServiceMonitor/ScrapeConfig +* No agents needed on monitored devices +* All metrics queryable in a single Prometheus instance and Grafana +* Dedicated alerting rules per domain (disk health, UPS, LAN) + +### Negative Consequences + +* Multiple small deployments to maintain (one per exporter type) +* Each exporter has its own configuration format +* SNMP exporter requires SNMPv3 credential management + +## Components + +### Disk Health: smartctl-exporter + +Monitors SMART attributes on all cluster node disks to detect early signs of failure. + +| | | +|---|---| +| **Chart** | `prometheus-smartctl-exporter` v0.16.0 | +| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) | + +**Alert rules (6):** +- `SmartDeviceHighTemperature` — disk temp > 65°C +- `SmartDeviceTestFailed` — SMART self-test failure +- `SmartDeviceCriticalWarning` — NVMe critical warning bit set +- `SmartDeviceMediaErrors` — NVMe media/integrity errors +- `SmartDeviceAvailableSpareUnderThreshold` — NVMe spare capacity low +- `SmartDeviceInterfaceSlow` — link not running at max negotiated speed + +**Dashboard:** Smartctl Exporter (#22604) + +### UPS Monitoring: SNMP Exporter + +Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load. + +| | | +|---|---| +| **Chart** | `prometheus-snmp-exporter` v9.11.0 | +| **Target** | `ups.lab.daviestechlabs.io` | +| **Auth** | SNMPv3 credentials from Vault via ExternalSecret | + +**Alert rules:** +- `UPSOnBattery` — critical if ≤ 20 min battery remaining while on battery power +- `UPSReplaceBattery` — critical if UPS diagnostic test reports failure + +**Dashboard:** CyberPower UPS (#12340) + +The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge. 
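+
+A sketch of the `UPSOnBattery` rule, assuming standard UPS-MIB metric names as
+exposed by the SNMP exporter (the real rule's expression and thresholds may
+differ):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: ups-alerts
+  namespace: observability
+spec:
+  groups:
+    - name: ups
+      rules:
+        - alert: UPSOnBattery
+          # upsOutputSource == 5 means "battery" in UPS-MIB (RFC 1628)
+          expr: upsEstimatedMinutesRemaining <= 20 and upsOutputSource == 5
+          for: 1m
+          labels:
+            severity: critical
+```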
+ +### LAN Probing: Blackbox Exporter + +Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster. + +| | | +|---|---| +| **Chart** | `prometheus-blackbox-exporter` v11.7.0 | +| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) | + +**Probe targets:** +| Type | Targets | +|------|---------| +| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar | +| TCP | `expanse.internal:2049` (NFS service) | + +**Alert rules:** +- `LanProbeFailed` — critical if any LAN probe fails for 15 minutes + +### Network Equipment: Unpoller + +Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller. + +| | | +|---|---| +| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` | +| **Controller** | `https://192.168.100.254` | +| **Scrape interval** | 2 minutes (matches UniFi API poll rate) | + +**Dashboards (5):** +- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312) + +### External Node Scraping + +Static scrape targets for the NAS host running its own exporters outside the cluster: + +| Target | Port | Exporter | +|--------|------|----------| +| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter | +| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter | +| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics | + +Configured via `additionalScrapeConfigs` in kube-prometheus-stack. 
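+
+In kube-prometheus-stack Helm values this looks roughly like (job names are
+illustrative):
+
+```yaml
+prometheus:
+  prometheusSpec:
+    additionalScrapeConfigs:
+      - job_name: candlekeep-node
+        static_configs:
+          - targets: ["candlekeep.lab.daviestechlabs.io:9100"]
+      - job_name: candlekeep-smartctl
+        static_configs:
+          - targets: ["candlekeep.lab.daviestechlabs.io:9633"]
+```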
+ +## Metrics Coverage Summary + +| Domain | Exporter | Key Signals | +|--------|----------|-------------| +| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity | +| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics | +| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity | +| Network gear | Unpoller | AP clients, switch throughput, PDU power | +| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts | +| KVM | Static scrape | JetKVM device metrics | + +## Links + +* Refined by [ADR-0025](0025-observability-stack.md) +* Related to [ADR-0039](0039-alerting-notification-pipeline.md) diff --git a/decisions/0039-alerting-notification-pipeline.md b/decisions/0039-alerting-notification-pipeline.md new file mode 100644 index 0000000..8d5b3fe --- /dev/null +++ b/decisions/0039-alerting-notification-pipeline.md @@ -0,0 +1,197 @@ +# Alerting and Notification Pipeline + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab + +## Context and Problem Statement + +A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions. + +How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control? + +## Decision Drivers + +* Critical alerts must reach the operator within seconds (mobile push + Discord) +* Alert fatigue must be minimized — suppress known-noisy alerts declaratively +* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS) +* Alert routing must be GitOps-managed and version-controlled +* Uptime monitoring needs a public-facing status page + +## Considered Options + +1. 
**Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus +2. **Alertmanager → Discord webhook directly** with manual silences +3. **Alertmanager → Grafana OnCall** for incident management +4. **External SaaS (PagerDuty, Opsgenie)** + +## Decision Outcome + +Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring. + +ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs. + +### Positive Consequences + +* Fully self-hosted, no external dependencies +* ntfy provides mobile push without app-specific integrations +* Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics +* Silence rules are version-controlled Kubernetes resources +* Gatus provides a public status page independent of the alerting pipeline + +### Negative Consequences + +* Custom bridge service (ntfy-discord) to maintain +* ntfy is a single point of failure for notifications (mitigated by persistent storage) +* No built-in on-call rotation or escalation (acceptable for single operator) + +## Architecture + +``` +┌──────────────────────────────────────────────────────────────┐ +│ ALERT SOURCES │ +│ │ +│ PrometheusRules Gatus Endpoint Custom Webhooks │ +│ (metric alerts) Monitors (CI, etc.) 
│ +│ │ │ │ │ +└────────┼────────────────┼────────────────────┼───────────────┘ + │ │ │ + ▼ │ │ +┌─────────────────┐ │ │ +│ Alertmanager │ │ │ +│ │ │ │ +│ Routes by │ │ │ +│ severity: │ │ │ +│ critical→urgent│ │ │ +│ warning→high │ │ │ +│ default→null │ │ │ +└────────┬────────┘ │ │ + │ │ │ + │ ┌────────────┘ │ + ▼ ▼ ▼ +┌──────────────────────────────────────────────────────────────┐ +│ ntfy │ +│ │ +│ Topics: │ +│ alertmanager-alerts ← Alertmanager webhooks │ +│ gatus ← Gatus endpoint failures │ +│ gitea-ci ← CI pipeline notifications │ +│ │ +│ → Mobile push (ntfy app) │ +│ → Web UI at ntfy.daviestechlabs.io │ +└────────────────────┬─────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────┐ +│ ntfy-discord bridge │ +│ │ +│ Subscribes to: alertmanager-alerts, gatus, gitea-ci │ +│ Forwards to: Discord webhooks (per-topic channels) │ +│ Custom-built Go service with Prometheus metrics │ +└──────────────────────────────────────────────────────────────┘ +``` + +## Component Details + +### Alertmanager Routing + +Configured via `AlertmanagerConfig` in kube-prometheus-stack: + +| Severity | ntfy Priority | Tags | Behavior | +|----------|---------------|------|----------| +| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord | +| `warning` | high | `warning` | Push + Discord | +| All others | — | — | Routed to `null-receiver` (dropped) | + +The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages. 
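+
+A sketch of this routing as an `AlertmanagerConfig` (resource name and
+namespace are illustrative; the actual config also sets ntfy priority and tag
+fields per severity):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1alpha1
+kind: AlertmanagerConfig
+metadata:
+  name: ntfy-routing
+  namespace: observability
+spec:
+  route:
+    receiver: "null-receiver"
+    routes:
+      - receiver: "ntfy"
+        matchers:
+          - name: severity
+            value: critical
+  receivers:
+    - name: "ntfy"
+      webhookConfigs:
+        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
+    - name: "null-receiver"
+```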
+ +### Custom Alert Rules + +Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover: + +| Rule | Source | Severity | +|------|--------|----------| +| `DockerhubRateLimitRisk` | kube-prometheus-stack | — | +| `OomKilled` | kube-prometheus-stack | — | +| `ZfsUnexpectedPoolState` | kube-prometheus-stack | — | +| `UPSOnBattery` | SNMP exporter | critical | +| `UPSReplaceBattery` | SNMP exporter | critical | +| `LanProbeFailed` | Blackbox exporter | critical | +| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical | +| `GatusEndpointDown` | Gatus | critical | +| `GatusEndpointExposed` | Gatus | critical | + +### Noise Management: Silence Operator + +The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git. + +**Active silences:** + +| Silence | Alert Suppressed | Reason | +|---------|------------------|--------| +| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization | +| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected | +| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design | +| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior | + +### Uptime Monitoring: Gatus + +Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline. + +| | | +|---|---| +| **Image** | `ghcr.io/twin/gatus:v5.34.0` | +| **Status page** | `status.daviestechlabs.io` (public) | +| **Admin** | `gatus.daviestechlabs.io` (public) | + +**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services. 
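+
+A hand-written Gatus endpoint looks roughly like this (interval and conditions
+are illustrative); the sidecar generates equivalent entries for discovered
+routes:
+
+```yaml
+endpoints:
+  - name: gitea
+    group: services
+    url: "https://git.daviestechlabs.io"
+    interval: 60s
+    conditions:
+      - "[STATUS] == 200"
+    alerts:
+      - type: ntfy
+```
+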
+ +**Manual endpoints:** +- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP +- Gitea: `git.daviestechlabs.io` +- Container registry: `registry.lab.daviestechlabs.io` + +**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline. + +**PrometheusRules from Gatus metrics:** +- `GatusEndpointDown` — external/service endpoint failure for 5 min → critical +- `GatusEndpointExposed` — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure) + +### ntfy + +| | | +|---|---| +| **Image** | `binwiederhier/ntfy:v2.16.0` | +| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) | +| **Storage** | 5 Gi PVC (SQLite cache) | + +Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app. + +### ntfy-discord Bridge + +| | | +|---|---| +| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` | +| **Source** | Custom Go service (in-repo: `ntfy-discord/`) | + +Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor. + +## Notification Flow Example + +``` +1. Prometheus evaluates: smartctl SMART status ≠ 1 +2. SmartDeviceTestFailed fires (severity: critical) +3. Alertmanager matches critical route → webhook to ntfy +4. ntfy receives on "alertmanager-alerts" topic + → Pushes to mobile via ntfy app + → ntfy-discord subscribes and forwards to Discord webhook +5. Operator receives push notification + Discord message +``` + +## Links + +* Refined by [ADR-0025](0025-observability-stack.md) +* Related to [ADR-0038](0038-infrastructure-metrics-collection.md) +* Related to [ADR-0021](0021-notification-architecture.md) +* Related to [ADR-0022](0022-ntfy-discord-bridge.md)