docs: add ADR-0038/0039 and replace llm-workflows references with decomposed repos
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
- ADR-0038: Infrastructure metrics collection (smartctl, SNMP, blackbox, unpoller)
- ADR-0039: Alerting and notification pipeline (Alertmanager → ntfy → Discord)
- Replace llm-workflows GitHub links with Gitea daviestechlabs org repos
- Update AGENT-ONBOARDING.md: remove llm-workflows from file tree, add missing repos
- Update ADR-0006: fix multi-repo reference
- Update ADR-0009: fix broken llm-workflows link
- Update ADR-0024: mark ray-serve repo as created, update historical context
- Update README: fix ADR-0016 status, add 0038/0039 to table, update badges
@@ -26,10 +26,15 @@ You are working on a **homelab Kubernetes cluster** running:
 | `chat-handler` | Text chat with RAG pipeline |
 | `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
 | `kuberay-images` | GPU-specific Ray worker Docker images |
+| `pipeline-bridge` | Bridge between pipelines and services |
+| `stt-module` | Speech-to-text service |
+| `tts-module` | Text-to-speech service |
+| `ray-serve` | Ray Serve inference services |
 | `argo` | Argo Workflows (training, batch inference) |
 | `kubeflow` | Kubeflow Pipeline definitions |
 | `mlflow` | MLflow integration utilities |
 | `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
+| `ntfy-discord` | ntfy → Discord notification bridge |

 ## 🏗️ System Architecture (30-Second Version)
@@ -73,8 +78,7 @@ kubernetes/apps/
 │   ├── kubeflow/        # Pipelines, Training Operator
 │   ├── milvus/          # Vector database
 │   ├── nats/            # Message bus
-│   ├── vllm/            # LLM inference
-│   └── llm-workflows/   # GitRepo sync to llm-workflows
+│   └── vllm/            # LLM inference
 ├── analytics/           # 📊 Spark, Flink, ClickHouse
 ├── observability/       # 📈 Grafana, Alloy, OpenTelemetry
 └── security/            # 🔒 Vault, Authentik, Falco
README.md (25 lines changed)
@@ -8,7 +8,7 @@
 [License badge](LICENSE)

 <!-- ADR-BADGES-START -->
-(ADR count badge images)
+(ADR count badge images, updated)
 <!-- ADR-BADGES-END -->

 ## 📖 Quick Navigation
@@ -128,6 +128,8 @@ homelab-design/
 | 0035 | [ARM64 Raspberry Pi Worker Node Strategy](decisions/0035-arm64-worker-strategy.md) | ✅ accepted | 2026-02-05 |
 | 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | ✅ accepted | 2026-02-05 |
 | 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | ✅ accepted | 2026-02-05 |
+| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | ✅ accepted | 2026-02-09 |
+| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | ✅ accepted | 2026-02-09 |
 <!-- ADR-TABLE-END -->

 ## 🔗 Related Repositories
@@ -135,9 +137,28 @@ homelab-design/
 | Repository | Purpose |
 |------------|---------|
 | [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
-| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
 | [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |

+### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)
+
+The former monolithic `llm-workflows` repo has been archived and decomposed into:
+
+| Repository | Purpose |
+|------------|---------|
+| `handler-base` | Shared Python library for NATS handlers |
+| `chat-handler` | Text chat with RAG pipeline |
+| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
+| `pipeline-bridge` | Bridge between pipelines and services |
+| `stt-module` | Speech-to-text service |
+| `tts-module` | Text-to-speech service |
+| `ray-serve` | Ray Serve inference services |
+| `kuberay-images` | GPU-specific Ray worker Docker images |
+| `argo` | Argo Workflows (training, batch inference) |
+| `kubeflow` | Kubeflow Pipeline definitions |
+| `mlflow` | MLflow integration utilities |
+| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
+| `ntfy-discord` | ntfy → Discord notification bridge |
+
 ## 📝 Contributing

 1. For architecture changes, create an ADR in `decisions/`
@@ -35,7 +35,7 @@ Chosen option: "Flux CD", because it provides a mature GitOps implementation wit
 * Git is single source of truth
 * Automatic drift detection and correction
 * Native SOPS/Age secret encryption
-* Multi-repository support (homelab-k8s2 + llm-workflows)
+* Multi-repository support (homelab-k8s2 + Gitea daviestechlabs repos)
 * Helm and Kustomize native support
 * Webhook-free sync (pull-based)

@@ -79,8 +79,10 @@ spec:
   # Public repos don't need secretRef
 ```

-Note: The monolithic `llm-workflows` repo has been decomposed into separate repos
-in the daviestechlabs Gitea organization. See AGENT-ONBOARDING.md for the full list.
+Note: The monolithic `llm-workflows` repo has been archived and decomposed into
+focused repos in the daviestechlabs Gitea organization (e.g. `chat-handler`,
+`voice-assistant`, `handler-base`, `ray-serve`, etc.). See AGENT-ONBOARDING.md
+for the full list.

 ### SOPS Integration

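As a sketch, a Flux `GitRepository` pointing at one of the decomposed Gitea repos might look like the following (repo name, namespace, and sync interval here are illustrative, not taken from the actual manifests):

```yaml
# Illustrative sketch — repo name, namespace, and interval are assumptions.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: chat-handler
  namespace: flux-system
spec:
  interval: 5m
  url: https://git.daviestechlabs.io/daviestechlabs/chat-handler
  ref:
    branch: main
  # Public repos don't need secretRef
```

Each decomposed repo would get its own `GitRepository` (and matching `Kustomization`), which is what makes Flux's multi-repository support listed above useful here.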
@@ -121,4 +121,4 @@ Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
 * [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
 * [Argo Workflows](https://argoproj.github.io/workflows/)
 * [Argo Events](https://argoproj.github.io/events/)
-* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)
+* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea)
@@ -35,23 +35,24 @@
 | `spark-analytics-jobs` | Spark batch analytics |
 | `flink-analytics-jobs` | Flink streaming analytics |

-### Remaining Ray Component
+### Ray Component Repositories

-The `ray-serve` code still needs a dedicated repository for Ray Serve model inference services.
+Both Ray repositories now exist as standalone repos in the Gitea `daviestechlabs` organization:

-| Component | Current Location | Purpose |
-|-----------|------------------|---------|
-| kuberay-images | `kuberay-images/` (standalone) | Docker images for Ray workers (NVIDIA, AMD, Intel) |
-| ray-serve | `llm-workflows/ray-serve/` | Ray Serve inference services |
-| llm-workflows | `llm-workflows/` | Pipelines, handlers, STT/TTS, embeddings |
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| kuberay-images | `kuberay-images/` (standalone repo) | Docker images for Ray workers (NVIDIA, AMD, Intel) |
+| ray-serve | `ray-serve/` (standalone repo) | Ray Serve inference services |

-### Problems with Current Structure
+### Problems with Monolithic Structure (Historical)

-1. **Tight Coupling**: ray-serve changes require llm-workflows repo access
-2. **CI/CD Complexity**: Building ray-serve images triggers unrelated workflow steps
-3. **Version Management**: Can't independently version ray-serve deployments
-4. **Team Access**: Contributors to ray-serve need access to entire llm-workflows repo
-5. **Build Times**: Changes to unrelated code can trigger ray-serve rebuilds
+These were the problems with the original monolithic `llm-workflows` structure (now resolved):
+
+1. **Tight Coupling**: ray-serve changes required llm-workflows repo access
+2. **CI/CD Complexity**: Building ray-serve images triggered unrelated workflow steps
+3. **Version Management**: Couldn't independently version ray-serve deployments
+4. **Team Access**: Contributors to ray-serve needed access to entire llm-workflows repo
+5. **Build Times**: Changes to unrelated code could trigger ray-serve rebuilds

 ## Decision

@@ -160,9 +161,9 @@ ray-serve/ # PyPI package - application code
 1. ✅ `kuberay-images` already exists as standalone repo
 2. ✅ `llm-workflows` archived - all components extracted to dedicated repos
-3. [ ] Create `ray-serve` repo on Gitea
-4. [ ] Move `.gitea/workflows/publish-ray-serve.yaml` to new repo
-5. [ ] Set up pyproject.toml for PyPI publishing
+3. ✅ `ray-serve` repo created on Gitea (`git.daviestechlabs.io/daviestechlabs/ray-serve`)
+4. ✅ CI workflows moved to new repo
+5. ✅ pyproject.toml configured for PyPI publishing
 6. [ ] Update RayService manifests to `pip install ray-serve==X.Y.Z`
 7. [ ] Verify Ray cluster pulls package correctly at runtime

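For the remaining RayService step, a trimmed sketch of what the manifest change could look like (the import path and application name are hypothetical; `X.Y.Z` stays a placeholder until a version is pinned):

```yaml
# Illustrative RayService fragment — import_path and app name are assumptions.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-inference
spec:
  serveConfigV2: |
    applications:
      - name: inference
        import_path: ray_serve.app:deployment  # hypothetical module path
        runtime_env:
          pip:
            - ray-serve==X.Y.Z  # pinned version from the published PyPI package
```

Pinning the package version in `runtime_env` is what gives the independent versioning that the monolithic layout couldn't provide.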
decisions/0038-infrastructure-metrics-collection.md (new file, 143 lines)
@@ -0,0 +1,143 @@
# Infrastructure Metrics Collection Strategy

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry

## Context and Problem Statement

Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability.

How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster?

## Decision Drivers

* Early warning of hardware failures (disks, UPS battery, network gear)
* Visibility into power consumption and UPS status
* Network device monitoring without vendor lock-in
* LAN host reachability tracking for non-Kubernetes services
* Keep all metrics in a single Prometheus instance for unified querying

## Considered Options

1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller
2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts
3. **SNMP polling for everything** - unified SNMP-based collection
4. **External monitoring SaaS** - Uptime Robot, Datadog, etc.

## Decision Outcome

Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves.

### Positive Consequences

* Each exporter is purpose-built and well-maintained by its community
* Native Prometheus integration via ServiceMonitor/ScrapeConfig
* No agents needed on monitored devices
* All metrics queryable in a single Prometheus instance and Grafana
* Dedicated alerting rules per domain (disk health, UPS, LAN)

### Negative Consequences

* Multiple small deployments to maintain (one per exporter type)
* Each exporter has its own configuration format
* SNMP exporter requires SNMPv3 credential management

## Components

### Disk Health: smartctl-exporter

Monitors SMART attributes on all cluster node disks to detect early signs of failure.

| | |
|---|---|
| **Chart** | `prometheus-smartctl-exporter` v0.16.0 |
| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) |

**Alert rules (6):**
- `SmartDeviceHighTemperature` — disk temp > 65°C
- `SmartDeviceTestFailed` — SMART self-test failure
- `SmartDeviceCriticalWarning` — NVMe critical warning bit set
- `SmartDeviceMediaErrors` — NVMe media/integrity errors
- `SmartDeviceAvailableSpareUnderThreshold` — NVMe spare capacity low
- `SmartDeviceInterfaceSlow` — link not running at max negotiated speed

**Dashboard:** Smartctl Exporter (#22604)
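For illustration, one of these alerts could be expressed as a `PrometheusRule` roughly like this (the expression and namespace are assumptions, not the deployed rule):

```yaml
# Illustrative sketch — the deployed expression and labels may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: smartctl-alerts
  namespace: observability
spec:
  groups:
    - name: smartctl
      rules:
        - alert: SmartDeviceHighTemperature
          expr: smartctl_device_temperature{temperature_type="current"} > 65
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Disk {{ $labels.device }} on {{ $labels.instance }} is above 65°C"
```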

### UPS Monitoring: SNMP Exporter

Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load.

| | |
|---|---|
| **Chart** | `prometheus-snmp-exporter` v9.11.0 |
| **Target** | `ups.lab.daviestechlabs.io` |
| **Auth** | SNMPv3 credentials from Vault via ExternalSecret |

**Alert rules:**
- `UPSOnBattery` — critical if ≤ 20 min battery remaining while on battery power
- `UPSReplaceBattery` — critical if UPS diagnostic test reports failure

**Dashboard:** CyberPower UPS (#12340)

The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge.
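A minimal sketch of how Prometheus might poll the UPS through the SNMP exporter using a `ScrapeConfig` (the module name, namespace, and exporter service address are assumptions depending on the generated snmp.yml):

```yaml
# Illustrative sketch — module name and service address are assumptions.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: snmp-ups
  namespace: observability
spec:
  metricsPath: /snmp
  params:
    module: [cyberpower]
  staticConfigs:
    - targets: ["ups.lab.daviestechlabs.io"]
  relabelings:
    # Standard SNMP exporter pattern: the device becomes a query parameter,
    # and Prometheus actually scrapes the exporter service.
    - sourceLabels: [__address__]
      targetLabel: __param_target
    - targetLabel: __address__
      replacement: prometheus-snmp-exporter.observability.svc:9116
```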

### LAN Probing: Blackbox Exporter

Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster.

| | |
|---|---|
| **Chart** | `prometheus-blackbox-exporter` v11.7.0 |
| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) |

**Probe targets:**

| Type | Targets |
|------|---------|
| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar |
| TCP | `expanse.internal:2049` (NFS service) |

**Alert rules:**
- `LanProbeFailed` — critical if any LAN probe fails for 15 minutes
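An ICMP probe set like the one above can be declared with the Prometheus Operator `Probe` CRD; a sketch with an abbreviated host list (namespace, prober address, and interval are assumptions):

```yaml
# Illustrative sketch — host list abbreviated; namespace and prober URL are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: lan-icmp
  namespace: observability
spec:
  module: icmp
  prober:
    url: prometheus-blackbox-exporter.observability.svc:9115
  interval: 60s
  targets:
    staticConfig:
      static:
        - candlekeep
        - drizzt
```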

### Network Equipment: Unpoller

Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller.

| | |
|---|---|
| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` |
| **Controller** | `https://192.168.100.254` |
| **Scrape interval** | 2 minutes (matches UniFi API poll rate) |

**Dashboards (5):**
- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312)
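Unpoller is configured through environment variables; a trimmed container sketch (the Secret name is hypothetical, with credentials in practice coming from Vault via ExternalSecret):

```yaml
# Illustrative container fragment — Secret name is a hypothetical placeholder.
containers:
  - name: unpoller
    image: ghcr.io/unpoller/unpoller:v2.33.0
    env:
      - name: UP_UNIFI_DEFAULT_URL
        value: "https://192.168.100.254"
      - name: UP_UNIFI_DEFAULT_USER
        valueFrom:
          secretKeyRef:
            name: unpoller-credentials  # hypothetical Secret name
            key: username
```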

### External Node Scraping

Static scrape targets for off-cluster hosts running their own exporters (the candlekeep NAS and the JetKVM):

| Target | Port | Exporter |
|--------|------|----------|
| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter |
| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter |
| `jetkvm.lab.daviestechlabs.io` | — | JetKVM device metrics |

Configured via `additionalScrapeConfigs` in kube-prometheus-stack.
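The static targets above can be wired in roughly like this under the kube-prometheus-stack Helm values (job names here are illustrative):

```yaml
# Illustrative values fragment — job names are assumptions.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: candlekeep-node
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9100"]
      - job_name: candlekeep-smartctl
        static_configs:
          - targets: ["candlekeep.lab.daviestechlabs.io:9633"]
```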

## Metrics Coverage Summary

| Domain | Exporter | Key Signals |
|--------|----------|-------------|
| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity |
| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics |
| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity |
| Network gear | Unpoller | AP clients, switch throughput, PDU power |
| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts |
| KVM | Static scrape | JetKVM device metrics |

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md)
decisions/0039-alerting-notification-pipeline.md (new file, 197 lines)
@@ -0,0 +1,197 @@
# Alerting and Notification Pipeline

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab

## Context and Problem Statement

A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions.

How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control?

## Decision Drivers

* Critical alerts must reach the operator within seconds (mobile push + Discord)
* Alert fatigue must be minimized — suppress known-noisy alerts declaratively
* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS)
* Alert routing must be GitOps-managed and version-controlled
* Uptime monitoring needs a public-facing status page

## Considered Options

1. **Alertmanager → ntfy → ntfy-discord bridge** with Silence Operator and Gatus
2. **Alertmanager → Discord webhook directly** with manual silences
3. **Alertmanager → Grafana OnCall** for incident management
4. **External SaaS (PagerDuty, Opsgenie)**

## Decision Outcome

Chosen option: **Option 1 - Alertmanager → ntfy → ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring.

ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs.

### Positive Consequences

* Fully self-hosted, no external dependencies
* ntfy provides mobile push without app-specific integrations
* Decoupled architecture — adding new notification targets only requires subscribing to ntfy topics
* Silence rules are version-controlled Kubernetes resources
* Gatus provides a public status page independent of the alerting pipeline

### Negative Consequences

* Custom bridge service (ntfy-discord) to maintain
* ntfy is a single point of failure for notifications (mitigated by persistent storage)
* No built-in on-call rotation or escalation (acceptable for single operator)

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        ALERT SOURCES                         │
│                                                              │
│   PrometheusRules     Gatus Endpoint      Custom Webhooks    │
│   (metric alerts)     Monitors            (CI, etc.)         │
│        │                   │                   │             │
└────────┼───────────────────┼───────────────────┼─────────────┘
         │                   │                   │
         ▼                   │                   │
┌─────────────────┐          │                   │
│  Alertmanager   │          │                   │
│                 │          │                   │
│  Routes by      │          │                   │
│  severity:      │          │                   │
│  critical→urgent│          │                   │
│  warning→high   │          │                   │
│  default→null   │          │                   │
└────────┬────────┘          │                   │
         │                   │                   │
         │    ┌──────────────┘                   │
         ▼    ▼                                  ▼
┌──────────────────────────────────────────────────────────────┐
│                            ntfy                              │
│                                                              │
│  Topics:                                                     │
│    alertmanager-alerts  ← Alertmanager webhooks              │
│    gatus                ← Gatus endpoint failures            │
│    gitea-ci             ← CI pipeline notifications          │
│                                                              │
│  → Mobile push (ntfy app)                                    │
│  → Web UI at ntfy.daviestechlabs.io                          │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                     ntfy-discord bridge                      │
│                                                              │
│  Subscribes to: alertmanager-alerts, gatus, gitea-ci         │
│  Forwards to:   Discord webhooks (per-topic channels)        │
│  Custom-built Go service with Prometheus metrics             │
└──────────────────────────────────────────────────────────────┘
```

## Component Details

### Alertmanager Routing

Configured via `AlertmanagerConfig` in kube-prometheus-stack:

| Severity | ntfy Priority | Tags | Behavior |
|----------|---------------|------|----------|
| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord |
| `warning` | high | `warning` | Push + Discord |
| All others | — | — | Routed to `null-receiver` (dropped) |

The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages.
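A trimmed sketch of the routing described above as an `AlertmanagerConfig` (receiver names and namespace are illustrative, not the deployed resource):

```yaml
# Illustrative sketch — receiver names and namespace are assumptions.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: ntfy-routing
  namespace: observability
spec:
  route:
    receiver: null-receiver  # default: drop anything unmatched
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: ntfy-critical
      - matchers:
          - name: severity
            value: warning
        receiver: ntfy-warning
  receivers:
    - name: ntfy-critical
      webhookConfigs:
        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
    - name: ntfy-warning
      webhookConfigs:
        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
    - name: null-receiver
```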

### Custom Alert Rules

Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover:

| Rule | Source | Severity |
|------|--------|----------|
| `DockerhubRateLimitRisk` | kube-prometheus-stack | — |
| `OomKilled` | kube-prometheus-stack | — |
| `ZfsUnexpectedPoolState` | kube-prometheus-stack | — |
| `UPSOnBattery` | SNMP exporter | critical |
| `UPSReplaceBattery` | SNMP exporter | critical |
| `LanProbeFailed` | Blackbox exporter | critical |
| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical |
| `GatusEndpointDown` | Gatus | critical |
| `GatusEndpointExposed` | Gatus | critical |

### Noise Management: Silence Operator

The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git.

**Active silences:**

| Silence | Alert Suppressed | Reason |
|---------|------------------|--------|
| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization |
| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected |
| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design |
| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior |
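Each silence is a small custom resource; a sketch of one from the table (the matcher shape follows the silence-operator CRD, but the exact matchers used in the cluster are assumptions):

```yaml
# Illustrative sketch — actual matchers in the deployed CR may differ.
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: keda-hpa-maxed-out
spec:
  matchers:
    - name: alertname
      value: KubeHpaMaxedOut
      isRegex: false
```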

### Uptime Monitoring: Gatus

Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline.

| | |
|---|---|
| **Image** | `ghcr.io/twin/gatus:v5.34.0` |
| **Status page** | `status.daviestechlabs.io` (public) |
| **Admin** | `gatus.daviestechlabs.io` (public) |

**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services.

**Manual endpoints:**
- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP
- Gitea: `git.daviestechlabs.io`
- Container registry: `registry.lab.daviestechlabs.io`

**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy → Discord pipeline.

**PrometheusRules from Gatus metrics:**
- `GatusEndpointDown` — external/service endpoint failure for 5 min → critical
- `GatusEndpointExposed` — internal endpoint reachable from public DNS for 5 min → critical (detects accidental exposure)
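A manual endpoint in the Gatus config looks roughly like this (interval, group, and conditions are illustrative):

```yaml
# Illustrative Gatus endpoint — interval, group, and conditions are assumptions.
endpoints:
  - name: gitea
    group: services
    url: "https://git.daviestechlabs.io"
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: ntfy
```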

### ntfy

| | |
|---|---|
| **Image** | `binwiederhier/ntfy:v2.16.0` |
| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) |
| **Storage** | 5 Gi PVC (SQLite cache) |

Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app.

### ntfy-discord Bridge

| | |
|---|---|
| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` |
| **Source** | Custom Go service (in-repo: `ntfy-discord/`) |

Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor.
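A plausible shape for the bridge's topic-to-webhook mapping, purely as a hypothetical sketch (the actual service's config format is not shown in this commit, and every field name and webhook placeholder below is invented for illustration):

```yaml
# Hypothetical config sketch — field names and placeholders are invented.
ntfyURL: http://ntfy-svc.observability.svc.cluster.local
topics:
  - name: alertmanager-alerts
    discordWebhook: https://discord.com/api/webhooks/<alerts-channel>  # placeholder
  - name: gatus
    discordWebhook: https://discord.com/api/webhooks/<status-channel>  # placeholder
  - name: gitea-ci
    discordWebhook: https://discord.com/api/webhooks/<ci-channel>      # placeholder
```

The per-topic mapping is what lets each alert class land in its own Discord channel.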

## Notification Flow Example

```
1. Prometheus evaluates: smartctl SMART status ≠ 1
2. SmartDeviceTestFailed fires (severity: critical)
3. Alertmanager matches critical route → webhook to ntfy
4. ntfy receives on "alertmanager-alerts" topic
   → Pushes to mobile via ntfy app
   → ntfy-discord subscribes and forwards to Discord webhook
5. Operator receives push notification + Discord message
```

## Links

* Refined by [ADR-0025](0025-observability-stack.md)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md)
* Related to [ADR-0021](0021-notification-architecture.md)
* Related to [ADR-0022](0022-ntfy-discord-bridge.md)