diff --git a/AGENT-ONBOARDING.md b/AGENT-ONBOARDING.md index 32d0f82..25a71ce 100644 --- a/AGENT-ONBOARDING.md +++ b/AGENT-ONBOARDING.md @@ -26,10 +26,15 @@ You are working on a **homelab Kubernetes cluster** running: | `chat-handler` | Text chat with RAG pipeline | | `voice-assistant` | Voice pipeline (STT β†’ RAG β†’ LLM β†’ TTS) | | `kuberay-images` | GPU-specific Ray worker Docker images | +| `pipeline-bridge` | Bridge between pipelines and services | +| `stt-module` | Speech-to-text service | +| `tts-module` | Text-to-speech service | +| `ray-serve` | Ray Serve inference services | | `argo` | Argo Workflows (training, batch inference) | | `kubeflow` | Kubeflow Pipeline definitions | | `mlflow` | MLflow integration utilities | | `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) | +| `ntfy-discord` | ntfy β†’ Discord notification bridge | ## πŸ—οΈ System Architecture (30-Second Version) @@ -73,8 +78,7 @@ kubernetes/apps/ β”‚ β”œβ”€β”€ kubeflow/ # Pipelines, Training Operator β”‚ β”œβ”€β”€ milvus/ # Vector database β”‚ β”œβ”€β”€ nats/ # Message bus -β”‚ β”œβ”€β”€ vllm/ # LLM inference -β”‚ └── llm-workflows/ # GitRepo sync to llm-workflows +β”‚ └── vllm/ # LLM inference β”œβ”€β”€ analytics/ # πŸ“Š Spark, Flink, ClickHouse β”œβ”€β”€ observability/ # πŸ“ˆ Grafana, Alloy, OpenTelemetry └── security/ # πŸ”’ Vault, Authentik, Falco diff --git a/README.md b/README.md index baa536e..1264f2d 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ [![License](https://img.shields.io/badge/License-MIT-green)](LICENSE) -![ADR Count](https://img.shields.io/badge/ADRs-37_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-36-brightgreen) ![Proposed](https://img.shields.io/badge/proposed-0-yellow) +![ADR Count](https://img.shields.io/badge/ADRs-39_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-39-brightgreen) ## πŸ“– Quick Navigation @@ -128,6 +128,8 @@ homelab-design/ | 0035 | [ARM64 Raspberry Pi Worker Node 
Strategy](decisions/0035-arm64-worker-strategy.md) | βœ… accepted | 2026-02-05 | | 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | βœ… accepted | 2026-02-05 | | 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | βœ… accepted | 2026-02-05 | +| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | βœ… accepted | 2026-02-09 | +| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | βœ… accepted | 2026-02-09 | ## πŸ”— Related Repositories @@ -135,9 +137,28 @@ homelab-design/ | Repository | Purpose | |------------|---------| | [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps | -| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows | | [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend | +### AI/ML Repos (git.daviestechlabs.io/daviestechlabs) + +The former monolithic `llm-workflows` repo has been archived and decomposed into: + +| Repository | Purpose | +|------------|--------| +| `handler-base` | Shared Python library for NATS handlers | +| `chat-handler` | Text chat with RAG pipeline | +| `voice-assistant` | Voice pipeline (STT β†’ RAG β†’ LLM β†’ TTS) | +| `pipeline-bridge` | Bridge between pipelines and services | +| `stt-module` | Speech-to-text service | +| `tts-module` | Text-to-speech service | +| `ray-serve` | Ray Serve inference services | +| `kuberay-images` | GPU-specific Ray worker Docker images | +| `argo` | Argo Workflows (training, batch inference) | +| `kubeflow` | Kubeflow Pipeline definitions | +| `mlflow` | MLflow integration utilities | +| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) | +| `ntfy-discord` | ntfy β†’ Discord notification bridge | + ## πŸ“ Contributing 1. 
For architecture changes, create an ADR in `decisions/` diff --git a/decisions/0006-gitops-with-flux.md b/decisions/0006-gitops-with-flux.md index 839bebd..8b839ab 100644 --- a/decisions/0006-gitops-with-flux.md +++ b/decisions/0006-gitops-with-flux.md @@ -35,7 +35,7 @@ Chosen option: "Flux CD", because it provides a mature GitOps implementation wit * Git is single source of truth * Automatic drift detection and correction * Native SOPS/Age secret encryption -* Multi-repository support (homelab-k8s2 + llm-workflows) +* Multi-repository support (homelab-k8s2 + Gitea daviestechlabs repos) * Helm and Kustomize native support * Webhook-free sync (pull-based) @@ -79,8 +79,10 @@ spec: # Public repos don't need secretRef ``` -Note: The monolithic `llm-workflows` repo has been decomposed into separate repos -in the daviestechlabs Gitea organization. See AGENT-ONBOARDING.md for the full list. +Note: The monolithic `llm-workflows` repo has been archived and decomposed into +focused repos in the daviestechlabs Gitea organization (e.g. `chat-handler`, +`voice-assistant`, `handler-base`, `ray-serve`, etc.). See AGENT-ONBOARDING.md +for the full list. 
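+
+As a sketch, a Flux source for one of the decomposed repos might mirror the
+`GitRepository` pattern shown above (the repo name `chat-handler` comes from the
+list in AGENT-ONBOARDING.md; the branch and sync interval here are assumptions):
+
+```yaml
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+  name: chat-handler
+  namespace: flux-system
+spec:
+  interval: 5m
+  url: https://git.daviestechlabs.io/daviestechlabs/chat-handler
+  ref:
+    branch: main
+```
+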
### SOPS Integration diff --git a/decisions/0009-dual-workflow-engines.md b/decisions/0009-dual-workflow-engines.md index 94be24b..4f09f8b 100644 --- a/decisions/0009-dual-workflow-engines.md +++ b/decisions/0009-dual-workflow-engines.md @@ -121,4 +121,4 @@ Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline * [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/) * [Argo Workflows](https://argoproj.github.io/workflows/) * [Argo Events](https://argoproj.github.io/events/) -* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml) +* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea) diff --git a/decisions/0024-ray-repository-structure.md b/decisions/0024-ray-repository-structure.md index 0d8a014..16ada5b 100644 --- a/decisions/0024-ray-repository-structure.md +++ b/decisions/0024-ray-repository-structure.md @@ -35,23 +35,24 @@ | `spark-analytics-jobs` | Spark batch analytics | | `flink-analytics-jobs` | Flink streaming analytics | -### Remaining Ray Component +### Ray Component Repositories -The `ray-serve` code still needs a dedicated repository for Ray Serve model inference services. +Both Ray repositories now exist as standalone repos in the Gitea `daviestechlabs` organization: -| Component | Current Location | Purpose | -|-----------|------------------|---------| -| kuberay-images | `kuberay-images/` (standalone) | Docker images for Ray workers (NVIDIA, AMD, Intel) | -| ray-serve | `llm-workflows/ray-serve/` | Ray Serve inference services | -| llm-workflows | `llm-workflows/` | Pipelines, handlers, STT/TTS, embeddings | +| Component | Location | Purpose | +|-----------|----------|---------| +| kuberay-images | `kuberay-images/` (standalone repo) | Docker images for Ray workers (NVIDIA, AMD, Intel) | +| ray-serve | `ray-serve/` (standalone repo) | Ray Serve inference services | -### Problems with Current Structure +### Problems with Monolithic Structure (Historical) -1. 
**Tight Coupling**: ray-serve changes require llm-workflows repo access -2. **CI/CD Complexity**: Building ray-serve images triggers unrelated workflow steps -3. **Version Management**: Can't independently version ray-serve deployments -4. **Team Access**: Contributors to ray-serve need access to entire llm-workflows repo -5. **Build Times**: Changes to unrelated code can trigger ray-serve rebuilds +These were the problems with the original monolithic `llm-workflows` structure (now resolved): + +1. **Tight Coupling**: ray-serve changes required llm-workflows repo access +2. **CI/CD Complexity**: Building ray-serve images triggered unrelated workflow steps +3. **Version Management**: Couldn't independently version ray-serve deployments +4. **Team Access**: Contributors to ray-serve needed access to entire llm-workflows repo +5. **Build Times**: Changes to unrelated code could trigger ray-serve rebuilds ## Decision @@ -160,9 +161,9 @@ ray-serve/ # PyPI package - application code 1. βœ… `kuberay-images` already exists as standalone repo 2. βœ… `llm-workflows` archived - all components extracted to dedicated repos -3. [ ] Create `ray-serve` repo on Gitea -4. [ ] Move `.gitea/workflows/publish-ray-serve.yaml` to new repo -5. [ ] Set up pyproject.toml for PyPI publishing +3. βœ… `ray-serve` repo created on Gitea (`git.daviestechlabs.io/daviestechlabs/ray-serve`) +4. βœ… CI workflows moved to new repo +5. βœ… pyproject.toml configured for PyPI publishing 6. [ ] Update RayService manifests to `pip install ray-serve==X.Y.Z` 7. 
[ ] Verify Ray cluster pulls package correctly at runtime diff --git a/decisions/0038-infrastructure-metrics-collection.md b/decisions/0038-infrastructure-metrics-collection.md new file mode 100644 index 0000000..6b12925 --- /dev/null +++ b/decisions/0038-infrastructure-metrics-collection.md @@ -0,0 +1,143 @@ +# Infrastructure Metrics Collection Strategy + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Define what physical and network infrastructure to monitor and how to collect metrics beyond standard Kubernetes telemetry + +## Context and Problem Statement + +Standard Kubernetes observability (kube-state-metrics, node-exporter, cAdvisor) covers container and node health, but a homelab includes physical infrastructure that Kubernetes doesn't know about: UPS power, disk health, network equipment, and LAN host availability. + +How do we extend Prometheus metrics collection to cover the full homelab infrastructure, including devices and hosts outside the Kubernetes cluster? + +## Decision Drivers + +* Early warning of hardware failures (disks, UPS battery, network gear) +* Visibility into power consumption and UPS status +* Network device monitoring without vendor lock-in +* LAN host reachability tracking for non-Kubernetes services +* Keep all metrics in a single Prometheus instance for unified querying + +## Considered Options + +1. **Purpose-built Prometheus exporters per domain** - smartctl, SNMP, blackbox, unpoller +2. **Agent-based monitoring (Telegraf/Datadog agent)** - deploy agents to all hosts +3. **SNMP polling for everything** - unified SNMP-based collection +4. **External monitoring SaaS** - Uptime Robot, Datadog, etc. + +## Decision Outcome + +Chosen option: **Option 1 - Purpose-built Prometheus exporters per domain**, because each exporter is best-in-class for its domain, they integrate natively with Prometheus ServiceMonitors, and they require zero configuration on the monitored targets themselves. 
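+
+The "native ServiceMonitor integration" this decision relies on can be sketched
+as follows (the namespace, label selector, and port name are assumptions about
+the chart's defaults, not verified values):
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: smartctl-exporter
+  namespace: observability
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: prometheus-smartctl-exporter
+  endpoints:
+    - port: metrics
+      interval: 60s
+```
+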
+ +### Positive Consequences + +* Each exporter is purpose-built and well-maintained by its community +* Native Prometheus integration via ServiceMonitor/ScrapeConfig +* No agents needed on monitored devices +* All metrics queryable in a single Prometheus instance and Grafana +* Dedicated alerting rules per domain (disk health, UPS, LAN) + +### Negative Consequences + +* Multiple small deployments to maintain (one per exporter type) +* Each exporter has its own configuration format +* SNMP exporter requires SNMPv3 credential management + +## Components + +### Disk Health: smartctl-exporter + +Monitors SMART attributes on all cluster node disks to detect early signs of failure. + +| | | +|---|---| +| **Chart** | `prometheus-smartctl-exporter` v0.16.0 | +| **Scope** | All amd64 nodes (excludes Raspberry Pi / ARM workers) | + +**Alert rules (6):** +- `SmartDeviceHighTemperature` β€” disk temp > 65Β°C +- `SmartDeviceTestFailed` β€” SMART self-test failure +- `SmartDeviceCriticalWarning` β€” NVMe critical warning bit set +- `SmartDeviceMediaErrors` β€” NVMe media/integrity errors +- `SmartDeviceAvailableSpareUnderThreshold` β€” NVMe spare capacity low +- `SmartDeviceInterfaceSlow` β€” link not running at max negotiated speed + +**Dashboard:** Smartctl Exporter (#22604) + +### UPS Monitoring: SNMP Exporter + +Monitors the CyberPower UPS via SNMPv3 for power status, battery health, and load. + +| | | +|---|---| +| **Chart** | `prometheus-snmp-exporter` v9.11.0 | +| **Target** | `ups.lab.daviestechlabs.io` | +| **Auth** | SNMPv3 credentials from Vault via ExternalSecret | + +**Alert rules:** +- `UPSOnBattery` β€” critical if ≀ 20 min battery remaining while on battery power +- `UPSReplaceBattery` β€” critical if UPS diagnostic test reports failure + +**Dashboard:** CyberPower UPS (#12340) + +The UPS load metric (`upsHighPrecOutputLoad`) is also consumed by Kromgo to display cluster power usage as an SVG badge. 
+ +### LAN Probing: Blackbox Exporter + +Probes LAN hosts and services to detect outages for devices outside the Kubernetes cluster. + +| | | +|---|---| +| **Chart** | `prometheus-blackbox-exporter` v11.7.0 | +| **Modules** | `http_2xx`, `icmp`, `tcp_connect` (all IPv4-preferred) | + +**Probe targets:** +| Type | Targets | +|------|---------| +| ICMP | candlekeep, bruenor, catti, danilo, jetkvm, drizzt, elminster, regis, storm, wulfgar | +| TCP | `expanse.internal:2049` (NFS service) | + +**Alert rules:** +- `LanProbeFailed` β€” critical if any LAN probe fails for 15 minutes + +### Network Equipment: Unpoller + +Exports UniFi network device metrics (APs, switches, PDUs) from the UniFi controller. + +| | | +|---|---| +| **Image** | `ghcr.io/unpoller/unpoller:v2.33.0` | +| **Controller** | `https://192.168.100.254` | +| **Scrape interval** | 2 minutes (matches UniFi API poll rate) | + +**Dashboards (5):** +- UniFi PDU (#23027), Insights (#11315), Network Sites (#11311), UAP (#11314), USW (#11312) + +### External Node Scraping + +Static scrape targets for the NAS host running its own exporters outside the cluster: + +| Target | Port | Exporter | +|--------|------|----------| +| `candlekeep.lab.daviestechlabs.io` | 9100 | node-exporter | +| `candlekeep.lab.daviestechlabs.io` | 9633 | smartctl-exporter | +| `jetkvm.lab.daviestechlabs.io` | β€” | JetKVM device metrics | + +Configured via `additionalScrapeConfigs` in kube-prometheus-stack. 
+ +## Metrics Coverage Summary + +| Domain | Exporter | Key Signals | +|--------|----------|-------------| +| Disk health | smartctl-exporter | Temperature, SMART status, media errors, spare capacity | +| Power/UPS | SNMP exporter | Battery status, load, runtime remaining, diagnostics | +| LAN hosts | Blackbox exporter | ICMP reachability, TCP connectivity | +| Network gear | Unpoller | AP clients, switch throughput, PDU power | +| NAS/external | Static scrape | Node metrics, disk health for off-cluster hosts | +| KVM | Static scrape | JetKVM device metrics | + +## Links + +* Refined by [ADR-0025](0025-observability-stack.md) +* Related to [ADR-0039](0039-alerting-notification-pipeline.md) diff --git a/decisions/0039-alerting-notification-pipeline.md b/decisions/0039-alerting-notification-pipeline.md new file mode 100644 index 0000000..8d5b3fe --- /dev/null +++ b/decisions/0039-alerting-notification-pipeline.md @@ -0,0 +1,197 @@ +# Alerting and Notification Pipeline + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Design a reliable alerting pipeline from Prometheus to mobile/Discord notifications with noise management for a single-operator homelab + +## Context and Problem Statement + +A homelab with 10+ LAN hosts, GPU workloads, UPS power, and dozens of services generates many alerts. A single operator needs to receive critical notifications promptly while avoiding alert fatigue from known-noisy conditions. + +How do we route alerts from Prometheus to actionable notifications on Discord and mobile, while keeping noise under control? 
+ +## Decision Drivers + +* Critical alerts must reach the operator within seconds (mobile push + Discord) +* Alert fatigue must be minimized β€” suppress known-noisy alerts declaratively +* The pipeline should be fully self-hosted (no PagerDuty/Opsgenie SaaS) +* Alert routing must be GitOps-managed and version-controlled +* Uptime monitoring needs a public-facing status page + +## Considered Options + +1. **Alertmanager β†’ ntfy β†’ ntfy-discord bridge** with Silence Operator and Gatus +2. **Alertmanager β†’ Discord webhook directly** with manual silences +3. **Alertmanager β†’ Grafana OnCall** for incident management +4. **External SaaS (PagerDuty, Opsgenie)** + +## Decision Outcome + +Chosen option: **Option 1 - Alertmanager β†’ ntfy β†’ ntfy-discord bridge** with declarative silence management via Silence Operator and Gatus for uptime monitoring. + +ntfy serves as a central notification hub that decouples alert producers from consumers. The custom ntfy-discord bridge forwards to Discord, while ntfy itself delivers mobile push notifications. Silence Operator manages suppression rules as Kubernetes CRs. 
+ +### Positive Consequences + +* Fully self-hosted, no external dependencies +* ntfy provides mobile push without app-specific integrations +* Decoupled architecture β€” adding new notification targets only requires subscribing to ntfy topics +* Silence rules are version-controlled Kubernetes resources +* Gatus provides a public status page independent of the alerting pipeline + +### Negative Consequences + +* Custom bridge service (ntfy-discord) to maintain +* ntfy is a single point of failure for notifications (mitigated by persistent storage) +* No built-in on-call rotation or escalation (acceptable for single operator) + +## Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ ALERT SOURCES β”‚ +β”‚ β”‚ +β”‚ PrometheusRules Gatus Endpoint Custom Webhooks β”‚ +β”‚ (metric alerts) Monitors (CI, etc.) β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ + β–Ό β”‚ β”‚ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ Alertmanager β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ Routes by β”‚ β”‚ β”‚ +β”‚ severity: β”‚ β”‚ β”‚ +β”‚ criticalβ†’urgentβ”‚ β”‚ β”‚ +β”‚ warningβ†’high β”‚ β”‚ β”‚ +β”‚ defaultβ†’null β”‚ β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ + β”‚ β”‚ β”‚ + β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ + β–Ό β–Ό β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ ntfy β”‚ +β”‚ β”‚ +β”‚ Topics: β”‚ +β”‚ alertmanager-alerts ← Alertmanager webhooks β”‚ +β”‚ gatus ← Gatus endpoint failures β”‚ +β”‚ gitea-ci ← CI pipeline 
notifications β”‚ +β”‚ β”‚ +β”‚ β†’ Mobile push (ntfy app) β”‚ +β”‚ β†’ Web UI at ntfy.daviestechlabs.io β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ ntfy-discord bridge β”‚ +β”‚ β”‚ +β”‚ Subscribes to: alertmanager-alerts, gatus, gitea-ci β”‚ +β”‚ Forwards to: Discord webhooks (per-topic channels) β”‚ +β”‚ Custom-built Go service with Prometheus metrics β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Component Details + +### Alertmanager Routing + +Configured via `AlertmanagerConfig` in kube-prometheus-stack: + +| Severity | ntfy Priority | Tags | Behavior | +|----------|---------------|------|----------| +| `critical` | urgent | `rotating_light`, `alert` | Immediate push + Discord | +| `warning` | high | `warning` | Push + Discord | +| All others | β€” | β€” | Routed to `null-receiver` (dropped) | + +The webhook sends to `http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts` with Alertmanager template expansion for human-readable messages. 
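+
+A simplified sketch of the severity routing above as an `AlertmanagerConfig`.
+The real config also sets per-severity ntfy priority and tags via template
+expansion, which this sketch omits; receiver names are assumptions:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1alpha1
+kind: AlertmanagerConfig
+metadata:
+  name: ntfy-routing
+  namespace: observability
+spec:
+  route:
+    receiver: null-receiver   # anything unmatched is dropped
+    routes:
+      - receiver: ntfy
+        matchers:
+          - name: severity
+            value: critical
+            matchType: "="
+      - receiver: ntfy
+        matchers:
+          - name: severity
+            value: warning
+            matchType: "="
+  receivers:
+    - name: ntfy
+      webhookConfigs:
+        - url: http://ntfy-svc.observability.svc.cluster.local/alertmanager-alerts
+    - name: null-receiver
+```
+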
+ +### Custom Alert Rules + +Beyond standard kube-prometheus-stack rules, custom `PrometheusRules` cover: + +| Rule | Source | Severity | +|------|--------|----------| +| `DockerhubRateLimitRisk` | kube-prometheus-stack | β€” | +| `OomKilled` | kube-prometheus-stack | β€” | +| `ZfsUnexpectedPoolState` | kube-prometheus-stack | β€” | +| `UPSOnBattery` | SNMP exporter | critical | +| `UPSReplaceBattery` | SNMP exporter | critical | +| `LanProbeFailed` | Blackbox exporter | critical | +| `SmartDevice*` (6 rules) | smartctl-exporter | warning/critical | +| `GatusEndpointDown` | Gatus | critical | +| `GatusEndpointExposed` | Gatus | critical | + +### Noise Management: Silence Operator + +The [Silence Operator](https://github.com/giantswarm/silence-operator) manages Alertmanager silences as Kubernetes custom resources, keeping suppression rules version-controlled in Git. + +**Active silences:** + +| Silence | Alert Suppressed | Reason | +|---------|------------------|--------| +| `longhorn-node-storage-diskspace-warning` | `NodeDiskHighUtilization` | Longhorn storage devices are intentionally high-utilization | +| `node-root-diskspace-warning` | `NodeDiskHighUtilization` | Root partition usage is expected | +| `nas-memory-high-utilization` | `NodeMemoryHighUtilization` | NAS (candlekeep) runs memory-intensive workloads by design | +| `keda-hpa-maxed-out` | `KubeHpaMaxedOut` | KEDA-managed HPAs scaling to max is normal behavior | + +### Uptime Monitoring: Gatus + +Gatus provides endpoint monitoring and a public-facing status page, independent of the Prometheus alerting pipeline. + +| | | +|---|---| +| **Image** | `ghcr.io/twin/gatus:v5.34.0` | +| **Status page** | `status.daviestechlabs.io` (public) | +| **Admin** | `gatus.daviestechlabs.io` (public) | + +**Auto-discovery:** A sidecar watches Kubernetes HTTPRoutes and Services, automatically generating monitoring endpoints for all exposed services. 
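+
+An individual Gatus endpoint entry — whether the sidecar-generated entries look
+exactly like this is an assumption — might be sketched as (conditions use
+Gatus's own placeholder syntax; `type: ntfy` assumes the ntfy alerting provider
+is configured globally):
+
+```yaml
+endpoints:
+  - name: gitea
+    group: external
+    url: https://git.daviestechlabs.io
+    interval: 1m
+    conditions:
+      - "[STATUS] == 200"
+    alerts:
+      - type: ntfy
+```
+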
+ +**Manual endpoints:** +- Connectivity checks: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) via ICMP +- Gitea: `git.daviestechlabs.io` +- Container registry: `registry.lab.daviestechlabs.io` + +**Alerting:** Gatus sends failures to the `gatus` ntfy topic, which flows through the same ntfy β†’ Discord pipeline. + +**PrometheusRules from Gatus metrics:** +- `GatusEndpointDown` β€” external/service endpoint failure for 5 min β†’ critical +- `GatusEndpointExposed` β€” internal endpoint reachable from public DNS for 5 min β†’ critical (detects accidental exposure) + +### ntfy + +| | | +|---|---| +| **Image** | `binwiederhier/ntfy:v2.16.0` | +| **URL** | `ntfy.daviestechlabs.io` (public, Authentik SSO) | +| **Storage** | 5 Gi PVC (SQLite cache) | + +Serves as the central notification hub. Protected by Authentik forward-auth via Envoy Gateway. Receives webhooks from Alertmanager and Gatus, delivers push notifications to the ntfy mobile app. + +### ntfy-discord Bridge + +| | | +|---|---| +| **Image** | `registry.lab.daviestechlabs.io/billy/ntfy-discord:v0.0.1` | +| **Source** | Custom Go service (in-repo: `ntfy-discord/`) | + +Subscribes to ntfy topics and forwards notifications to Discord webhooks. Each topic maps to a Discord channel/webhook. Exposes Prometheus metrics via PodMonitor. + +## Notification Flow Example + +``` +1. Prometheus evaluates: smartctl SMART status β‰  1 +2. SmartDeviceTestFailed fires (severity: critical) +3. Alertmanager matches critical route β†’ webhook to ntfy +4. ntfy receives on "alertmanager-alerts" topic + β†’ Pushes to mobile via ntfy app + β†’ ntfy-discord subscribes and forwards to Discord webhook +5. Operator receives push notification + Discord message +``` + +## Links + +* Refined by [ADR-0025](0025-observability-stack.md) +* Related to [ADR-0038](0038-infrastructure-metrics-collection.md) +* Related to [ADR-0021](0021-notification-architecture.md) +* Related to [ADR-0022](0022-ntfy-discord-bridge.md)