docs: add ADRs 0043-0053 covering remaining architecture gaps
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
New ADRs:

* 0043: Cilium CNI and Network Fabric
* 0044: DNS and External Access Architecture
* 0045: TLS Certificate Strategy (cert-manager)
* 0046: Companions Frontend Architecture
* 0047: MLflow Experiment Tracking and Model Registry
* 0048: Entertainment and Media Stack
* 0049: Self-Hosted Productivity Suite
* 0050: Argo Rollouts Progressive Delivery
* 0051: KEDA Event-Driven Autoscaling
* 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
* 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
68
decisions/0051-keda-event-driven-autoscaling.md
Normal file
@@ -0,0 +1,68 @@
# KEDA Event-Driven Autoscaling

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Scale workloads based on external event sources rather than only CPU/memory metrics
## Context and Problem Statement

The Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory, but many homelab workloads have scaling signals that live in external systems: Envoy Gateway request queues, NATS queue depth, or GPU utilization. Scaling on the right signal reduces latency and avoids over-provisioning.

How do we autoscale workloads based on external metrics such as message queue depth, HTTP request rates, and custom Prometheus queries?
## Decision Drivers

* Scale on NATS queue depth for inference pipelines
* Scale on Envoy Gateway metrics for HTTP workloads
* Prometheus integration for arbitrary custom metrics
* CRD-based scalers compatible with Flux GitOps
* Low resource overhead for the scaler controller itself
## Considered Options

1. **KEDA** — Kubernetes Event-Driven Autoscaling
2. **Custom HPA with Prometheus Adapter** — HPA + external-metrics API
3. **Knative Serving** — Serverless autoscaler with scale-to-zero
## Decision Outcome

Chosen option: **KEDA**, because it provides a large catalog of built-in scalers (Prometheus, NATS, HTTP), supports scale-to-zero, and integrates cleanly with the existing HelmRelease/Kustomization GitOps workflow.
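As a concrete illustration of the chosen option, a KEDA `ScaledObject` for the Prometheus-backed inference case might look like the sketch below. The deployment name, namespace, PromQL query, and threshold are illustrative assumptions, not values taken from this repository.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ray-serve-inference        # hypothetical name
  namespace: inference             # hypothetical namespace
spec:
  scaleTargetRef:
    name: ray-serve-deployment     # Deployment to scale (assumed)
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed Prometheus address
        query: sum(ray_serve_num_pending_requests)            # assumed queue-depth metric
        threshold: "10"            # scale up when pending requests exceed this per replica
```

KEDA translates this CRD into an HPA behind the scenes, which is why the same Git-managed manifest flow used for other workloads applies unchanged.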
### Positive Consequences

* 60+ built-in scalers covering all homelab event sources
* ScaledObject CRDs fit naturally into the GitOps workflow
* Scale-to-zero for bursty workloads (saves GPU resources)
* ServiceMonitors for self-monitoring via Prometheus
* Grafana dashboard included for visibility
### Negative Consequences

* Additional CRDs and controller pods
* ScaledObject/TriggerAuthentication learning curve
* Potential conflict with manually-defined HPAs
## Deployment Configuration

| Setting | Value |
|---|---|
| **Chart** | `keda` OCI chart v2.19.0 |
| **Namespace** | `keda` |
| **Monitoring** | ServiceMonitor enabled, Grafana dashboard provisioned |
| **Webhooks** | Enabled |
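The configuration above could be expressed as a Flux source plus HelmRelease along these lines. The OCI URL and the `values` keys are assumptions about the published chart and its value schema; check them against the actual chart before use.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: keda
  namespace: keda
spec:
  interval: 1h
  url: oci://ghcr.io/kedacore/charts/keda   # assumed OCI chart location
  ref:
    semver: 2.19.0
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: keda
  namespace: keda
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: keda
  values:
    prometheus:                # assumed value path for ServiceMonitor wiring
      operator:
        enabled: true
        serviceMonitor:
          enabled: true
```

Pinning the chart with a `semver` ref keeps upgrades explicit in Git rather than implicit on reconcile.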
## Scaling Use Cases

| Workload | Scaler | Signal | Target |
|----------|--------|--------|--------|
| Ray Serve inference | Prometheus | Pending request queue depth | 1-4 replicas |
| Envoy Gateway | Prometheus | Active connections per gateway | KEDA-managed Envoy proxy fleet |
| Voice pipeline | NATS | Message queue length | 0-2 replicas |
| Batch inference | Prometheus | Job queue size | 0-N GPU pods |
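The voice-pipeline row is the scale-to-zero case; with a NATS JetStream trigger it might be sketched as follows. The stream, consumer, and monitoring endpoint names are hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: voice-pipeline           # hypothetical name
  namespace: voice               # hypothetical namespace
spec:
  scaleTargetRef:
    name: voice-worker           # Deployment to scale (assumed)
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 2
  triggers:
    - type: nats-jetstream
      metadata:
        natsServerMonitoringEndpoint: "nats.nats.svc:8222"  # assumed NATS monitoring endpoint
        account: "$G"                                       # default NATS account
        stream: "voice"                                     # assumed stream name
        consumer: "voice-worker"                            # assumed durable consumer
        lagThreshold: "5"        # pending messages per replica before scaling up
```

With `minReplicaCount: 0`, KEDA deactivates the workload entirely between bursts, which is what frees GPU resources for other pods.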
## Links

* Related to [ADR-0010](0010-scalable-inference-platform.md) (inference scaling)
* Related to [ADR-0038](0038-infrastructure-metrics-collection.md) (Prometheus metrics)
* [KEDA Documentation](https://keda.sh/docs/)