docs: add ADRs 0043-0053 covering remaining architecture gaps

New ADRs:
- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
2026-02-09 18:36:39 -05:00
parent 49ce970780
commit 5846d0dc16
12 changed files with 1141 additions and 1 deletions
--- a/decisions/0052-cluster-utilities-optimization.md
+++ b/decisions/0052-cluster-utilities-optimization.md
@@ -0,0 +1,104 @@
+# Cluster Utilities and Optimization
+
+* Status: accepted
+* Date: 2026-02-09
+* Deciders: Billy
+* Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead
+
+## Context and Problem Statement
+
+A Kubernetes cluster running diverse workloads benefits from several operational utilities — image caching to reduce pull times, workload rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small individually but collectively they significantly improve cluster operations.
+
+How do we manage these cross-cutting cluster utilities consistently?
+
+## Decision Drivers
+
+* Reduce container image pull latency across nodes
+* Automatically rebalance workloads for even resource utilization
+* Eliminate manual pod restarts when secrets/configmaps change
+* Provide shared NFS storage class for ReadWriteMany workloads
+* Minimal resource overhead per utility
+
+## Decision Outcome
+
+Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.
+
+## Components
+
+### Spegel — Peer-to-Peer Image Registry Mirror
+
+Spegel distributes container images between nodes, so pulling an image already present on _any_ node avoids hitting the external registry.
+
+| | |
+|---|---|
+| **Chart** | `spegel` OCI chart v0.3.0 |
+| **Namespace** | `spegel` |
+| **Port** | 29999 |
+| **Mode** | P2P mirror (DaemonSet, one pod per node) |
+
+**Mirrored Registries:**
+- `docker.io`, `ghcr.io`, `quay.io`, `gcr.io`
+- `registry.k8s.io`, `mcr.microsoft.com`
+- `git.daviestechlabs.io` (Gitea), `public.ecr.aws`
+
+Spegel registers as a containerd mirror, so image pulls are tried against peer nodes before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that would otherwise be pulled from the external registry repeatedly.
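+
+As a rough illustration of the settings in the table above, the Helm values for the Spegel chart might look like the sketch below. The key names (`spegel.mirroredRegistries`, `service.registry.hostPort`) are assumptions based on the upstream chart and should be checked against the values file shipped with the chart version in use. The `https://` scheme on each registry is likewise assumed.
+
+```yaml
+# Hypothetical values sketch; verify key names against the chart's values.yaml.
+spegel:
+  # Registries for which containerd mirror configuration is generated.
+  mirroredRegistries:
+    - https://docker.io
+    - https://ghcr.io
+    - https://quay.io
+    - https://gcr.io
+    - https://registry.k8s.io
+    - https://mcr.microsoft.com
+    - https://git.daviestechlabs.io
+    - https://public.ecr.aws
+service:
+  registry:
+    # Node-local port that containerd uses to reach the mirror.
+    hostPort: 29999
+```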
+
+### Descheduler — Workload Rebalancing
+
+The descheduler evicts pods from overutilized nodes so that the scheduler can place them again more evenly across the cluster.
+
+| | |
+|---|---|
+| **Chart** | `descheduler` v0.33.0 |
+| **Namespace** | `descheduler` |
+| **Mode** | Deployment (continuous) |
+| **Strategy** | `LowNodeUtilization` |
+
+**Excluded Namespaces:** `ai-ml`, `kuberay`, `gitea`
+
+AI/ML and Gitea namespaces are excluded because GPU workloads and git repositories should not be disrupted by rebalancing.
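+
+The strategy and namespace exclusions above could be expressed in a descheduler policy roughly as follows. This is a minimal sketch using the `descheduler/v1alpha2` policy format. The threshold percentages are illustrative placeholders, not the values used in this cluster.
+
+```yaml
+apiVersion: descheduler/v1alpha2
+kind: DeschedulerPolicy
+profiles:
+  - name: default
+    pluginConfig:
+      - name: LowNodeUtilization
+        args:
+          # Nodes below all of these are considered underutilized (placeholder values).
+          thresholds:
+            cpu: 20
+            memory: 20
+            pods: 20
+          # Nodes above any of these are candidates for eviction (placeholder values).
+          targetThresholds:
+            cpu: 50
+            memory: 50
+            pods: 50
+          # Pods in these namespaces are never evicted by this plugin.
+          evictableNamespaces:
+            exclude:
+              - ai-ml
+              - kuberay
+              - gitea
+    plugins:
+      balance:
+        enabled:
+          - LowNodeUtilization
+```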
+
+### Reloader — Automatic Config Reload
+
+Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.
+
+| | |
+|---|---|
+| **Chart** | `reloader` v2.2.7 |
+| **Namespace** | `reloader` |
+| **Monitoring** | PodMonitor enabled |
+| **Security** | Read-only root filesystem |
+
+This eliminates the need to run `kubectl rollout restart` manually after Vault secret rotations or config changes.
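+
+For illustration, a workload typically opts in to Reloader through an annotation on its metadata. The sketch below uses the `reloader.stakater.com/auto` annotation. The Deployment name, image, and Secret name are hypothetical.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: example-app            # hypothetical workload
+  annotations:
+    # Restart this Deployment when any Secret or ConfigMap it references changes.
+    reloader.stakater.com/auto: "true"
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: example-app
+  template:
+    metadata:
+      labels:
+        app: example-app
+    spec:
+      containers:
+        - name: app
+          image: example.invalid/app:1.0   # hypothetical image
+          envFrom:
+            - secretRef:
+                name: example-app-secret   # hypothetical Secret
+```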
+
+### CSI-NFS — NFS StorageClass
+
+Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.
+
+| | |
+|---|---|
+| **Chart** | `csi-driver-nfs` v4.13.0 |
+| **Namespace** | `csi-nfs` |
+| **StorageClass** | `nfs-slow` |
+| **NFS Server** | `candlekeep` → `/kubernetes` |
+| **NFS Version** | 4.1, `nconnect=16` |
+
+`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` mount option lets the NFS client open up to 16 TCP connections to the server, which improves throughput.
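+
+A StorageClass matching the table above, together with a claim that uses it, might look like the following sketch. The `server`, `share`, and mount options come from this ADR. The claim name and requested size are hypothetical, and the reclaim policy is left at the Kubernetes default.
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: nfs-slow
+provisioner: nfs.csi.k8s.io
+parameters:
+  server: candlekeep
+  share: /kubernetes
+mountOptions:
+  - nfsvers=4.1
+  - nconnect=16
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: example-shared-data    # hypothetical claim
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: nfs-slow
+  resources:
+    requests:
+      storage: 10Gi            # hypothetical size
+```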
+
+## Resource Overhead
+
+| Utility | Pods | CPU Request | Memory Request |
+|---------|------|-------------|----------------|
+| Spegel | 1 per node (DaemonSet) | — | — |
+| Descheduler | 1 | — | — |
+| Reloader | 1 | — | — |
+| CSI-NFS | 1 controller + DaemonSet | — | — |
+| **Total** | ~8-12 pods | Minimal | Minimal |
+
+All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
+
+## Links
+
+* Related to [ADR-0026](0026-storage-strategy.md) (Longhorn + NFS storage strategy)
+* Related to [ADR-0003](0003-bare-metal-kubernetes.md) (Talos container runtime / containerd)
+* [Spegel](https://github.com/spegel-org/spegel) · [Descheduler](https://sigs.k8s.io/descheduler) · [Reloader](https://github.com/stakater/Reloader) · [CSI-NFS](https://github.com/kubernetes-csi/csi-driver-nfs)