docs: add ADRs 0043-0053 covering remaining architecture gaps

New ADRs:
- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
2026-02-09 18:36:39 -05:00
parent 49ce970780
commit 5846d0dc16
12 changed files with 1141 additions and 1 deletions

@@ -0,0 +1,104 @@
# Cluster Utilities and Optimization
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead
## Context and Problem Statement
A Kubernetes cluster running diverse workloads benefits from several operational utilities — image caching to reduce pull times, workload rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small individually but collectively they significantly improve cluster operations.
How do we manage these cross-cutting cluster utilities consistently?
## Decision Drivers
* Reduce container image pull latency across nodes
* Automatically rebalance workloads for even resource utilization
* Eliminate manual pod restarts when secrets/configmaps change
* Provide shared NFS storage class for ReadWriteMany workloads
* Minimal resource overhead per utility
## Decision Outcome
Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.
## Components
### Spegel — Peer-to-Peer Image Registry Mirror
Spegel distributes container images between nodes, so pulling an image already present on _any_ node avoids hitting the external registry.
| | |
|---|---|
| **Chart** | `spegel` OCI chart v0.3.0 |
| **Namespace** | `spegel` |
| **Port** | 29999 |
| **Mode** | P2P mirror (DaemonSet, one pod per node) |
**Mirrored Registries:**
- `docker.io`, `ghcr.io`, `quay.io`, `gcr.io`
- `registry.k8s.io`, `mcr.microsoft.com`
- `git.daviestechlabs.io` (Gitea), `public.ecr.aws`
Spegel registers as a containerd registry mirror, so image pulls are answered by peer nodes before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that would otherwise be downloaded from the external registry by every node that runs them.
### Descheduler — Workload Rebalancing
The descheduler evicts pods from overutilized nodes so that the scheduler can place the replacements on underutilized ones.
| | |
|---|---|
| **Chart** | `descheduler` v0.33.0 |
| **Namespace** | `descheduler` |
| **Mode** | Deployment (continuous) |
| **Strategy** | `LowNodeUtilization` |
**Excluded Namespaces:** `ai-ml`, `kuberay`, `gitea`
The AI/ML namespaces (`ai-ml`, `kuberay`) and `gitea` are excluded because GPU workloads and Git repositories should not be disrupted by rebalancing evictions.
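As a rough illustration, a descheduler policy along these lines could express the strategy and the namespace exclusions. This is a sketch, not the deployed configuration: the profile name and all threshold percentages are assumed values, and the ADR does not state whether the exclusions are set on the plugin or elsewhere.

```yaml
# Sketch of a DeschedulerPolicy (descheduler/v1alpha2).
# Only the strategy name and the excluded namespaces come from this ADR;
# the profile name and threshold values are illustrative assumptions.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default                 # hypothetical profile name
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          # Nodes below ALL of these are considered underutilized (assumed values)
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          # Nodes above ANY of these are eviction candidates (assumed values)
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          # Pods in these namespaces are never evicted by this plugin
          evictableNamespaces:
            exclude:
              - "ai-ml"
              - "kuberay"
              - "gitea"
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
```

Nodes whose usage falls between the two threshold sets are left alone, which keeps the descheduler from churning pods on a cluster that is already reasonably balanced.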
### Reloader — Automatic Config Reload
Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.
| | |
|---|---|
| **Chart** | `reloader` v2.2.7 |
| **Namespace** | `reloader` |
| **Monitoring** | PodMonitor enabled |
| **Security** | Read-only root filesystem |
This eliminates manual `kubectl rollout restart` commands after Vault secret rotations or config changes.
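For illustration, a workload would typically opt in to Reloader with an annotation such as the one below. This is a minimal sketch: the Deployment name, image, and Secret name are hypothetical, and the ADR does not say whether the cluster relies on per-workload annotations or a cluster-wide setting.

```yaml
# Sketch: a Deployment that Reloader restarts when a referenced
# Secret or ConfigMap changes. All names below are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  annotations:
    # Ask Reloader to watch every Secret/ConfigMap this pod spec references
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:1.0.0   # hypothetical image
          envFrom:
            - secretRef:
                name: example-app-secrets    # hypothetical Secret
```

With the `auto` annotation, Reloader discovers the referenced objects from the pod spec itself, so nothing has to be listed by hand when a new Secret is added.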
### CSI-NFS — NFS StorageClass
Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.
| | |
|---|---|
| **Chart** | `csi-driver-nfs` v4.13.0 |
| **Namespace** | `csi-nfs` |
| **StorageClass** | `nfs-slow` |
| **NFS Server** | `candlekeep`, export `/kubernetes` |
| **NFS Version** | 4.1, `nconnect=16` |
`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` mount option lets the NFS client open up to 16 TCP connections to the server instead of one, which improves throughput.
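The settings in the table above could translate into a StorageClass roughly like the following, shown together with a claim that consumes it. This is a sketch under assumptions: the server, share, class name, and mount options come from this ADR, while `reclaimPolicy`, `volumeBindingMode`, and everything about the PersistentVolumeClaim are illustrative.

```yaml
# Sketch of the nfs-slow StorageClass for csi-driver-nfs.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-slow
provisioner: nfs.csi.k8s.io
parameters:
  server: candlekeep       # NAS hostname from this ADR
  share: /kubernetes       # NFS export from this ADR
reclaimPolicy: Delete          # assumption, not stated in the ADR
volumeBindingMode: Immediate   # assumption, not stated in the ADR
mountOptions:
  - nfsvers=4.1
  - nconnect=16
---
# Hypothetical claim requesting shared (ReadWriteMany) storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-media       # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 100Gi       # illustrative size
```

Several pods on different nodes can mount the same claim at once, which is the property that Longhorn block volumes do not offer by default and the reason this class exists.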
## Resource Overhead
| Utility | Pods | CPU Request | Memory Request |
|---------|------|-------------|----------------|
| Spegel | 1 per node (DaemonSet) | — | — |
| Descheduler | 1 | — | — |
| Reloader | 1 | — | — |
| CSI-NFS | 1 controller + DaemonSet | — | — |
| **Total** | ~8-12 pods | Minimal | Minimal |
All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
## Links
* Related to [ADR-0026](0026-storage-strategy.md) (Longhorn + NFS storage strategy)
* Related to [ADR-0003](0003-bare-metal-kubernetes.md) (Talos container runtime / containerd)
* [Spegel](https://github.com/spegel-org/spegel) · [Descheduler](https://sigs.k8s.io/descheduler) · [Reloader](https://github.com/stakater/Reloader) · [CSI-NFS](https://github.com/kubernetes-csi/csi-driver-nfs)