# Cluster Utilities and Optimization

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead

## Context and Problem Statement

A Kubernetes cluster running diverse workloads benefits from several operational utilities: image caching to reduce pull times, workload rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small individually, but collectively they significantly improve cluster operations.

How do we manage these cross-cutting cluster utilities consistently?

## Decision Drivers

* Reduce container image pull latency across nodes
* Automatically rebalance workloads for even resource utilization
* Eliminate manual pod restarts when secrets/configmaps change
* Provide shared NFS storage class for ReadWriteMany workloads
* Minimal resource overhead per utility

## Decision Outcome

Deploy four cluster utilities, each solving a distinct operational concern with minimal footprint: Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass).

## Components

### Spegel: Peer-to-Peer Image Registry Mirror

Spegel distributes container images between nodes, so pulling an image already present on _any_ node avoids hitting the external registry.

| | |
|---|---|
| **Chart** | `spegel` OCI chart v0.3.0 |
| **Namespace** | `spegel` |
| **Port** | 29999 |
| **Mode** | P2P mirror (DaemonSet, one pod per node) |

**Mirrored Registries:**

- `docker.io`, `ghcr.io`, `quay.io`, `gcr.io`
- `registry.k8s.io`, `mcr.microsoft.com`
- `git.daviestechlabs.io` (Gitea), `public.ecr.aws`

Spegel registers as a containerd mirror, intercepting pulls before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that would otherwise be pulled repeatedly.
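
The pull path described above can be pictured as a small resolution function. This is a minimal sketch under stated assumptions, not Spegel's or containerd's actual code: the names `registry_of` and `pull_endpoints`, the `http://localhost:29999` URL form, and the `upstream:` labels are hypothetical, and the registry-parsing rule is simplified. Only the port and the registry list come from this ADR.

```python
# Registries listed under "Mirrored Registries" in this ADR.
MIRRORED_REGISTRIES = {
    "docker.io", "ghcr.io", "quay.io", "gcr.io",
    "registry.k8s.io", "mcr.microsoft.com",
    "git.daviestechlabs.io", "public.ecr.aws",
}

# Port taken from the Spegel table; the localhost URL form is an assumption.
LOCAL_MIRROR = "http://localhost:29999"


def registry_of(image: str) -> str:
    """Return the registry host of an image reference.

    Simplified rule: the first path component counts as a registry if it
    contains a dot or a colon, or is "localhost"; otherwise docker.io.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def pull_endpoints(image: str) -> list[str]:
    """Ordered endpoints to try: local mirror first (if mirrored), then upstream."""
    registry = registry_of(image)
    upstream = f"upstream:{registry}"  # hypothetical label, not a real URL
    if registry in MIRRORED_REGISTRIES:
        return [LOCAL_MIRROR, upstream]
    return [upstream]


if __name__ == "__main__":
    print(pull_endpoints("ghcr.io/org/model:1.0"))
    print(pull_endpoints("example.com/app"))
```

The point of the ordering is that the upstream registry stays in the list as a fallback, so an image that no node holds yet is still pulled from its origin.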

### Descheduler: Workload Rebalancing

The descheduler evicts pods so that the scheduler can place them again on less-utilized nodes.

| | |
|---|---|
| **Chart** | `descheduler` v0.33.0 |
| **Namespace** | `descheduler` |
| **Mode** | Deployment (continuous) |
| **Strategy** | `LowNodeUtilization` |

**Excluded Namespaces:** `ai-ml`, `kuberay`, `gitea`

AI/ML and Gitea namespaces are excluded because GPU workloads and git repositories should not be disrupted by rebalancing.

### Reloader: Automatic Config Reload

Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.

| | |
|---|---|
| **Chart** | `reloader` v2.2.7 |
| **Namespace** | `reloader` |
| **Monitoring** | PodMonitor enabled |
| **Security** | Read-only root filesystem |

This eliminates the need for a manual `kubectl rollout restart` after Vault secret rotations or config changes.

### CSI-NFS: NFS StorageClass

Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.

| | |
|---|---|
| **Chart** | `csi-driver-nfs` v4.13.0 |
| **Namespace** | `csi-nfs` |
| **StorageClass** | `nfs-slow` |
| **NFS Server** | `candlekeep` → `/kubernetes` |
| **NFS Version** | 4.1, `nconnect=16` |

`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` option enables 16 parallel NFS connections per mount for improved throughput.

## Resource Overhead

| Utility | Pods | CPU Request | Memory Request |
|---------|------|-------------|----------------|
| Spegel | 1 per node (DaemonSet) | not specified | not specified |
| Descheduler | 1 | not specified | not specified |
| Reloader | 1 | not specified | not specified |
| CSI-NFS | 1 controller + DaemonSet | not specified | not specified |
| **Total** | ~8-12 pods | Minimal | Minimal |

All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
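
The pod total in the table follows from the per-utility counts, which can be checked with a short calculation. This is a sketch only: the function name `utility_pod_count` is hypothetical, it assumes both DaemonSets run on every node, and the node counts used below are illustrative because this ADR does not state the cluster size.

```python
def utility_pod_count(nodes: int) -> int:
    """Pods added by the four utilities on a cluster with `nodes` nodes.

    Assumes each DaemonSet schedules exactly one pod on every node.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    spegel = nodes        # DaemonSet: 1 per node
    descheduler = 1       # single Deployment pod
    reloader = 1          # single Deployment pod
    csi_nfs = 1 + nodes   # 1 controller + node DaemonSet
    return spegel + descheduler + reloader + csi_nfs


if __name__ == "__main__":
    for n in (3, 4):
        print(n, "nodes ->", utility_pod_count(n), "pods")
```

Under these assumptions the count is `2 * nodes + 3`, so an illustrative 3-node or 4-node cluster gives 9 or 11 pods, which is consistent with the ~8-12 estimate.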

## Links

* Related to [ADR-0026](0026-storage-strategy.md) (Longhorn + NFS storage strategy)
* Related to [ADR-0003](0003-bare-metal-kubernetes.md) (Talos container runtime / containerd)
* [Spegel](https://github.com/spegel-org/spegel) · [Descheduler](https://sigs.k8s.io/descheduler) · [Reloader](https://github.com/stakater/Reloader) · [CSI-NFS](https://github.com/kubernetes-csi/csi-driver-nfs)