Cluster Utilities and Optimization
- Status: accepted
- Date: 2026-02-09
- Deciders: Billy
- Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead
Context and Problem Statement
A Kubernetes cluster running diverse workloads benefits from several operational utilities — image caching to reduce pull times, workload rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small individually but collectively they significantly improve cluster operations.
How do we manage these cross-cutting cluster utilities consistently?
Decision Drivers
- Reduce container image pull latency across nodes
- Automatically rebalance workloads for even resource utilization
- Eliminate manual pod restarts when secrets/configmaps change
- Provide shared NFS storage class for ReadWriteMany workloads
- Minimal resource overhead per utility
Decision Outcome
Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.
Components
Spegel — Peer-to-Peer Image Registry Mirror
Spegel distributes container images between nodes, so pulling an image already present on any node avoids hitting the external registry.
| Setting | Value |
|---|---|
| Chart | spegel OCI chart v0.3.0 |
| Namespace | spegel |
| Port | 29999 |
| Mode | P2P mirror (DaemonSet, one pod per node) |
Mirrored Registries:
- docker.io, ghcr.io, quay.io, gcr.io
- registry.k8s.io, mcr.microsoft.com
- git.daviestechlabs.io (Gitea), public.ecr.aws
Spegel registers as a containerd mirror, intercepting pulls before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that each node would otherwise pull separately from the external registry.
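The pull path described above can be modelled in a few lines. This is a simplified sketch of the idea, not Spegel's actual implementation: the node names, image references, and the `resolve_pull` and `registry_of` helpers are hypothetical, and real peer discovery is considerably more involved.

```python
from typing import Dict, Set

# Registries listed under "Mirrored Registries" in this ADR.
MIRRORED = {
    "docker.io", "ghcr.io", "quay.io", "gcr.io",
    "registry.k8s.io", "mcr.microsoft.com",
    "git.daviestechlabs.io", "public.ecr.aws",
}


def registry_of(image: str) -> str:
    """Return the registry host of an image reference (Docker Hub if none is given)."""
    first, _, rest = image.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def resolve_pull(image: str, node: str, cache: Dict[str, Set[str]]) -> str:
    """Decide where a pull on `node` is served from in this simplified model."""
    if image in cache.get(node, set()):
        return "local"  # image is already present on the pulling node
    if registry_of(image) in MIRRORED:
        for peer in sorted(cache):
            if peer != node and image in cache[peer]:
                return f"peer:{peer}"  # another node already holds the image
    return "upstream"  # nobody has it, or the registry is not mirrored


# Hypothetical cluster state, for illustration only.
cache = {
    "node-a": {"ghcr.io/example/model-server:1.0"},
    "node-b": set(),
}
print(resolve_pull("ghcr.io/example/model-server:1.0", "node-b", cache))
print(resolve_pull("ghcr.io/example/other:2.0", "node-b", cache))
```

The first call is served by a peer because node-a already holds the image; the second falls through to the external registry.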
Descheduler — Workload Rebalancing
The descheduler evicts pods from overloaded nodes so that the scheduler can place them again on less-utilized ones.
| Setting | Value |
|---|---|
| Chart | descheduler v0.33.0 |
| Namespace | descheduler |
| Mode | Deployment (continuous) |
| Strategy | LowNodeUtilization |
Excluded Namespaces: ai-ml, kuberay, gitea
The AI/ML namespaces (ai-ml, kuberay) and the gitea namespace are excluded because GPU workloads and git repositories should not be disrupted by rebalancing.
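To make the LowNodeUtilization idea concrete, the sketch below classifies nodes and picks eviction candidates while honoring the excluded namespaces. The threshold values, node names, and pod names are assumptions chosen for illustration (this ADR does not state the configured thresholds), and the logic is a simplification of what the descheduler does.

```python
from typing import Dict, List

EXCLUDED_NAMESPACES = {"ai-ml", "kuberay", "gitea"}  # from this ADR

# Hypothetical thresholds in percent of allocatable; not taken from the ADR.
THRESHOLDS = {"cpu": 20, "memory": 20, "pods": 20}          # below all: underutilized
TARGET_THRESHOLDS = {"cpu": 50, "memory": 50, "pods": 50}   # above any: overutilized


def classify(usage: Dict[str, float]) -> str:
    """Classify one node by its resource usage percentages."""
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"
    return "balanced"


def eviction_candidates(nodes: Dict[str, Dict[str, float]],
                        pods: List[Dict[str, str]]) -> List[str]:
    """Return pods that may be evicted from overutilized nodes."""
    states = {name: classify(usage) for name, usage in nodes.items()}
    if "underutilized" not in states.values():
        return []  # nowhere to move pods to, so evict nothing
    return [
        p["name"] for p in pods
        if states[p["node"]] == "overutilized"
        and p["namespace"] not in EXCLUDED_NAMESPACES
    ]


nodes = {
    "node-a": {"cpu": 80, "memory": 60, "pods": 40},
    "node-b": {"cpu": 10, "memory": 15, "pods": 5},
}
pods = [
    {"name": "web", "namespace": "default", "node": "node-a"},
    {"name": "trainer", "namespace": "ai-ml", "node": "node-a"},
    {"name": "idle", "namespace": "default", "node": "node-b"},
]
print(eviction_candidates(nodes, pods))
```

Only `web` is selected: `trainer` sits on the same overutilized node but lives in an excluded namespace, and `idle` runs on the underutilized node.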
Reloader — Automatic Config Reload
Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.
| Setting | Value |
|---|---|
| Chart | reloader v2.2.7 |
| Namespace | reloader |
| Monitoring | PodMonitor enabled |
| Security | Read-only root filesystem |
This eliminates the manual kubectl rollout restart otherwise needed after Vault secret rotations or config changes.
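The watch-and-restart behaviour can be illustrated with hash-based change detection. This is a minimal sketch of the concept under stated assumptions, not Reloader's code: the `config_hash` and `workloads_to_restart` helpers, the resource names, and the reference map are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Set


def config_hash(data: Dict[str, str]) -> str:
    """Stable hash of a Secret's or ConfigMap's key/value data."""
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def workloads_to_restart(old: Dict[str, Dict[str, str]],
                         new: Dict[str, Dict[str, str]],
                         references: Dict[str, Set[str]]) -> List[str]:
    """Return workloads that reference any config object whose data changed."""
    changed = {
        name for name in new
        if name in old and config_hash(new[name]) != config_hash(old[name])
    }
    return sorted(w for w, refs in references.items() if refs & changed)


# Hypothetical state before and after a secret rotation.
old = {"db-credentials": {"password": "old"}, "app-config": {"LOG_LEVEL": "info"}}
new = {"db-credentials": {"password": "new"}, "app-config": {"LOG_LEVEL": "info"}}
references = {
    "deployment/api": {"db-credentials", "app-config"},
    "statefulset/cache": {"app-config"},
}
print(workloads_to_restart(old, new, references))
```

Only `deployment/api` is returned, because it is the only workload referencing the rotated `db-credentials` object.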
CSI-NFS — NFS StorageClass
Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.
| Setting | Value |
|---|---|
| Chart | csi-driver-nfs v4.13.0 |
| Namespace | csi-nfs |
| StorageClass | nfs-slow |
| NFS Server | candlekeep → /kubernetes |
| NFS Version | 4.1, nconnect=16 |
nfs-slow provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). The class is named "slow" relative to Longhorn SSD-backed storage, not in absolute terms. The nconnect=16 option lets the client open up to 16 parallel NFS connections per mount for improved throughput.
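As a rough illustration of how the settings above map onto a StorageClass, the sketch below builds the manifest as a Python dictionary and prints it as JSON, which kubectl also accepts. The `nfs_storage_class` helper is hypothetical, the server value is copied as written in this ADR (the real address may be a fully qualified name or IP), and fields the ADR does not mention, such as the reclaim policy, are deliberately left out.

```python
import json
from typing import Any, Dict


def nfs_storage_class(name: str, server: str, share: str) -> Dict[str, Any]:
    """Build a StorageClass manifest for the NFS CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "nfs.csi.k8s.io",  # provisioner name of csi-driver-nfs
        "parameters": {"server": server, "share": share},
        # NFS 4.1 with 16 parallel connections, as listed in the table above
        "mountOptions": ["nfsvers=4.1", "nconnect=16"],
    }


manifest = nfs_storage_class("nfs-slow", "candlekeep", "/kubernetes")
print(json.dumps(manifest, indent=2))
```

A PersistentVolumeClaim would then request this class by setting storageClassName to nfs-slow with the ReadWriteMany access mode.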
Resource Overhead
| Utility | Pods | CPU Request | Memory Request |
|---|---|---|---|
| Spegel | 1 per node (DaemonSet) | — | — |
| Descheduler | 1 | — | — |
| Reloader | 1 | — | — |
| CSI-NFS | 1 controller + DaemonSet | — | — |
| Total | ~8-12 pods | Minimal | Minimal |
All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
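The pod total in the table depends on node count, since two of the utilities run a DaemonSet. A small sketch of that arithmetic follows; it assumes every DaemonSet schedules one pod on every node, and the node counts used are illustrative rather than taken from this ADR.

```python
def utility_pod_count(nodes: int) -> int:
    """Total utility pods for a cluster with `nodes` nodes."""
    spegel = nodes            # DaemonSet, one pod per node
    descheduler = 1           # single Deployment replica
    reloader = 1              # single Deployment replica
    csi_nfs = 1 + nodes       # one controller plus a node DaemonSet
    return spegel + descheduler + reloader + csi_nfs


for n in (3, 4):
    print(f"{n} nodes -> {utility_pod_count(n)} pods")
```

Under these assumptions, three nodes give 9 pods and four nodes give 11, which falls inside the ~8-12 range in the table.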