Cluster Utilities and Optimization

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability while reducing operational overhead

Context and Problem Statement

A Kubernetes cluster running diverse workloads benefits from several operational utilities: image caching to reduce pull times, workload rebalancing for even utilization, automatic Secret/ConfigMap reloading, and shared storage provisioning. Each is small on its own, but together they remove recurring manual work from day-to-day cluster operations.

How do we manage these cross-cutting cluster utilities consistently?

Decision Drivers

  • Reduce container image pull latency across nodes
  • Automatically rebalance workloads for even resource utilization
  • Eliminate manual pod restarts when secrets/configmaps change
  • Provide shared NFS storage class for ReadWriteMany workloads
  • Minimal resource overhead per utility

Decision Outcome

Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.

Components

Spegel — Peer-to-Peer Image Registry Mirror

Spegel distributes container images between nodes, so pulling an image already present on any node avoids hitting the external registry.

| Setting | Value |
|---|---|
| Chart | spegel OCI chart v0.3.0 |
| Namespace | spegel |
| Port | 29999 |
| Mode | P2P mirror (DaemonSet, one pod per node) |

Mirrored Registries:

  • docker.io, ghcr.io, quay.io, gcr.io
  • registry.k8s.io, mcr.microsoft.com
  • git.daviestechlabs.io (Gitea), public.ecr.aws

Spegel registers as a containerd mirror, intercepting pulls before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that would otherwise be pulled from the upstream registry once per node.
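As a rough illustration of the mirror-first behaviour described above, the following Python sketch decides whether a pull could be served by a peer. It is not Spegel's code: `registry_of` and `pull_source` are hypothetical helper names, the reference parsing assumes the usual Docker convention that a missing registry host means docker.io, and the `cached_on_peer` flag stands in for Spegel's real peer discovery.

```python
# Illustrative only: a simplified model of mirror-first pulls, not Spegel's code.
MIRRORED_REGISTRIES = {
    "docker.io", "ghcr.io", "quay.io", "gcr.io",
    "registry.k8s.io", "mcr.microsoft.com",
    "git.daviestechlabs.io", "public.ecr.aws",
}


def registry_of(image: str) -> str:
    """Return the registry host of an image reference (Docker naming rules)."""
    first, sep, _ = image.partition("/")
    # The first path component is a registry only if it looks like a host
    # (contains "." or ":", or is "localhost"); otherwise docker.io is implied.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def pull_source(image: str, cached_on_peer: bool) -> str:
    """Decide where a pull is served from under a mirror-first policy."""
    if registry_of(image) in MIRRORED_REGISTRIES and cached_on_peer:
        return "peer"
    # Unmirrored registry, or no peer holds the image: fall back to upstream.
    return "upstream"
```

For example, `pull_source("ghcr.io/org/model:1", cached_on_peer=True)` returns `"peer"`, while an image from a registry outside the mirrored list always falls through to `"upstream"`.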

Descheduler — Workload Rebalancing

The descheduler evicts pods from overutilized nodes so that the scheduler can place their replacements on less loaded nodes.

| Setting | Value |
|---|---|
| Chart | descheduler v0.33.0 |
| Namespace | descheduler |
| Mode | Deployment (continuous) |
| Strategy | LowNodeUtilization |

Excluded Namespaces: ai-ml, kuberay, gitea

The AI/ML namespaces (ai-ml, kuberay) and the gitea namespace are excluded because GPU workloads and Git repositories should not be disrupted by rebalancing.
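The selection logic behind a LowNodeUtilization-style policy with namespace exclusions can be sketched in Python as follows. This is a simplified model, not the descheduler's implementation: the threshold values are hypothetical (the ADR does not state them), utilization is assumed to be expressed as a percentage of allocatable resources, and the real descheduler applies further checks (pod priority, PodDisruptionBudgets, stopping once a node drops below the target) that are omitted here.

```python
from dataclasses import dataclass

EXCLUDED_NAMESPACES = {"ai-ml", "kuberay", "gitea"}

# Hypothetical thresholds (percent of allocatable); the ADR does not state them.
THRESHOLDS = {"cpu": 20, "memory": 20, "pods": 20}
TARGET_THRESHOLDS = {"cpu": 50, "memory": 50, "pods": 50}


@dataclass
class Pod:
    name: str
    namespace: str


def is_underutilized(usage: dict) -> bool:
    # Underutilized: every resource is below its threshold.
    return all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS)


def is_overutilized(usage: dict) -> bool:
    # Overutilized: any resource is above its target threshold.
    return any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS)


def eviction_candidates(nodes: dict) -> list:
    """nodes maps node name -> (usage dict, list of Pod)."""
    if not any(is_underutilized(usage) for usage, _ in nodes.values()):
        return []  # no underutilized node to move pods to, so evict nothing
    return [
        pod.name
        for usage, pods in nodes.values()
        if is_overutilized(usage)
        for pod in pods
        if pod.namespace not in EXCLUDED_NAMESPACES
    ]
```

With one node at 80% CPU running pods in `default` and `ai-ml`, and a second node at 10% across the board, only the `default` pod on the busy node is returned as a candidate.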

Reloader — Automatic Config Reload

Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.

| Setting | Value |
|---|---|
| Chart | reloader v2.2.7 |
| Namespace | reloader |
| Monitoring | PodMonitor enabled |
| Security | Read-only root filesystem |

This eliminates the manual kubectl rollout restart otherwise needed after Vault secret rotations or config changes.
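One common way to turn a config change into a rolling restart is to store a digest of the Secret or ConfigMap data in the pod template, so that a changed digest changes the template and the workload controller rolls the pods. The Python sketch below shows that idea only. It is an assumption-laden illustration: the annotation key `example.local/config-digest` and the helper names are hypothetical, and Reloader's actual mechanism and naming may differ.

```python
import hashlib
import json

# Hypothetical annotation key used only for this illustration.
DIGEST_KEY = "example.local/config-digest"


def config_digest(data: dict) -> str:
    """Stable SHA-256 over a Secret's or ConfigMap's key/value data."""
    # sort_keys makes the digest independent of key ordering.
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def needs_restart(template_annotations: dict, data: dict) -> bool:
    """True when the digest recorded on the pod template is stale."""
    return template_annotations.get(DIGEST_KEY) != config_digest(data)


def stamp(template_annotations: dict, data: dict) -> dict:
    """Return new annotations carrying the current digest.

    Writing this back to the pod template changes the template, which is
    what causes a Deployment or StatefulSet to roll its pods.
    """
    return {**template_annotations, DIGEST_KEY: config_digest(data)}
```

After `stamp` has recorded the digest, `needs_restart` stays false until the underlying data changes, at which point a new stamp triggers the rollout.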

CSI-NFS — NFS StorageClass

Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.

| Setting | Value |
|---|---|
| Chart | csi-driver-nfs v4.13.0 |
| Namespace | csi-nfs |
| StorageClass | nfs-slow |
| NFS Server | candlekeep/kubernetes |
| NFS Version | 4.1, nconnect=16 |

nfs-slow provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The nconnect=16 mount option lets each client spread NFS traffic over up to 16 TCP connections to the server, which improves throughput.
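For orientation, a StorageClass of this shape could be generated as below. This is a sketch under assumptions, not the deployed manifest: it reads "candlekeep/kubernetes" as server `candlekeep` with export path `/kubernetes`, the helper name `nfs_storage_class` is hypothetical, and fields such as the reclaim policy are left at their defaults because the ADR does not state them. The provisioner name `nfs.csi.k8s.io` and the `server`/`share` parameters are those used by csi-driver-nfs.

```python
import json


def nfs_storage_class(name: str, server: str, share: str, nconnect: int = 16) -> dict:
    """Build a StorageClass manifest for the csi-driver-nfs provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "nfs.csi.k8s.io",
        "parameters": {"server": server, "share": share},
        # NFS 4.1 with multiple TCP connections, as listed in the table above.
        "mountOptions": ["nfsvers=4.1", f"nconnect={nconnect}"],
    }


# Assumed split of "candlekeep/kubernetes" into server and export path.
manifest = nfs_storage_class("nfs-slow", "candlekeep", "/kubernetes")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied directly, since kubectl accepts JSON manifests as well as YAML.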

Resource Overhead

| Utility | Pods | CPU Request | Memory Request |
|---|---|---|---|
| Spegel | 1 per node (DaemonSet) | | |
| Descheduler | 1 | | |
| Reloader | 1 | | |
| CSI-NFS | 1 controller + DaemonSet | | |
| Total | ~8-12 pods | Minimal | Minimal |

All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
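The pod total follows from the node count, which the ADR does not state. A small worked calculation, assuming both DaemonSets schedule exactly one pod on every node and the other components run a single replica:

```python
def utility_pod_count(nodes: int) -> int:
    """Total utility pods for a cluster with the given number of nodes."""
    spegel = nodes        # DaemonSet: one pod per node
    descheduler = 1
    reloader = 1
    csi_nfs = 1 + nodes   # one controller plus a per-node DaemonSet
    return spegel + descheduler + reloader + csi_nfs


for n in (3, 4, 5):
    print(n, utility_pod_count(n))
# 3 9
# 4 11
# 5 13
```

Under these assumptions the count is 2N + 3 for N nodes, so the stated range of roughly 8-12 pods corresponds to about three or four nodes running the DaemonSets.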