docs: add ADRs 0043-0053 covering remaining architecture gaps

New ADRs:
- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).

2026-02-09 18:37:14 -05:00

Cluster Utilities and Optimization

Status: accepted
Date: 2026-02-09
Deciders: Billy
Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead

Context and Problem Statement

A Kubernetes cluster running diverse workloads benefits from several operational utilities: image caching to reduce pull times, workload rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small on its own, but together they cut image pull latency, even out node utilization, remove manual restarts, and give workloads shared storage.

How do we manage these cross-cutting cluster utilities consistently?

Decision Drivers

- Reduce container image pull latency across nodes
- Automatically rebalance workloads for even resource utilization
- Eliminate manual pod restarts when secrets/configmaps change
- Provide shared NFS storage class for ReadWriteMany workloads
- Minimal resource overhead per utility

Decision Outcome

Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.

Components

Spegel — Peer-to-Peer Image Registry Mirror

Spegel distributes container images between nodes, so pulling an image already present on any node avoids hitting the external registry.

| Setting | Value |
|---------|-------|
| Chart | `spegel` OCI chart v0.3.0 |
| Namespace | `spegel` |
| Port | 29999 |
| Mode | P2P mirror (DaemonSet, one pod per node) |

Mirrored Registries:

docker.io, ghcr.io, quay.io, gcr.io
registry.k8s.io, mcr.microsoft.com
git.daviestechlabs.io (Gitea), public.ecr.aws

Spegel registers as a containerd mirror, intercepting pulls before they reach the internet. This is especially valuable for large ML model images (5-20 GB) that would otherwise be pulled from the external registry repeatedly.
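The mirror-first behaviour described above can be sketched as follows. This is an illustration of the lookup order only, not Spegel's or containerd's actual code: the function name `pull_endpoints`, the `http://<node_ip>:<port>` address form, and the assumption that image references are fully qualified are all hypothetical. Only the port and the registry list come from this ADR.

```python
SPEGEL_PORT = 29999  # port listed in this ADR

# Registries listed under "Mirrored Registries" in this ADR.
MIRRORED_REGISTRIES = {
    "docker.io", "ghcr.io", "quay.io", "gcr.io",
    "registry.k8s.io", "mcr.microsoft.com",
    "git.daviestechlabs.io", "public.ecr.aws",
}


def pull_endpoints(image_ref: str, node_ip: str = "127.0.0.1") -> list[str]:
    """Return the endpoints a pull would try, in order (illustrative only).

    Assumes a fully qualified reference such as "ghcr.io/org/image:tag";
    short names like "nginx" are not handled in this sketch.
    """
    registry = image_ref.split("/", 1)[0]
    upstream = f"https://{registry}"
    if registry in MIRRORED_REGISTRIES:
        # Try the node-local mirror first; the upstream registry remains
        # the fallback when no peer in the cluster has the image.
        return [f"http://{node_ip}:{SPEGEL_PORT}", upstream]
    return [upstream]
```

For a mirrored registry the local endpoint comes first, so the external registry is contacted only on a cache miss. A registry outside the list goes straight to upstream.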

Descheduler — Workload Rebalancing

The descheduler evicts pods from overutilized nodes so that the scheduler can place them again on underutilized ones.

| Setting | Value |
|---------|-------|
| Chart | `descheduler` v0.33.0 |
| Namespace | `descheduler` |
| Mode | Deployment (continuous) |
| Strategy | `LowNodeUtilization` |

Excluded Namespaces: ai-ml, kuberay, gitea

The AI/ML namespaces (`ai-ml`, `kuberay`) and `gitea` are excluded because GPU workloads and git repositories should not be disrupted by rebalancing.
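A simplified sketch of how a `LowNodeUtilization`-style pass combines with the namespace exclusions might look like this. It is a toy model under stated assumptions: the thresholds are hypothetical (this ADR does not state the configured values), utilization is reduced to a single percentage per node, and the real descheduler evaluates several resources and more eviction rules.

```python
EXCLUDED_NAMESPACES = {"ai-ml", "kuberay", "gitea"}  # from this ADR

# Hypothetical thresholds in percent; the real values are not given here.
UNDER_THRESHOLD = 20.0
OVER_THRESHOLD = 50.0


def eviction_candidates(node_usage, pods):
    """Pick pods that a rebalancing pass could evict.

    node_usage: dict mapping node name to utilization percent.
    pods: list of (pod_name, namespace, node_name) tuples.
    """
    under = [n for n, u in node_usage.items() if u < UNDER_THRESHOLD]
    over = {n for n, u in node_usage.items() if u > OVER_THRESHOLD}
    if not under:
        # Without an underutilized node there is nowhere better to place
        # evicted pods, so nothing is evicted.
        return []
    return [
        name
        for name, namespace, node in pods
        if node in over and namespace not in EXCLUDED_NAMESPACES
    ]
```

In this model a pod in `ai-ml` on a busy node is never selected, while a pod in an ordinary namespace on the same node is.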

Reloader — Automatic Config Reload

Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.

| Setting | Value |
|---------|-------|
| Chart | `reloader` v2.2.7 |
| Namespace | `reloader` |
| Monitoring | PodMonitor enabled |
| Security | Read-only root filesystem |

This eliminates manual `kubectl rollout restart` runs after Vault secret rotations or config changes.
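The underlying idea, detecting that a Secret or ConfigMap changed and deciding that a restart is needed, can be sketched with a content hash. This is a conceptual illustration only, not Reloader's implementation: the function names and the choice of SHA-256 over a JSON serialization are assumptions made for the example.

```python
import hashlib
import json


def config_hash(data: dict[str, str]) -> str:
    """Hash a Secret/ConfigMap payload in a key-order-independent way."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_restart(last_hash: str, current_data: dict[str, str]) -> bool:
    """True when the current data no longer matches the recorded hash."""
    return config_hash(current_data) != last_hash
```

A controller built on this idea would record the hash when a workload is rolled out, recompute it on every change event, and trigger a rolling restart only when the two differ.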

CSI-NFS — NFS StorageClass

Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.

| Setting | Value |
|---------|-------|
| Chart | `csi-driver-nfs` v4.13.0 |
| Namespace | `csi-nfs` |
| StorageClass | `nfs-slow` |
| NFS Server | `candlekeep` → `/kubernetes` |
| NFS Version | 4.1, `nconnect=16` |

`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` mount option lets the client spread NFS traffic over up to 16 TCP connections to the server instead of one, which improves throughput.
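To show how the settings above fit together, here is a minimal sketch that assembles a StorageClass manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The helper name `nfs_storage_class` is hypothetical, and fields such as the reclaim policy or volume binding mode are omitted because this ADR does not state them, so the manifest actually deployed may differ.

```python
import json


def nfs_storage_class(name, server, share, nfs_version, nconnect):
    """Build a StorageClass manifest for the NFS CSI driver (sketch)."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "nfs.csi.k8s.io",  # provisioner name of csi-driver-nfs
        "parameters": {"server": server, "share": share},
        # Mount options passed through to the NFS client on each node.
        "mountOptions": [f"nfsvers={nfs_version}", f"nconnect={nconnect}"],
    }


# Values taken from this ADR.
manifest = nfs_storage_class("nfs-slow", "candlekeep", "/kubernetes", "4.1", 16)
print(json.dumps(manifest, indent=2))
```

A PersistentVolumeClaim would then request `storageClassName: nfs-slow` with the `ReadWriteMany` access mode.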

Resource Overhead

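| Utility | Pods | CPU Request | Memory Request |
|---------|------|-------------|----------------|
| Spegel | 1 per node (DaemonSet) | not specified | not specified |
| Descheduler | 1 | not specified | not specified |
| Reloader | 1 | not specified | not specified |
| CSI-NFS | 1 controller + DaemonSet | not specified | not specified |
| Total | ~8-12 pods | Minimal | Minimal |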

All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
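The pod total in the table above follows from simple arithmetic, sketched here. It assumes that both DaemonSets (Spegel and the CSI-NFS node plugin) schedule one pod on every node and that the other components run a single replica each. Neither assumption is stated in this ADR.

```python
def utility_pod_count(nodes: int) -> int:
    """Estimate the number of utility pods for a cluster of `nodes` nodes."""
    per_node = 2    # Spegel + CSI-NFS node plugin (one pod each per node)
    singletons = 3  # Descheduler, Reloader, CSI-NFS controller
    return per_node * nodes + singletons
```

Under these assumptions, three nodes give 9 pods and four nodes give 11, both within the stated range of roughly 8-12.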
