homelab-design/decisions/0052-cluster-utilities-optimization.md

# Cluster Utilities and Optimization

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy supporting utilities that improve cluster efficiency and reliability and reduce operational overhead

## Context and Problem Statement

A Kubernetes cluster running diverse workloads benefits from several operational utilities: image caching to reduce pull times, workload rebalancing for efficiency, automatic Secret/ConfigMap reloading, and shared storage provisioning. Each is small on its own, but together they noticeably improve day-to-day cluster operations.

How do we manage these cross-cutting cluster utilities consistently?

## Decision Drivers

* Reduce container image pull latency across nodes
* Automatically rebalance workloads for even resource utilization
* Eliminate manual pod restarts when secrets/configmaps change
* Provide shared NFS storage class for ReadWriteMany workloads
* Minimal resource overhead per utility

## Decision Outcome

Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint.

## Components

### Spegel — Peer-to-Peer Image Registry Mirror

Spegel distributes container images between nodes, so pulling an image already present on _any_ node avoids hitting the external registry.

| | |
|---|---|
| **Chart** | `spegel` OCI chart v0.3.0 |
| **Namespace** | `spegel` |
| **Port** | 29999 |
| **Mode** | P2P mirror (DaemonSet, one pod per node) |

**Mirrored Registries:**
- `docker.io`, `ghcr.io`, `quay.io`, `gcr.io`
- `registry.k8s.io`, `mcr.microsoft.com`
- `git.daviestechlabs.io` (Gitea), `public.ecr.aws`

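Spegel registers as a containerd mirror, so pulls are tried against peer nodes first and reach the upstream registry only when no node in the cluster already has the image. This is especially valuable for large ML model images (5-20 GB) that would otherwise be downloaded from the internet separately by each node.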
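A Helm values fragment along the following lines would express the settings in the table above. This is a sketch, not the deployed configuration: the key names (`spegel.mirroredRegistries`, `spegel.containerdRegistryConfigPath`, `service.registry.hostPort`) and the Talos containerd path are assumptions based on the upstream chart layout and should be checked against the `values.yaml` shipped with chart v0.3.0.

```yaml
# Hypothetical values for the spegel chart; verify every key against the
# chart's own values.yaml before use.
spegel:
  # Assumed key name for the list of registries to mirror.
  mirroredRegistries:
    - https://docker.io
    - https://ghcr.io
    - https://quay.io
    - https://gcr.io
    - https://registry.k8s.io
    - https://mcr.microsoft.com
    - https://git.daviestechlabs.io
    - https://public.ecr.aws
  # Assumed Talos locations for the containerd socket and registry config.
  containerdSock: /run/containerd/containerd.sock
  containerdRegistryConfigPath: /etc/cri/conf.d/hosts
service:
  registry:
    # Port from the table above, exposed on each node for containerd.
    hostPort: 29999
```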

### Descheduler — Workload Rebalancing

The descheduler evicts pods from overutilized nodes so that the scheduler can place them on underutilized ones.

| | |
|---|---|
| **Chart** | `descheduler` v0.33.0 |
| **Namespace** | `descheduler` |
| **Mode** | Deployment (continuous) |
| **Strategy** | `LowNodeUtilization` |

**Excluded Namespaces:** `ai-ml`, `kuberay`, `gitea`

The AI/ML namespaces (`ai-ml`, `kuberay`) and `gitea` are excluded because GPU workloads and git repositories should not be disrupted by rebalancing.
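The strategy and exclusions could be written as a descheduler policy roughly like the one below. It is a minimal sketch using the `descheduler/v1alpha2` policy format; the threshold percentages and the profile name are illustrative assumptions, since this ADR does not record the values actually deployed.

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default              # hypothetical profile name
    pluginConfig:
      - name: LowNodeUtilization
        args:
          # Nodes below ALL of these percentages count as underutilized.
          # Values are placeholders, not the cluster's real settings.
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          # Nodes above ANY of these percentages count as overutilized
          # and become eviction sources.
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          # Pods in these namespaces are never evicted by this plugin.
          evictableNamespaces:
            exclude:
              - ai-ml
              - kuberay
              - gitea
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
```

Eviction only happens when both kinds of node exist at the same time: if nothing is underutilized, there is nowhere better for evicted pods to land, so the plugin does nothing.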

### Reloader — Automatic Config Reload

Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them.

| | |
|---|---|
| **Chart** | `reloader` v2.2.7 |
| **Namespace** | `reloader` |
| **Monitoring** | PodMonitor enabled |
| **Security** | Read-only root filesystem |

This eliminates manual `kubectl rollout restart` runs after Vault secret rotations or config changes.
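One common way to opt a workload in is the `reloader.stakater.com/auto` annotation, shown in the sketch below. Whether this cluster uses annotation-based opt-in or a global auto-reload setting is not stated in this ADR, and the workload name, namespace, image, and Secret name are all hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # hypothetical workload
  namespace: example             # hypothetical namespace
  annotations:
    # Ask Reloader to restart this Deployment whenever any Secret or
    # ConfigMap referenced by its pod template changes.
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:1.0.0   # placeholder image
          envFrom:
            # Environment variables are only read at container start, so a
            # rotated Secret needs a restart to take effect.
            - secretRef:
                name: example-app-secrets    # hypothetical Secret
```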

### CSI-NFS — NFS StorageClass

Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export.

| | |
|---|---|
| **Chart** | `csi-driver-nfs` v4.13.0 |
| **Namespace** | `csi-nfs` |
| **StorageClass** | `nfs-slow` |
| **NFS Server** | `candlekeep` → `/kubernetes` |
| **NFS Version** | 4.1, `nconnect=16` |

`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). It is named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` mount option lets the NFS client spread traffic over up to 16 TCP connections to the server instead of a single one, which improves throughput.
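The table above maps onto a StorageClass and a matching claim roughly as follows. This is a sketch under assumptions: `reclaimPolicy` and `volumeBindingMode` are not recorded in this ADR, the server is written as the bare hostname `candlekeep`, and the claim name, namespace, and size are hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-slow
provisioner: nfs.csi.k8s.io      # provisioner name used by csi-driver-nfs
parameters:
  server: candlekeep             # assumed to resolve from every node
  share: /kubernetes
reclaimPolicy: Delete            # assumption, not stated in this ADR
volumeBindingMode: Immediate     # assumption, not stated in this ADR
mountOptions:
  - nfsvers=4.1
  - nconnect=16
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-media             # hypothetical claim
  namespace: example             # hypothetical namespace
spec:
  storageClassName: nfs-slow
  accessModes:
    - ReadWriteMany              # several pods on different nodes may mount it
  resources:
    requests:
      storage: 100Gi             # placeholder size
```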

## Resource Overhead

| Utility | Pods | CPU Request | Memory Request |
|---------|------|-------------|----------------|
| Spegel | 1 per node (DaemonSet) | — | — |
| Descheduler | 1 | — | — |
| Reloader | 1 | — | — |
| CSI-NFS | 1 controller + DaemonSet | — | — |
| **Total** | ~8-12 pods | Minimal | Minimal |

All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.

## Links

* Related to [ADR-0026](0026-storage-strategy.md) (Longhorn + NFS storage strategy)
* Related to [ADR-0003](0003-bare-metal-kubernetes.md) (Talos container runtime / containerd)
* [Spegel](https://github.com/spegel-org/spegel) · [Descheduler](https://sigs.k8s.io/descheduler) · [Reloader](https://github.com/stakater/Reloader) · [CSI-NFS](https://github.com/kubernetes-csi/csi-driver-nfs)