updating to match everything in my homelab.

2026-02-05 16:13:53 -05:00
parent f8787379c5
commit 80fb911e22
30 changed files with 3107 additions and 7 deletions
--- a/decisions/0035-arm64-worker-strategy.md
+++ b/decisions/0035-arm64-worker-strategy.md
@@ -0,0 +1,195 @@
+# ARM64 Raspberry Pi Worker Node Strategy
+
+* Status: accepted
+* Date: 2026-02-05
+* Deciders: Billy
+* Technical Story: Integrate Raspberry Pi nodes into the Kubernetes cluster
+
+## Context and Problem Statement
+
+The homelab cluster includes 5 Raspberry Pi 4/5 nodes (ARM64 architecture) alongside x86_64 servers. These low-power nodes provide:
+- Additional compute capacity for lightweight workloads
+- Physical distribution across the home network
+- Learning platform for multi-architecture Kubernetes
+
+However, ARM64 nodes have constraints:
+- No GPU acceleration
+- Lower CPU/memory than x86_64 servers
+- Some container images lack ARM64 support
+- Limited local storage
+
+How do we effectively integrate ARM64 nodes while avoiding scheduling failures?
+
+## Decision Drivers
+
+* Maximize utilization of ARM64 compute
+* Prevent ARM-incompatible workloads from scheduling
+* Maintain cluster stability
+* Support multi-arch container images
+* Minimize operational overhead
+
+## Considered Options
+
+1. **Node labels + affinity for workload placement**
+2. **Separate ARM64-only namespace**
+3. **Taints to exclude from general scheduling**
+4. **ARM64 nodes for specific workload types only**
+
+## Decision Outcome
+
+Chosen option: **Option 1 + Option 4 hybrid** - Use node labels with affinity rules, and designate ARM64 nodes for specific workload categories.
+
+ARM64 nodes handle:
+- Lightweight control plane components (where multi-arch images exist)
+- Velero node-agent (backup DaemonSet)
+- Node-level monitoring (Prometheus node-exporter)
+- Future: Edge/IoT workloads
+
+### Positive Consequences
+
+* Clear workload segmentation
+* No scheduling failures from arch mismatch
+* Efficient use of low-power nodes
+* Room for future ARM-specific workloads
+* Cost-effective cluster expansion
+
+### Negative Consequences
+
+* Some nodes may be underutilized
+* Must maintain multi-arch image awareness
+* Additional scheduling complexity
+
+## Cluster Composition
+
+| Node | Architecture | Role | Instance Type |
+|------|--------------|------|---------------|
+| bruenor | amd64 | control-plane | - |
+| catti | amd64 | control-plane | - |
+| storm | amd64 | control-plane | - |
+| khelben | amd64 | GPU worker (Strix Halo) | - |
+| elminster | amd64 | GPU worker (NVIDIA) | - |
+| drizzt | amd64 | GPU worker (RDNA2) | - |
+| danilo | amd64 | GPU worker (Intel Arc) | - |
+| regis | amd64 | worker | - |
+| wulfgar | amd64 | worker | - |
+| **durnan** | **arm64** | worker | raspberry-pi |
+| **elaith** | **arm64** | worker | raspberry-pi |
+| **jarlaxle** | **arm64** | worker | raspberry-pi |
+| **mirt** | **arm64** | worker | raspberry-pi |
+| **volo** | **arm64** | worker | raspberry-pi |
+
+## Node Labels
+
+```yaml
+# arch and os are set automatically by the kubelet; the other labels are
+# applied via Talos machine config or kubectl
+labels:
+  kubernetes.io/arch: arm64
+  kubernetes.io/os: linux
+  node.kubernetes.io/instance-type: raspberry-pi
+  kubernetes.io/storage: none  # No Longhorn on Pis
+```
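+
+A minimal sketch of how the custom instance-type label could be set from the Talos side, assuming a worker config patch (the file name in the comment is hypothetical). `machine.nodeLabels` is the Talos machine config field for node labels. Labels under the `kubernetes.io/` prefix, such as `kubernetes.io/storage`, may be rejected by the NodeRestriction admission plugin when they are set with the node's own identity, so that label might need to be applied with kubectl instead.
+
+```yaml
+# Hypothetical patch file, e.g. patches/rpi-worker.yaml
+machine:
+  nodeLabels:
+    # Prefix allowed for kubelet-set labels
+    node.kubernetes.io/instance-type: raspberry-pi
+```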
+
+## Workload Placement
+
+### DaemonSets (Run Everywhere)
+
+These run on all nodes including ARM64:
+
+| DaemonSet | Namespace | Multi-arch |
+|-----------|-----------|------------|
+| velero-node-agent | velero | ✅ |
+| cilium-agent | kube-system | ✅ |
+| node-exporter | observability | ✅ |
+
+### ARM64-Excluded Workloads
+
+These explicitly exclude ARM64 via node affinity:
+
+```yaml
+spec:
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: kubernetes.io/arch
+                operator: In
+                values:
+                  - amd64
+```
+
+| Workload Type | Reason for Exclusion |
+|---------------|----------------------|
+| GPU workloads | No GPU on Pis |
+| Longhorn | Pis are labeled `kubernetes.io/storage: none` |
+| Heavy databases | Insufficient resources |
+| Most HelmReleases | Image compatibility |
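+
+When a workload only needs a single exact-match constraint, `nodeSelector` expresses the same hard requirement as the affinity rule above in fewer lines. This is a sketch of the pod spec fragment only, not a full manifest:
+
+```yaml
+spec:
+  # Equivalent to the required nodeAffinity term for kubernetes.io/arch In [amd64]
+  nodeSelector:
+    kubernetes.io/arch: amd64
+```
+
+Node affinity remains the better fit when several values or operators such as `NotIn` are needed.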
+
+### ARM64-Compatible Light Workloads
+
+Potential future workloads for ARM64 nodes:
+
+| Workload | Use Case |
+|----------|----------|
+| MQTT broker | IoT message routing |
+| Pi-hole | DNS ad blocking |
+| Home Assistant | Home automation |
+| Lightweight proxies | Traffic routing |
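+
+As an illustration of pinning one of these workloads to the Pis, the sketch below schedules an MQTT broker onto ARM64 nodes only. The Deployment name, namespace, image tag, and resource figures are assumptions chosen for the example, not values from this cluster, and broker configuration is omitted.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: mqtt-broker          # hypothetical name
+  namespace: iot             # hypothetical namespace
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: mqtt-broker
+  template:
+    metadata:
+      labels:
+        app: mqtt-broker
+    spec:
+      # Both labels come from the Node Labels section
+      nodeSelector:
+        kubernetes.io/arch: arm64
+        node.kubernetes.io/instance-type: raspberry-pi
+      containers:
+        - name: mosquitto
+          image: eclipse-mosquitto:2   # assumed multi-arch; verify before use
+          ports:
+            - containerPort: 1883
+          resources:
+            # Illustrative values sized well below a Pi's available capacity
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              memory: 256Mi
+```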
+
+## Storage Exclusion
+
+ARM64 nodes are excluded from Longhorn:
+
+```yaml
+# Longhorn Helm values
+defaultSettings:
+  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
+```
+
+The Pis also carry a matching node label:
+```yaml
+kubernetes.io/storage: none
+```
+
+## Resource Constraints
+
+| Node Type | CPU | Memory | Typically Available |
+|-----------|-----|--------|---------------------|
+| Raspberry Pi 4 | 4 cores | 4-8 GB | 3 cores, 3 GB |
+| Raspberry Pi 5 | 4 cores | 8 GB | 3.5 cores, 6 GB |
+
+## Multi-Architecture Image Strategy
+
+For workloads that should run on ARM64:
+
+1. **Use multi-arch base images** (e.g., `alpine`, `debian`)
+2. **Build with Docker buildx**:
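+   ```bash
+   # myimage is a placeholder; tag with your registry and use --push so the multi-platform result is published
+   docker buildx build --platform linux/amd64,linux/arm64 -t myimage:latest --push .
+   ```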
+3. **Verify arch support** before deployment
+
+## Monitoring ARM64 Nodes
+
+```promql
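+# Available memory per node, labeled with the node's architecture.
+# Assumes node-exporter series carry a `node` label and kube-state-metrics
+# exposes kubernetes.io/arch (e.g. via --metric-labels-allowlist).
+sum by (node, label_kubernetes_io_arch) (
+  node_memory_MemAvailable_bytes
+  * on(node) group_left(label_kubernetes_io_arch)
+  kube_node_labels{label_kubernetes_io_arch!=""}
+)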
+```
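+
+If the Prometheus Operator is in use (an assumption, as are the rule name, namespace, and 512 MiB threshold), a query like the one above could back a simple low-memory alert for the Pis. This is a sketch, not a tuned rule:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: arm64-node-memory      # hypothetical name
+  namespace: observability
+spec:
+  groups:
+    - name: arm64-nodes
+      rules:
+        - alert: Arm64NodeLowMemory
+          # Joins node-exporter memory with the arch label from kube-state-metrics;
+          # assumes both series share a `node` label
+          expr: |
+            (
+              node_memory_MemAvailable_bytes
+              * on(node) group_left(label_kubernetes_io_arch)
+              kube_node_labels{label_kubernetes_io_arch="arm64"}
+            ) < 512 * 1024 * 1024
+          for: 10m
+          labels:
+            severity: warning
+          annotations:
+            summary: "ARM64 node {{ $labels.node }} has under 512 MiB of available memory"
+```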
+
+## Future Considerations
+
+- **Edge workloads**: ARM64 nodes ideal for edge compute patterns
+- **IoT integration**: MQTT, sensor data collection
+- **Scale-out**: Add more Pis for lightweight workload capacity
+- **ARM64 ML inference**: Some runtimes, such as TensorFlow Lite, support ARM64
+
+## Links
+
+* [Kubernetes Multi-Architecture](https://kubernetes.io/docs/concepts/containers/images/#multi-architecture-images)
+* [Talos on Raspberry Pi](https://talos.dev/v1.12/talos-guides/install/single-board-computers/rpi_generic/)
+* Related: [ADR-0002](0002-use-talos-linux.md) - Use Talos Linux
+* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy