updating to match everything in my homelab.
195
decisions/0035-arm64-worker-strategy.md
Normal file
@@ -0,0 +1,195 @@
# ARM64 Raspberry Pi Worker Node Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Integrate Raspberry Pi nodes into the Kubernetes cluster

## Context and Problem Statement

The homelab cluster includes 5 Raspberry Pi 4/5 nodes (ARM64 architecture) alongside x86_64 servers. These low-power nodes provide:
- Additional compute capacity for lightweight workloads
- Physical distribution across the home network
- A learning platform for multi-architecture Kubernetes

However, ARM64 nodes have constraints:
- No GPU acceleration
- Less CPU and memory than the x86_64 servers
- Some container images lack ARM64 support
- Limited local storage

How do we integrate the ARM64 nodes effectively while avoiding scheduling failures?

## Decision Drivers

* Maximize utilization of ARM64 compute
* Prevent ARM-incompatible workloads from scheduling
* Maintain cluster stability
* Support multi-arch container images
* Minimize operational overhead

## Considered Options

1. **Node labels + affinity for workload placement**
2. **Separate ARM64-only namespace**
3. **Taints to exclude from general scheduling**
4. **ARM64 nodes for specific workload types only**

## Decision Outcome

Chosen option: **Option 1 + Option 4 hybrid** - Use node labels with affinity rules, and designate ARM64 nodes for specific workload categories.

ARM64 nodes handle:
- Lightweight control plane components (where multi-arch images exist)
- Velero node-agent (backup DaemonSet)
- Node-level monitoring (Prometheus node-exporter)
- Future: Edge/IoT workloads

### Positive Consequences

* Clear workload segmentation
* Avoids scheduling failures caused by architecture mismatch
* Efficient use of low-power nodes
* Room for future ARM-specific workloads
* Cost-effective cluster expansion

### Negative Consequences

* Some nodes may be underutilized
* Every image intended for the Pis must be checked for ARM64 support
* Additional scheduling complexity

## Cluster Composition

| Node | Architecture | Role | Instance Type |
|------|--------------|------|---------------|
| bruenor | amd64 | control-plane | - |
| catti | amd64 | control-plane | - |
| storm | amd64 | control-plane | - |
| khelben | amd64 | GPU worker (Strix Halo) | - |
| elminster | amd64 | GPU worker (NVIDIA) | - |
| drizzt | amd64 | GPU worker (RDNA2) | - |
| danilo | amd64 | GPU worker (Intel Arc) | - |
| regis | amd64 | worker | - |
| wulfgar | amd64 | worker | - |
| **durnan** | **arm64** | worker | raspberry-pi |
| **elaith** | **arm64** | worker | raspberry-pi |
| **jarlaxle** | **arm64** | worker | raspberry-pi |
| **mirt** | **arm64** | worker | raspberry-pi |
| **volo** | **arm64** | worker | raspberry-pi |

## Node Labels

```yaml
# Applied via Talos machine config or kubectl
labels:
  kubernetes.io/arch: arm64
  kubernetes.io/os: linux
  node.kubernetes.io/instance-type: raspberry-pi
  kubernetes.io/storage: none  # No Longhorn on Pis
```
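
If the labels are applied through Talos rather than `kubectl`, a machine config patch along the following lines could be used. This is a minimal sketch, assuming Talos's `machine.nodeLabels` field. `kubernetes.io/arch` and `kubernetes.io/os` are set by the kubelet automatically and are omitted. Whether other labels under the `kubernetes.io/` prefix (such as the storage label) are accepted this way has not been verified here and may depend on the NodeRestriction admission plugin.

```yaml
# Hypothetical Talos machine config patch for a Pi worker (sketch only)
machine:
  nodeLabels:
    node.kubernetes.io/instance-type: raspberry-pi
```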

## Workload Placement

### DaemonSets (Run Everywhere)

These run on all nodes including ARM64:

| DaemonSet | Namespace | Multi-arch |
|-----------|-----------|------------|
| velero-node-agent | velero | ✅ |
| cilium-agent | kube-system | ✅ |
| node-exporter | observability | ✅ |

### ARM64-Excluded Workloads

These workloads exclude ARM64 nodes by requiring `amd64` through node affinity:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - amd64
```

| Workload Type | Reason for Exclusion |
|---------------|----------------------|
| GPU workloads | No GPU on Pis |
| Longhorn | Pis are labeled `kubernetes.io/storage: none` |
| Heavy databases | Insufficient CPU and memory |
| Most HelmReleases | Image compatibility (no ARM64 variant) |
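
For workloads that only need the architecture constraint, the same exclusion can be written more compactly with `nodeSelector`. The fragment below is a sketch of the pod spec, equivalent in effect to the affinity rule above for this single label:

```yaml
# Pod spec fragment: schedule only onto amd64 nodes
spec:
  nodeSelector:
    kubernetes.io/arch: amd64
```

Node affinity remains the better fit when several values or operators such as `NotIn` are needed.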

### ARM64-Compatible Light Workloads

Potential future workloads for ARM64 nodes:

| Workload | Use Case |
|----------|----------|
| MQTT broker | IoT message routing |
| Pi-hole | DNS ad blocking |
| Home Assistant | Home automation |
| Lightweight proxies | Traffic routing |
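
As an illustration of how one of these workloads might be pinned to the Pis, here is a sketch of an MQTT broker Deployment that selects nodes by the `node.kubernetes.io/instance-type: raspberry-pi` label. The resource name, namespace, image tag, and resource figures are assumptions made for the example and are not part of this decision. The image is assumed to publish an ARM64 variant, which should be verified before use.

```yaml
# Sketch only: names, namespace, image tag, and resource figures are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mqtt-broker          # hypothetical name
  namespace: iot             # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mqtt-broker
  template:
    metadata:
      labels:
        app: mqtt-broker
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - raspberry-pi
      containers:
        - name: mosquitto
          image: eclipse-mosquitto:2   # assumed multi-arch; verify arm64 support
          ports:
            - containerPort: 1883
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              memory: 256Mi            # small footprint to suit Pi memory
```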

## Storage Exclusion

ARM64 nodes are excluded from Longhorn by restricting its system-managed components to `amd64` nodes:

```yaml
# Longhorn Helm values
defaultSettings:
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
```

Each Pi also carries the node label:
```yaml
kubernetes.io/storage: none
```

## Resource Constraints

| Node Type | CPU | Memory | Typical Available |
|-----------|-----|--------|-------------------|
| Raspberry Pi 4 | 4 cores | 4-8 GB | 3 cores, 3 GB |
| Raspberry Pi 5 | 4 cores | 8 GB | 3.5 cores, 6 GB |

## Multi-Architecture Image Strategy

For workloads that should run on ARM64:

1. **Use multi-arch base images** (e.g., `alpine`, `debian`)
2. **Build with Docker buildx**:
   ```bash
   # Needs a builder that supports multi-platform builds; add --push to publish the result
   docker buildx build --platform linux/amd64,linux/arm64 -t myimage:latest .
   ```
3. **Verify arch support** before deployment

## Monitoring ARM64 Nodes

```promql
# Available memory per node, grouped by architecture.
# Assumes node-exporter series carry a `node` label (added via relabeling)
# and that kube-state-metrics exposes the arch label on kube_node_labels.
sum by (node, label_kubernetes_io_arch) (
  node_memory_MemAvailable_bytes
  * on(node) group_left(label_kubernetes_io_arch)
  kube_node_labels{label_kubernetes_io_arch!=""}
)
```
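
A query of this shape could also back an alert. The following is a sketch of a `PrometheusRule`, assuming the Prometheus Operator CRDs are installed, that node-exporter series carry a `node` label, and that `kube_node_labels` exposes `label_kubernetes_io_arch`. The rule name, the 512 MiB threshold, and the 10 minute duration are illustrative assumptions, not values taken from this decision.

```yaml
# Sketch only: rule name, threshold, and duration are assumptions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arm64-node-memory      # hypothetical name
  namespace: observability
spec:
  groups:
    - name: arm64-nodes
      rules:
        - alert: Arm64NodeLowMemory
          # kube_node_labels has value 1, so the product keeps the byte count
          expr: |
            (
              node_memory_MemAvailable_bytes
              * on(node) group_left(label_kubernetes_io_arch)
              kube_node_labels{label_kubernetes_io_arch="arm64"}
            ) < 512 * 1024 * 1024
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "ARM64 node {{ $labels.node }} has less than 512 MiB available memory"
```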

## Future Considerations

- **Edge workloads**: ARM64 nodes ideal for edge compute patterns
- **IoT integration**: MQTT, sensor data collection
- **Scale-out**: Add more Pis for lightweight workload capacity
- **ARM64 ML inference**: Some models support ARM (TensorFlow Lite)

## Links

* [Kubernetes Multi-Architecture](https://kubernetes.io/docs/concepts/containers/images/#multi-architecture-images)
* [Talos on Raspberry Pi](https://talos.dev/v1.12/talos-guides/install/single-board-computers/rpi_generic/)
* Related: [ADR-0002](0002-use-talos-linux.md) - Use Talos Linux
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy