docs: add ADRs 0043-0053 covering remaining architecture gaps

New ADRs:
- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
2026-02-09 18:36:39 -05:00
parent 49ce970780
commit 5846d0dc16
12 changed files with 1141 additions and 1 deletions

@@ -0,0 +1,118 @@
# Cilium CNI and Network Fabric
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Select and configure the Container Network Interface (CNI) plugin for pod networking, load balancing, and service mesh capabilities on Talos Linux
## Context and Problem Statement
A Kubernetes cluster requires a CNI plugin to provide pod-to-pod networking, service load balancing, and network policy enforcement. The homelab runs on Talos Linux (immutable OS) with heterogeneous hardware (amd64 + arm64) and needs L2 LoadBalancer IP advertisement for bare-metal services.

How do we provide reliable, performant networking with bare-metal LoadBalancer support and observability integration?
## Decision Drivers
* Must work on Talos Linux (eBPF-capable kernel; prefer a dataplane that does not depend on iptables)
* Bare-metal LoadBalancer IP assignment (no cloud provider)
* L2 IP advertisement for LAN services
* kube-proxy replacement for performance
* Future-proof for network policy enforcement
* Active community and CNCF backing
## Considered Options
1. **Cilium** — eBPF-based CNI with kube-proxy replacement
2. **Calico** — Established CNI with eBPF and BGP support
3. **Flannel + MetalLB** — Simple overlay with separate LB
4. **Antrea** — VMware-backed OVS-based CNI
## Decision Outcome
Chosen option: **Cilium**, because it provides an eBPF-native dataplane that replaces kube-proxy, L2 LoadBalancer announcements (eliminating MetalLB), and native Talos support.
### Positive Consequences
* Single component handles CNI + kube-proxy + LoadBalancer (replaces 3 tools)
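* eBPF dataplane resolves Services through hash-map lookups rather than sequentially evaluated iptables rules, so per-packet cost stays roughly flat as the number of Services grows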
* L2 announcements provide bare-metal LoadBalancer without MetalLB
* Maglev + DSR load balancing for consistent hashing and reduced latency
* Strong Talos Linux integration and testing
* Prometheus metrics and Grafana dashboards included
### Negative Consequences
* More complex configuration than simple CNIs
* eBPF requires compatible kernel (Talos provides this)
* `bpf.hostLegacyRouting: true` workaround needed for Talos issue #10002; host traffic then falls back to the regular kernel routing stack instead of eBPF host routing
## Deployment Configuration
| Item | Value |
|---|---|
| **Chart** | `cilium` from `oci://ghcr.io/home-operations/charts-mirror/cilium` |
| **Version** | 1.18.6 |
| **Namespace** | `kube-system` |
### Core Networking
| Setting | Value | Rationale |
|---------|-------|-----------|
| `kubeProxyReplacement` | `true` | Replace kube-proxy entirely — lower latency, fewer components |
| `routingMode` | `native` | Direct routing, no encapsulation overhead |
| `autoDirectNodeRoutes` | `true` | Auto-configure inter-node routes |
| `ipv4NativeRoutingCIDR` | `10.42.0.0/16` | Pod CIDR; traffic to destinations inside it is routed natively and not masqueraded |
| `ipam.mode` | `kubernetes` | Use Kubernetes IPAM |
| `endpointRoutes.enabled` | `true` | Install a route per pod endpoint instead of routing all pod traffic through the `cilium_host` device |
| `bpf.masquerade` | `true` | eBPF-based masquerading |
| `bpf.hostLegacyRouting` | `true` | Workaround for Talos issue #10002 |
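
As a rough sketch, the table above maps onto Cilium Helm values as shown below. The nesting simply follows the dotted key names in the table; the file name and comments are illustrative and not copied from the repository.

```yaml
# values.yaml (illustrative fragment; keys and values taken from the table above)
kubeProxyReplacement: true
routingMode: native
autoDirectNodeRoutes: true
ipv4NativeRoutingCIDR: 10.42.0.0/16
ipam:
  mode: kubernetes
endpointRoutes:
  enabled: true
bpf:
  masquerade: true
  hostLegacyRouting: true   # workaround for Talos issue #10002
```

With kube-proxy replaced, the agent also needs a way to reach the API server without relying on the `kubernetes` Service. The Cilium chart exposes `k8sServiceHost` and `k8sServicePort` for this; the values used in this cluster are not recorded in this ADR.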
### Load Balancing
| Setting | Value | Rationale |
|---------|-------|-----------|
| `loadBalancer.algorithm` | `maglev` | Consistent hashing for stable backend selection |
| `loadBalancer.mode` | `dsr` | Direct Server Return — response bypasses LB for lower latency |
| `socketLB.enabled` | `true` | Translate Service addresses at the socket `connect()` call, which also covers host-namespace pods |
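
Expressed as Helm values, these load-balancing settings would look roughly like this. It is a sketch derived from the table, not the repository's actual file.

```yaml
# values.yaml (illustrative fragment; keys and values taken from the table above)
loadBalancer:
  algorithm: maglev   # consistent hashing across backends
  mode: dsr           # backends reply to the client directly
socketLB:
  enabled: true
```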
### L2 Announcements
Cilium replaces MetalLB for bare-metal LoadBalancer IP assignment:
```
CiliumLoadBalancerIPPool: 192.168.100.0/24
CiliumL2AnnouncementPolicy: announces on all Linux nodes
```
Key VIPs assigned from this pool:
- `192.168.100.200` — k8s-gateway (internal DNS)
- `192.168.100.201` — envoy-internal gateway
- `192.168.100.210` — envoy-external gateway
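
The pool and policy summarized above might be written as manifests along the following lines. This is a minimal sketch under stated assumptions: the resource names, the `apiVersion` values, the Service's namespace, selector, and port are all hypothetical, and only the CIDR, the "all Linux nodes" selector, and the `192.168.100.200` VIP come from this ADR. It also assumes `l2announcements.enabled: true` is set in the Helm values, which this ADR does not list.

```yaml
# Assumption: apiVersion may differ per Cilium release; use the version
# served by the CRDs installed in the cluster.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lan-pool              # hypothetical name
spec:
  blocks:
    - cidr: 192.168.100.0/24  # pool range from this ADR
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-policy             # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux # "all Linux nodes"
  loadBalancerIPs: true       # announce LoadBalancer IPs via ARP
---
# Hypothetical Service pinning one of the VIPs listed above.
apiVersion: v1
kind: Service
metadata:
  name: k8s-gateway
  namespace: network          # hypothetical namespace
  annotations:
    lbipam.cilium.io/ips: "192.168.100.200"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: k8s-gateway   # hypothetical label
  ports:
    - name: dns
      port: 53
      protocol: UDP
```

Pinning addresses with the `lbipam.cilium.io/ips` annotation keeps each VIP stable across Service re-creation, which matters when DNS records or router rules point at fixed addresses.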
### Multi-Network Support
| Setting | Value | Rationale |
|---------|-------|-----------|
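| `cni.exclusive` | `false` | Paired with Multus CNI for multi-network pods |

Setting `cni.exclusive` to `false` stops Cilium from taking sole ownership of the node's CNI configuration directory, so Multus can attach secondary network interfaces (e.g., a VPN network) to workloads such as qBittorrent.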
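
To illustrate the multi-network pattern, a Multus secondary network and a pod that requests it could look like the sketch below. Everything here is hypothetical (network name, namespace, `macvlan` type, `eth0` master interface, DHCP IPAM, and the placeholder image); the ADR only states that Multus provides secondary interfaces.

```yaml
# Hypothetical secondary network definition consumed by Multus.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vpn-net
  namespace: media
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": { "type": "dhcp" }
    }
---
# Pod requesting the extra interface; Cilium still provides the primary one.
apiVersion: v1
kind: Pod
metadata:
  name: qbittorrent
  namespace: media
  annotations:
    k8s.v1.cni.cncf.io/networks: vpn-net
spec:
  containers:
    - name: app
      image: example.invalid/qbittorrent:placeholder  # placeholder image
```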
### Disabled Features
| Feature | Reason |
|---------|--------|
| Hubble | Not needed — tracing handled by OpenTelemetry stack |
| Gateway API | Offloaded to dedicated Envoy Gateway deployment |
| Envoy (built-in) | Using separate Envoy Gateway for more control |
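
One way to express these choices in Helm values is sketched below. The key names are the Cilium chart's toggles for each feature; that the repository sets them exactly this way is an assumption.

```yaml
# values.yaml (illustrative fragment; assumed, not copied from the repository)
hubble:
  enabled: false      # flow observability not deployed
gatewayAPI:
  enabled: false      # Gateway API handled by Envoy Gateway instead
envoy:
  enabled: false      # no Cilium-managed Envoy DaemonSet
```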
## Observability
- **Prometheus:** ServiceMonitor enabled for both agent and operator
- **Grafana:** Two dashboards (Cilium agent and operator) provisioned as `GrafanaDashboard` custom resources
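
For reference, enabling ServiceMonitors for the agent and the operator in the Cilium chart would look roughly like this. It is a sketch of the relevant keys, assuming the Prometheus Operator CRDs are installed; the `GrafanaDashboard` resources are managed separately and are not shown.

```yaml
# values.yaml (illustrative fragment)
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true     # agent metrics
operator:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true   # operator metrics
```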
## Links
* Related to [ADR-0044](0044-dns-and-external-access.md) (L2 IPs feed gateway VIPs)
* Related to [ADR-0002](0002-use-talos-linux.md) (Talos eBPF compatibility)
* [Cilium Documentation](https://docs.cilium.io/)
* [Talos Cilium Guide](https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/)