Cilium CNI and Network Fabric

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Select and configure the Container Network Interface (CNI) plugin for pod networking, load balancing, and service mesh capabilities on Talos Linux

Context and Problem Statement

A Kubernetes cluster requires a CNI plugin to provide pod-to-pod networking, service load balancing, and network policy enforcement. The homelab runs on Talos Linux (immutable OS) with heterogeneous hardware (amd64 + arm64) and needs L2 LoadBalancer IP advertisement for bare-metal services.

How do we provide reliable, performant networking with bare-metal LoadBalancer support and observability integration?

Decision Drivers

  • Must work on Talos Linux (eBPF-capable kernel; avoids reliance on iptables)
  • Bare-metal LoadBalancer IP assignment (no cloud provider)
  • L2 IP advertisement for LAN services
  • kube-proxy replacement for performance
  • Future-proof for network policy enforcement
  • Active community and CNCF backing

Considered Options

  1. Cilium — eBPF-based CNI with kube-proxy replacement
  2. Calico — Established CNI with eBPF and BGP support
  3. Flannel + MetalLB — Simple overlay with separate LB
  4. Antrea — VMware-backed OVS-based CNI

Decision Outcome

Chosen option: Cilium, because it provides an eBPF-native dataplane that replaces kube-proxy, L2 LoadBalancer announcements (eliminating MetalLB), and native Talos support.

Positive Consequences

  • Single component handles CNI + kube-proxy + LoadBalancer (replaces 3 tools)
  • eBPF dataplane is more efficient than iptables on large clusters
  • L2 announcements provide bare-metal LoadBalancer without MetalLB
  • Maglev + DSR load balancing for consistent hashing and reduced latency
  • Strong Talos Linux integration and testing
  • Prometheus metrics and Grafana dashboards included

Negative Consequences

  • More complex configuration than simple CNIs
  • eBPF requires compatible kernel (Talos provides this)
  • hostLegacyRouting: true workaround needed for Talos issue #10002

Deployment Configuration

  • Chart: cilium (oci://ghcr.io/home-operations/charts-mirror/cilium)
  • Version: 1.18.6
  • Namespace: kube-system

Core Networking

| Setting | Value | Rationale |
| --- | --- | --- |
| kubeProxyReplacement | true | Replace kube-proxy entirely — lower latency, fewer components |
| routingMode | native | Direct routing, no encapsulation overhead |
| autoDirectNodeRoutes | true | Auto-configure inter-node routes |
| ipv4NativeRoutingCIDR | 10.42.0.0/16 | Pod CIDR for native routing |
| ipam.mode | kubernetes | Use Kubernetes IPAM |
| endpointRoutes.enabled | true | Per-endpoint routing for better granularity |
| bpf.masquerade | true | eBPF-based masquerading |
| bpf.hostLegacyRouting | true | Workaround for Talos issue #10002 |
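Expressed as Helm values, these settings correspond roughly to the following. This is a sketch against the Cilium 1.18 chart; verify key paths against the chart's values reference for the deployed version.

```yaml
# Core networking values (sketch, Cilium 1.18 chart).
kubeProxyReplacement: true
routingMode: native
autoDirectNodeRoutes: true
ipv4NativeRoutingCIDR: 10.42.0.0/16
ipam:
  mode: kubernetes
endpointRoutes:
  enabled: true
bpf:
  masquerade: true
  hostLegacyRouting: true  # workaround for Talos issue #10002
```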

Load Balancing

| Setting | Value | Rationale |
| --- | --- | --- |
| loadBalancer.algorithm | maglev | Consistent hashing for stable backend selection |
| loadBalancer.mode | dsr | Direct Server Return — response bypasses LB for lower latency |
| socketLB.enabled | true | Socket-level load balancing for host-namespace pods |
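The load-balancing settings translate to Helm values along these lines (a sketch; key paths per the Cilium 1.18 chart):

```yaml
loadBalancer:
  algorithm: maglev  # consistent hashing for stable backend selection
  mode: dsr          # Direct Server Return — responses bypass the LB node
socketLB:
  enabled: true      # socket-level LB for host-namespace pods
```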

L2 Announcements

Cilium replaces MetalLB for bare-metal LoadBalancer IP assignment:

  • CiliumLoadBalancerIPPool: 192.168.100.0/24
  • CiliumL2AnnouncementPolicy: announces on all Linux nodes

Key VIPs assigned from this pool:

  • 192.168.100.200 — k8s-gateway (internal DNS)
  • 192.168.100.201 — envoy-internal gateway
  • 192.168.100.210 — envoy-external gateway
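The two resources can be sketched as below. The resource names here are hypothetical; field names follow the cilium.io/v2alpha1 CRDs and may differ across Cilium versions, so check the L2 announcements and LB-IPAM docs for the version in use.

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lan-pool  # hypothetical name
spec:
  blocks:
    - cidr: 192.168.100.0/24
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-all-linux-nodes  # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux  # announce on all Linux nodes
  loadBalancerIPs: true        # announce LoadBalancer service IPs
```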

Multi-Network Support

| Setting | Value | Rationale |
| --- | --- | --- |
| cni.exclusive | false | Paired with Multus CNI for multi-network pods |

This enables workloads like qbittorrent to use secondary network interfaces (e.g., VPN).
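Attaching a secondary interface uses Multus's standard network-selection annotation. In this sketch the pod spec, image, and the NetworkAttachmentDefinition name `vpn` are all hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qbittorrent
  annotations:
    # Multus attaches the network defined by a NetworkAttachmentDefinition
    # named "vpn" (hypothetical) as a secondary interface.
    k8s.v1.cni.cncf.io/networks: vpn
spec:
  containers:
    - name: qbittorrent
      image: example/qbittorrent:latest  # placeholder image
```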

Disabled Features

| Feature | Reason |
| --- | --- |
| Hubble | Not needed — tracing handled by OpenTelemetry stack |
| Gateway API | Offloaded to dedicated Envoy Gateway deployment |
| Envoy (built-in) | Using separate Envoy Gateway for more control |
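In Helm values terms, the Hubble and Gateway API toggles look roughly like this (a sketch; the exact key for the built-in Envoy varies by chart version and is omitted here):

```yaml
hubble:
  enabled: false      # tracing handled by the OpenTelemetry stack instead
gatewayAPI:
  enabled: false      # Gateway API served by the dedicated Envoy Gateway deployment
```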

Observability

  • Prometheus: ServiceMonitor enabled for both agent and operator
  • Grafana: Two dashboards via GrafanaDashboard CRDs (Cilium agent + operator)
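The ServiceMonitor settings above can be sketched in Helm values as follows (key paths per the Cilium chart; verify against the deployed chart version):

```yaml
prometheus:
  enabled: true            # expose agent metrics
  serviceMonitor:
    enabled: true          # ServiceMonitor for the Cilium agent
operator:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true        # ServiceMonitor for the operator
```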