- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
Use Talos Linux for Kubernetes Nodes
- Status: accepted
- Date: 2025-11-30
- Deciders: Billy Davies
- Technical Story: Selecting an OS for the bare-metal Kubernetes cluster
Context and Problem Statement
We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).
Decision Drivers
- Security-first design (immutable, minimal)
- API-driven management (no SSH)
- Support for various GPU drivers
- Kubernetes-native focus
- Community support and updates
- Ease of upgrades
Considered Options
- Ubuntu Server with kubeadm
- Flatcar Container Linux
- Talos Linux
- k3OS (discontinued)
- Rocky Linux with RKE2
Decision Outcome
Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
Positive Consequences
- Immutable root filesystem prevents drift
- No SSH reduces attack vectors
- API-driven management integrates well with GitOps
- Schematic system (Talos Image Factory) builds images that include system extensions such as GPU driver kernel modules
- Consistent configuration across all nodes
- API-driven upgrades that can be automated with minimal disruption
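To make the schematic mechanism concrete, here is a minimal sketch of how a schematic and the resulting installer image reference fit together. It is an illustration, not the cluster's actual configuration: the extension names, the placeholder schematic ID, and the Talos version are assumptions, and extension names can differ between Talos releases. The schematic ID itself is issued by the Image Factory when the schematic is submitted; this sketch does not compute it.

```python
from typing import List

# Public Talos Image Factory host.
FACTORY_HOST = "factory.talos.dev"


def render_schematic(extensions: List[str]) -> str:
    """Render a minimal Image Factory schematic as YAML text.

    Only official system extensions are listed; names are sorted so the
    rendered text is stable for a given set of extensions.
    """
    lines = [
        "customization:",
        "  systemExtensions:",
        "    officialExtensions:",
    ]
    lines += [f"      - {name}" for name in sorted(extensions)]
    return "\n".join(lines) + "\n"


def installer_image(schematic_id: str, talos_version: str) -> str:
    """Build the installer image reference for a schematic ID and version."""
    return f"{FACTORY_HOST}/installer/{schematic_id}:{talos_version}"


# Hypothetical values for illustration only.
EXAMPLE_EXTENSIONS = ["siderolabs/i915-ucode", "siderolabs/amd-ucode"]
EXAMPLE_SCHEMATIC_ID = "<schematic-id>"  # placeholder, returned by the factory
EXAMPLE_VERSION = "v1.8.3"               # assumed version

if __name__ == "__main__":
    print(render_schematic(EXAMPLE_EXTENSIONS))
    print(installer_image(EXAMPLE_SCHEMATIC_ID, EXAMPLE_VERSION))
```

Keeping the rendered schematic in Git next to the node configuration makes the schematic ID reproducible, which fits the GitOps workflow mentioned above.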
Negative Consequences
- Learning curve for API-driven management
- Debugging requires different approaches (no SSH)
- Custom extensions require generating and tracking Image Factory schematic IDs
- Less flexibility for non-Kubernetes workloads
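As a rough illustration of what debugging without SSH looks like, the sketch below assembles `talosctl` invocations that cover typical first-look tasks (kernel log, service status, kubelet logs). The helper names and the node address are hypothetical. The functions only build argument lists and do not run anything, so they can be inspected or passed to `subprocess.run` on a machine with a valid talosconfig.

```python
from typing import List


def talosctl(node: str, *args: str) -> List[str]:
    """Build a talosctl argument list targeting a single node (not executed)."""
    return ["talosctl", "--nodes", node, *args]


def debug_commands(node: str) -> List[List[str]]:
    """Return common read-only inspection commands for one node."""
    return [
        talosctl(node, "dmesg"),            # kernel ring buffer
        talosctl(node, "services"),         # status of Talos system services
        talosctl(node, "logs", "kubelet"),  # logs of the kubelet service
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-range address used as a stand-in.
    for cmd in debug_commands("192.0.2.10"):
        print(" ".join(cmd))
```

Each command goes through the Talos API with client certificates rather than a login shell, which is the trade-off this ADR accepts.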
Pros and Cons of the Options
Ubuntu Server with kubeadm
- Good, because familiar tooling and workflows
- Good, because extensive package availability
- Good, because easy debugging via SSH
- Bad, because mutable system leads to drift
- Bad, because large attack surface
- Bad, because manual package management
Flatcar Container Linux
- Good, because immutable
- Good, because auto-updates
- Good, because container-focused
- Bad, because less Kubernetes-specific
- Bad, because smaller community than Talos
- Bad, because GPU driver setup more complex
Talos Linux
- Good, because purpose-built for Kubernetes
- Good, because immutable and minimal
- Good, because API-driven (no SSH)
- Good, because excellent Kubernetes integration
- Good, because active development and community
- Good, because schematic system for GPU drivers
- Bad, because learning curve
- Bad, because no traditional shell or SSH-based debugging
k3OS
- Good, because simple
- Bad, because discontinued
Rocky Linux with RKE2
- Good, because it resembles enterprise (RHEL-compatible) environments
- Good, because familiar Linux experience
- Bad, because mutable system
- Bad, because more operational overhead
- Bad, because larger attack surface
Links
- Talos Linux
- Talos Image Factory
- Related: ADR-0005 - GPU driver integration via schematics