# Use Talos Linux for Kubernetes Nodes

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster

## Context and Problem Statement

We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).

## Decision Drivers

* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades

## Considered Options

* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2

## Decision Outcome

Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.

### Positive Consequences

* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Image Factory schematics add system extensions, including GPU kernel modules
* Consistent configuration across all nodes
* Image-based upgrades through the API, with minimal disruption

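
The schematic workflow mentioned above can be sketched roughly as follows. This is a minimal illustration, not the cluster's actual tooling: the helper names (`render_schematic`, `installer_image`), the extension names, and the Talos version are assumptions chosen for the example, and the real extension names should be looked up in the Image Factory for the Talos release in use.

```python
from typing import Iterable

# Assumption: placeholder extension names for illustration only.
# Check the Image Factory catalogue for the names matching your Talos version.
EXAMPLE_EXTENSIONS = [
    "siderolabs/amdgpu",
    "siderolabs/nvidia-container-toolkit",
]


def render_schematic(extensions: Iterable[str]) -> str:
    """Render a schematic document that lists official system extensions."""
    lines = [
        "customization:",
        "  systemExtensions:",
        "    officialExtensions:",
    ]
    # Sort and de-duplicate so the same set of extensions always renders identically.
    lines += [f"      - {name}" for name in sorted(set(extensions))]
    return "\n".join(lines) + "\n"


def installer_image(schematic_id: str, talos_version: str) -> str:
    """Build the installer image reference for a schematic ID and Talos version."""
    return f"factory.talos.dev/installer/{schematic_id}:{talos_version}"


if __name__ == "__main__":
    print(render_schematic(EXAMPLE_EXTENSIONS))
    # "<schematic-id>" stands for the ID the Image Factory returns for the
    # uploaded schematic; "v1.8.0" is a hypothetical version.
    print(installer_image("<schematic-id>", "v1.8.0"))
```

Uploading the rendered document to the Image Factory yields the schematic ID, and the resulting installer image reference is what a node's machine configuration or an upgrade command would point at.
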
### Negative Consequences

* Learning curve for API-driven management
* Debugging relies on `talosctl` and the Kubernetes API instead of SSH
* Custom extensions require an image built from an Image Factory schematic ID
* Less flexibility for non-Kubernetes workloads

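
To make the "no SSH" trade-off concrete, the sketch below maps a few common SSH-era debugging habits to `talosctl` invocations. It only builds the argument lists and does not run anything; the helper names and the node address are hypothetical, and it assumes a working `talosctl` client configuration.

```python
import shlex
from typing import Dict, List


def talosctl(node: str, *args: str) -> List[str]:
    """Build a talosctl argument list targeting a single node (not executed)."""
    return ["talosctl", "--nodes", node, *args]


def debug_commands(node: str) -> Dict[str, List[str]]:
    """Map familiar debugging tasks to their talosctl equivalents."""
    return {
        # Instead of: ssh node dmesg
        "kernel_log": talosctl(node, "dmesg"),
        # Instead of: ssh node systemctl status
        "services": talosctl(node, "services"),
        # Instead of: ssh node journalctl -u kubelet
        "kubelet_logs": talosctl(node, "logs", "kubelet"),
    }


if __name__ == "__main__":
    # 10.0.0.10 is a placeholder node address.
    for task, argv in debug_commands("10.0.0.10").items():
        print(f"{task}: {shlex.join(argv)}")
```
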
## Pros and Cons of the Options

### Ubuntu Server with kubeadm

* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management

### Flatcar Container Linux

* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex

### Talos Linux

* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging

### k3OS

* Good, because simple
* Bad, because discontinued

### Rocky Linux with RKE2

* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface

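
As a rough cross-check of the lists above, the "Good" and "Bad" bullets can be tallied per option. The counts are taken directly from this record; the unweighted net score is only an illustration, since it treats every bullet as equally important and ignores disqualifying factors such as k3OS being discontinued.

```python
from typing import Dict, List, Tuple

# (good, bad) bullet counts per option, copied from the lists in this ADR.
PROS_CONS: Dict[str, Tuple[int, int]] = {
    "Ubuntu Server with kubeadm": (3, 3),
    "Flatcar Container Linux": (3, 3),
    "Talos Linux": (6, 2),
    "k3OS": (1, 1),
    "Rocky Linux with RKE2": (2, 3),
}


def net_scores(tally: Dict[str, Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return (option, good - bad) pairs, highest net score first."""
    scores = [(name, good - bad) for name, (good, bad) in tally.items()]
    # sorted() is stable, so tied options keep the order in which they are listed.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in net_scores(PROS_CONS):
        print(f"{score:+d}  {name}")
```
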
## Links

* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics