feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
97
decisions/0002-use-talos-linux.md
Normal file
@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster

## Context and Problem Statement

We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).

## Decision Drivers

* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades

## Considered Options

* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2

## Decision Outcome

Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.

### Positive Consequences

* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption

### Negative Consequences

* Learning curve for API-driven management
* Debugging requires different approaches (`talosctl` and the Talos API instead of SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
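
To illustrate what "custom extensions require schematic IDs" means in practice, here is a minimal sketch of how a schematic ID from the Talos Image Factory can be turned into an installer image reference. The function name, the validation rules, and the example ID and version are hypothetical and not taken from this repository; the `factory.talos.dev/installer/<schematic-id>:<version>` layout follows the Image Factory's published convention.

```python
import re

# Host of the public Talos Image Factory (see the Links section).
FACTORY_HOST = "factory.talos.dev"


def installer_image(schematic_id: str, talos_version: str) -> str:
    """Build an installer image reference for a schematic and Talos version.

    Assumption: schematic IDs are 64-character lowercase hex digests and
    versions are tags starting with "v" (for example "v1.8.0").
    """
    if not re.fullmatch(r"[0-9a-f]{64}", schematic_id):
        raise ValueError("schematic ID must be 64 lowercase hex characters")
    if not re.fullmatch(r"v\d+\.\d+\.\d+\S*", talos_version):
        raise ValueError("Talos version must look like 'v1.8.0'")
    return f"{FACTORY_HOST}/installer/{schematic_id}:{talos_version}"


if __name__ == "__main__":
    # Placeholder ID for illustration only; a real ID comes from the factory.
    example_id = "0" * 64
    print(installer_image(example_id, "v1.8.0"))
```

Because the ID identifies a specific set of extensions, nodes with different GPU vendors would each carry their own schematic ID, which is the bookkeeping cost this consequence refers to.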

## Pros and Cons of the Options

### Ubuntu Server with kubeadm

* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management

### Flatcar Container Linux

* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex

### Talos Linux

* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging (no SSH access or shell on the node)
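
As a rough illustration of what debugging without SSH looks like, the sketch below maps a few common node-inspection tasks to `talosctl` invocations. The helper name, the task labels, and the node address (a documentation-range IP) are hypothetical; the subcommands shown (`dmesg`, `logs`, `services`, `containers`) are standard `talosctl` commands, and the script only prints the command lines rather than running them.

```python
import shlex


def talosctl_cmd(node: str, *args: str) -> list[str]:
    """Return the argv for a talosctl call targeting a single node."""
    return ["talosctl", "--nodes", node, *args]


# Tasks that would traditionally be done over SSH, with talosctl equivalents.
DEBUG_EQUIVALENTS = {
    "kernel log": ("dmesg",),
    "kubelet logs": ("logs", "kubelet"),
    "service status": ("services",),
    "Kubernetes containers": ("containers", "--kubernetes"),
}


if __name__ == "__main__":
    node = "192.0.2.10"  # hypothetical node address
    for task, args in DEBUG_EQUIVALENTS.items():
        print(f"{task}: {shlex.join(talosctl_cmd(node, *args))}")
```

Every call goes through the authenticated Talos API, so access can be controlled with client certificates instead of per-node user accounts.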

### k3OS

* Good, because simple
* Bad, because discontinued

### Rocky Linux with RKE2

* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface

## Links

* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics