feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and `kubectl cluster-info dump` data.
decisions/0005-multi-gpu-strategy.md (new file, 145 lines)
@@ -0,0 +1,145 @@
# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA-optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for the LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
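The NVIDIA sharing noted under the positive consequences relies on device-plugin time-slicing, which advertises one physical GPU as several schedulable `nvidia.com/gpu` resources. A minimal sketch of that configuration, assuming the upstream NVIDIA k8s-device-plugin with config-file support (the ConfigMap name and namespace are illustrative):

```yaml
# Illustrative ConfigMap consumed by the NVIDIA device plugin on elminster.
# With replicas: 2, the single RTX 2070 is advertised as two nvidia.com/gpu
# resources, letting Whisper and XTTS be scheduled side by side.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config  # name is an assumption
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```

Note that time-slicing provides no memory isolation: both replicas share the 8GB of VRAM, so the services must fit together.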
## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
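For reference, the shorthand above corresponds to a standard Kubernetes taint on the Node object, equivalent to this sketch:

```yaml
# Equivalent taint as it appears in the Node spec
apiVersion: v1
kind: Node
metadata:
  name: khelben
spec:
  taints:
    - key: dedicated
      value: vllm
      effect: NoSchedule
```

With this in place, only pods carrying a matching toleration can be scheduled onto khelben.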
### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
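Putting the pieces together, one row of the allocation table translates into a manifest along these lines. This is a sketch only: the Deployment name, labels, and image are placeholders, not taken from the actual repositories.

```yaml
# Illustrative: pinning the BGE embedding service to drizzt's Radeon 680M.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bge-embeddings  # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bge-embeddings
  template:
    metadata:
      labels:
        app: bge-embeddings
    spec:
      # Pin to the node whose GPU matches the workload (no taint on drizzt,
      # so a nodeSelector suffices; no toleration needed).
      nodeSelector:
        kubernetes.io/hostname: drizzt
      containers:
        - name: bge-embeddings
          image: ghcr.io/example/bge-embeddings:latest  # placeholder image
          resources:
            limits:
              amd.com/gpu: 1
```

Extended resources such as `amd.com/gpu` and `nvidia.com/gpu` are integer-valued and cannot be requested fractionally; sharing happens only via mechanisms like the time-slicing described above.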
## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling is immature
## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics