feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions
--- a/decisions/0006-gitops-with-flux.md
+++ b/decisions/0006-gitops-with-flux.md
@@ -0,0 +1,140 @@
+# GitOps with Flux CD
+
+* Status: accepted
+* Date: 2025-11-30
+* Deciders: Billy Davies
+* Technical Story: Implementing GitOps for cluster management
+
+## Context and Problem Statement
+
+Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and leaves no record of how cluster state changed over time.
+
+## Decision Drivers
+
+* Infrastructure as Code (IaC) principles
+* Audit trail for all changes
+* Self-healing cluster state
+* Multi-repository support
+* Secret encryption integration
+* Active community and maintenance
+
+## Considered Options
+
+* Manual kubectl apply
+* ArgoCD
+* Flux CD
+* Rancher Fleet
+* Pulumi/Terraform for Kubernetes
+
+## Decision Outcome
+
+Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support and SOPS integration, and it aligns well with the Kubernetes ecosystem.
+
+### Positive Consequences
+
+* Git is the single source of truth
+* Automatic drift detection and correction
+* Native decryption of SOPS/age-encrypted secrets
+* Multi-repository support (homelab-k8s2 + llm-workflows)
+* Helm and Kustomize native support
+* Webhook-free sync (pull-based)
+
+### Negative Consequences
+
+* No built-in UI (use the `flux` CLI or a third-party dashboard)
+* Learning curve for CRD-based configuration
+* Debugging requires understanding Flux controllers
+
+## Configuration
+
+### Repository Structure
+
+```
+homelab-k8s2/
+├── kubernetes/
+│   ├── flux/            # Flux system config
+│   │   ├── config/
+│   │   │   ├── cluster.yaml
+│   │   │   └── secrets.yaml  # SOPS encrypted
+│   │   └── repositories/
+│   │       ├── helm/    # HelmRepositories
+│   │       └── git/     # GitRepositories
+│   └── apps/            # Application Kustomizations
+```
+
+### Multi-Repository Sync
+
+```yaml
+# GitRepository for llm-workflows
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+  name: llm-workflows
+  namespace: flux-system
+spec:
+  interval: 5m  # required by the GitRepository API; 5m is an assumed value
+  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
+  ref:
+    branch: main
+  secretRef:
+    name: github-deploy-key
+```
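+
+A `GitRepository` only fetches the source; a Flux `Kustomization` is what applies manifests from it. The following is a minimal sketch of how the two might be wired together. The Kustomization name, `path`, and interval are assumptions for illustration and are not taken from the actual repositories.
+
+```yaml
+# Hypothetical Kustomization that applies manifests from the llm-workflows source
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: llm-workflows        # assumed name
+  namespace: flux-system
+spec:
+  interval: 10m              # assumed reconcile interval
+  sourceRef:
+    kind: GitRepository
+    name: llm-workflows      # matches the GitRepository above
+  path: ./                   # assumed path within the repository
+  prune: true                # remove cluster objects that were deleted from Git
+```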
+
+### SOPS Integration
+
+```yaml
+# .sops.yaml
+creation_rules:
+  - path_regex: .*\.sops\.yaml$
+    age: >-  # Public key
+      age1...
+```
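+
+On the cluster side, decryption is enabled per Flux `Kustomization`. A minimal sketch, assuming the age private key is stored in a Secret named `sops-age` in `flux-system` (a hypothetical name) and that the source name and interval below match the real setup:
+
+```yaml
+# Hypothetical Kustomization with SOPS decryption enabled
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+  name: apps                 # assumed name
+  namespace: flux-system
+spec:
+  interval: 10m              # assumed reconcile interval
+  sourceRef:
+    kind: GitRepository
+    name: flux-system        # assumed source name
+  path: ./kubernetes/apps
+  prune: true
+  decryption:
+    provider: sops
+    secretRef:
+      name: sops-age         # assumed Secret holding the age private key
+```
+
+With this in place, the kustomize-controller decrypts SOPS-encrypted files at apply time, so only ciphertext is committed to Git. The data key inside the Secret is expected to end in `.agekey`.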
+
+## Pros and Cons of the Options
+
+### Manual kubectl apply
+
+* Good, because simple
+* Good, because no setup
+* Bad, because no audit trail
+* Bad, because no drift detection
+* Bad, because not reproducible
+
+### ArgoCD
+
+* Good, because great UI
+* Good, because app-of-apps pattern
+* Good, because large community
+* Bad, because heavier resource usage
+* Bad, because near-real-time sync typically relies on Git webhooks (polling is the default)
+* Bad, because SOPS requires plugins
+
+### Flux CD
+
+* Good, because lightweight
+* Good, because pull-based (no webhooks)
+* Good, because native SOPS support
+* Good, because multi-source/multi-tenant
+* Good, because Kubernetes-native CRDs
+* Bad, because no built-in UI
+* Bad, because CRD learning curve
+
+### Rancher Fleet
+
+* Good, because integrated with Rancher
+* Good, because multi-cluster
+* Bad, because Rancher ecosystem lock-in
+* Bad, because smaller community
+
+### Pulumi/Terraform
+
+* Good, because familiar IaC tools
+* Good, because drift detection
+* Bad, because not Kubernetes-native
+* Bad, because requires state management
+* Bad, because not continuous reconciliation
+
+## Links
+
+* [Flux CD](https://fluxcd.io)
+* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
+* [flux-local](https://github.com/allenporter/flux-local) - Local testing