feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions

@@ -0,0 +1,140 @@
# GitOps with Flux CD
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management
## Context and Problem Statement
Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and leaves no record of how cluster state changed over time.
## Decision Drivers
* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance
## Considered Options
* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes
## Decision Outcome
Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support and SOPS integration, and because it aligns well with the Kubernetes ecosystem.
### Positive Consequences
* Git is the single source of truth
* Automatic drift detection and correction
* Native decryption of SOPS/age-encrypted secrets
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Pull-based sync that works without inbound webhooks
### Negative Consequences
* No built-in UI (use the `flux` CLI or a third-party UI)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers
## Configuration
### Repository Structure
```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                 # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml  # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/         # HelmRepositories
│   │       └── git/          # GitRepositories
│   └── apps/                 # Application Kustomizations
```
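As a minimal sketch of how the `apps/` directory in this layout could be reconciled, the following Kustomization assumes a GitRepository named `flux-system` (the name `flux bootstrap` creates by default; this cluster may use a different one). The object name `cluster-apps` and the interval are illustrative, not taken from the repository.

```yaml
# Hypothetical Kustomization for kubernetes/apps; names and interval are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 10m            # how often Flux re-applies and corrects drift
  path: ./kubernetes/apps
  prune: true              # delete cluster objects that were removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system      # assumed name of the homelab-k8s2 source
```

With `prune: true`, deleting a manifest from Git also removes the object from the cluster, which is what makes Git the effective source of truth rather than just a deployment input.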
### Multi-Repository Sync
```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  interval: 10m # required field; the value shown is an assumed example
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key # Secret holding the SSH deploy key
```
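A GitRepository only fetches the source; a Kustomization has to reference it before anything is applied. The sketch below shows one way the `llm-workflows` source could be consumed. The `path`, `interval`, and `timeout` values are hypothetical and would need to match the actual layout of that repository.

```yaml
# Hypothetical Kustomization that applies manifests from the llm-workflows source.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes       # assumed directory inside llm-workflows
  prune: true
  wait: true               # block until applied resources report ready
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: llm-workflows    # the GitRepository defined in this section
```

Keeping one Kustomization per source repository lets each repository be suspended, reconciled, or debugged independently.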
### SOPS Integration
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    # age public key (recipient); a trailing comment inside the
    # block scalar would become part of the value
    age: >-
      age1...
```
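The `.sops.yaml` rules only govern encryption on the workstation. For Flux to decrypt in the cluster, the Kustomization that applies the encrypted files needs a `decryption` block. This is a sketch under assumptions: the Secret name `sops-age`, the object name `cluster-config`, and the source name `flux-system` are illustrative.

```yaml
# Hypothetical Kustomization with SOPS decryption enabled.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/flux/config
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system      # assumed source name
  decryption:
    provider: sops
    secretRef:
      name: sops-age       # assumed Secret containing the age private key
```

Flux expects the age private key in that Secret under a key whose name ends in `.agekey`, and the Secret must live in the same namespace as the Kustomization.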
## Pros and Cons of the Options
### Manual kubectl apply
* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible
### ArgoCD
* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because low-latency sync typically relies on Git webhooks (the default is periodic polling)
* Bad, because SOPS requires plugins
### Flux CD
* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve
### Rancher Fleet
* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community
### Pulumi/Terraform
* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because no continuous reconciliation
## Links
* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing