feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
140
decisions/0006-gitops-with-flux.md
Normal file
@@ -0,0 +1,140 @@
# GitOps with Flux CD

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management

## Context and Problem Statement

Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and leaves no record of how cluster state changed over time.

## Decision Drivers

* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance

## Considered Options

* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes

## Decision Outcome

Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support and SOPS integration, and it aligns well with the Kubernetes ecosystem.

### Positive Consequences

* Git is the single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Native Helm and Kustomize support
* Webhook-free sync (pull-based)

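To illustrate the native Helm support listed above, a chart can be declared with a `HelmRepository` source and a `HelmRelease` that references it. This is a minimal sketch and is not taken from homelab-k8s2: the `podinfo` chart, its repository URL, the `default` target namespace, the intervals, and the values are all illustrative assumptions.

```yaml
# Hypothetical example: names, URL, intervals, and values are assumptions.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1h                # how often to refresh the chart index
  url: https://stefanprodan.github.io/podinfo
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 30m               # how often to reconcile the release
  targetNamespace: default    # assumed namespace for the workload
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo
  values:
    replicaCount: 1
```

Both objects live in Git, so a chart upgrade becomes a reviewed commit rather than an ad hoc `helm upgrade`.
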
### Negative Consequences

* No built-in UI (use the CLI or a third-party UI)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers

## Configuration

### Repository Structure

```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                  # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml   # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/          # HelmRepositories
│   │       └── git/           # GitRepositories
│   └── apps/                  # Application Kustomizations
```

### Multi-Repository Sync

```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  interval: 10m  # required by the API; this value is an assumption
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key  # Secret holding the SSH deploy key
```

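A `GitRepository` only fetches the source; a Flux `Kustomization` is what applies it to the cluster. The sketch below shows how the `llm-workflows` source above might be consumed. The `path`, `interval`, and `timeout` values are assumptions, since the layout of llm-workflows is not described here.

```yaml
# Hypothetical Kustomization: path, interval, and timeout are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  interval: 10m            # re-apply on this schedule, reverting drift
  sourceRef:
    kind: GitRepository
    name: llm-workflows    # the GitRepository defined above
  path: ./kubernetes       # assumed directory inside llm-workflows
  prune: true              # delete objects that were removed from Git
  wait: true               # wait for applied resources to become ready
  timeout: 5m
```

With `prune: true` and a periodic `interval`, this one object covers both the drift correction and the multi-repository support named among the positive consequences.
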
### SOPS Integration

```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    # Age public key (recipient)
    age: >-
      age1...
```

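The `.sops.yaml` rules govern encryption on the workstation; decryption inside the cluster is enabled per Flux `Kustomization`. The following is a sketch under assumptions: the Kustomization name `cluster-apps`, the `flux-system` source name, and the `sops-age` Secret name are hypothetical and not confirmed by this repository.

```yaml
# Hypothetical Kustomization: all names below are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system        # assumed name of the homelab-k8s2 source
  path: ./kubernetes/apps
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-age         # Secret whose data key ends in .agekey (Age private key)
```

The Age private key stays in the referenced Secret in the cluster, so only the public key ever needs to appear in Git.
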
## Pros and Cons of the Options

### Manual kubectl apply

* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible

### ArgoCD

* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins

### Flux CD

* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve

### Rancher Fleet

* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community

### Pulumi/Terraform

* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because not continuous reconciliation

## Links

* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing