# GitOps with Flux CD

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management

## Context and Problem Statement

Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.

## Decision Drivers

* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance

## Considered Options

* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes

## Decision Outcome

Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support and SOPS integration, and it aligns well with the Kubernetes ecosystem.

### Positive Consequences

* Git is the single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + Gitea daviestechlabs repos)
* Native Helm and Kustomize support
* Webhook-free sync (pull-based)

### Negative Consequences

* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers

## Configuration

### Repository Structure

```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                 # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml  # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/         # HelmRepositories
│   │       └── git/          # GitRepositories
│   └── apps/                 # Application Kustomizations
```

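As an illustration of what might live under `repositories/helm/`, a source definition could look like the sketch below. The object name `example-charts`, the URL, and the interval are placeholders chosen for this example, not values taken from this cluster.

```yaml
# Hypothetical HelmRepository; name, URL, and interval are illustrative only
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: example-charts              # placeholder name
  namespace: flux-system
spec:
  interval: 1h                      # how often Flux refreshes the chart index (assumed value)
  url: https://charts.example.com   # placeholder URL
```

Applications under `apps/` would then point at a source like this from the chart reference of a `HelmRelease`.
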
### Multi-Repository Sync

```yaml
# GitRepository for Gitea repos (daviestechlabs org)
# Examples: argo, kubeflow, chat-handler, voice-assistant
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: argo-workflows
  namespace: flux-system
spec:
  url: https://git.daviestechlabs.io/daviestechlabs/argo.git
  ref:
    branch: main
  # Public repos don't need secretRef
```

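A `GitRepository` only tells Flux where to fetch from; a `Kustomization` is what applies manifests from that source to the cluster. The following is a minimal sketch of that pairing, assuming the manifests sit at the root of the `argo` repo and that a 10-minute reconcile interval is acceptable. The object name, `path`, and `interval` are assumptions, not values from this cluster.

```yaml
# Hypothetical Kustomization consuming the argo-workflows GitRepository above
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: argo-workflows      # assumed; chosen to match the source name
  namespace: flux-system
spec:
  interval: 10m             # assumed reconcile interval
  path: ./                  # assumed location of manifests in the repo
  prune: true               # remove cluster resources that were deleted from Git
  sourceRef:
    kind: GitRepository
    name: argo-workflows    # the GitRepository defined above
```

With `prune: true`, resources removed from Git are also removed from the cluster on the next reconciliation, which is part of how drift gets corrected.
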
Note: The monolithic `llm-workflows` repo has been archived and decomposed into
focused repos in the daviestechlabs Gitea organization (e.g. `chat-handler`,
`voice-assistant`, `handler-base`, `ray-serve`). See AGENT-ONBOARDING.md
for the full list.

### SOPS Integration

```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    age: >- # Public key
      age1...
```

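The `.sops.yaml` rules above cover encryption on the authoring side. On the cluster side, Flux decrypts when a `Kustomization` enables it. The sketch below shows one way this could be wired up; the object name `cluster-apps`, the source name `flux-system`, the interval, and the Secret name `sops-age` are assumptions for illustration, while the `path` follows the repository tree shown earlier.

```yaml
# Hypothetical Kustomization with SOPS decryption enabled
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps          # assumed name
  namespace: flux-system
spec:
  interval: 10m               # assumed reconcile interval
  path: ./kubernetes/apps     # apps directory from the repository structure
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system         # assumed name of the homelab-k8s2 source
  decryption:
    provider: sops
    secretRef:
      name: sops-age          # assumed Secret holding the Age private key
```

The referenced Secret is expected to carry the Age private key under a data key ending in `.agekey`.
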
## Pros and Cons of the Options

### Manual kubectl apply

* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible

### ArgoCD

* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins

### Flux CD

* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve

### Rancher Fleet

* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community

### Pulumi/Terraform

* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because no continuous reconciliation

## Links

* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing