From 5846d0dc16ad1775d97db4aff2311ae0185340ab Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Mon, 9 Feb 2026 18:36:39 -0500 Subject: [PATCH] docs: add ADRs 0043-0053 covering remaining architecture gaps New ADRs: - 0043: Cilium CNI and Network Fabric - 0044: DNS and External Access Architecture - 0045: TLS Certificate Strategy (cert-manager) - 0046: Companions Frontend Architecture - 0047: MLflow Experiment Tracking and Model Registry - 0048: Entertainment and Media Stack - 0049: Self-Hosted Productivity Suite - 0050: Argo Rollouts Progressive Delivery - 0051: KEDA Event-Driven Autoscaling - 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS) - 0053: Vaultwarden Password Management README updated with table entries and badge count (53 total). --- README.md | 13 +- decisions/0043-cilium-cni-network-fabric.md | 118 +++++++++++++++ decisions/0044-dns-and-external-access.md | 129 +++++++++++++++++ decisions/0045-tls-certificate-strategy.md | 77 ++++++++++ .../0046-companions-frontend-architecture.md | 137 ++++++++++++++++++ decisions/0047-mlflow-experiment-tracking.md | 114 +++++++++++++++ decisions/0048-entertainment-media-stack.md | 99 +++++++++++++ .../0049-self-hosted-productivity-suite.md | 122 ++++++++++++++++ ...0050-argo-rollouts-progressive-delivery.md | 71 +++++++++ .../0051-keda-event-driven-autoscaling.md | 68 +++++++++ .../0052-cluster-utilities-optimization.md | 104 +++++++++++++ .../0053-vaultwarden-password-management.md | 90 ++++++++++++ 12 files changed, 1141 insertions(+), 1 deletion(-) create mode 100644 decisions/0043-cilium-cni-network-fabric.md create mode 100644 decisions/0044-dns-and-external-access.md create mode 100644 decisions/0045-tls-certificate-strategy.md create mode 100644 decisions/0046-companions-frontend-architecture.md create mode 100644 decisions/0047-mlflow-experiment-tracking.md create mode 100644 decisions/0048-entertainment-media-stack.md create mode 100644 decisions/0049-self-hosted-productivity-suite.md create 
mode 100644 decisions/0050-argo-rollouts-progressive-delivery.md create mode 100644 decisions/0051-keda-event-driven-autoscaling.md create mode 100644 decisions/0052-cluster-utilities-optimization.md create mode 100644 decisions/0053-vaultwarden-password-management.md diff --git a/README.md b/README.md index 7ab5278..29b54de 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ [![License](https://img.shields.io/badge/License-MIT-green)](LICENSE) -![ADR Count](https://img.shields.io/badge/ADRs-42_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-41-brightgreen) ![Proposed](https://img.shields.io/badge/proposed-0-yellow) +![ADR Count](https://img.shields.io/badge/ADRs-53_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-53-brightgreen) ## ๐Ÿ“– Quick Navigation @@ -133,6 +133,17 @@ homelab-design/ | 0040 | [OPA Gatekeeper Policy Framework](decisions/0040-opa-gatekeeper-policy-framework.md) | โœ… accepted | 2026-02-09 | | 0041 | [Falco Runtime Threat Detection](decisions/0041-falco-runtime-threat-detection.md) | โœ… accepted | 2026-02-09 | | 0042 | [Trivy Operator Vulnerability Scanning](decisions/0042-trivy-operator-vulnerability-scanning.md) | โœ… accepted | 2026-02-09 | +| 0043 | [Cilium CNI and Network Fabric](decisions/0043-cilium-cni-network-fabric.md) | โœ… accepted | 2026-02-09 | +| 0044 | [DNS and External Access Architecture](decisions/0044-dns-and-external-access.md) | โœ… accepted | 2026-02-09 | +| 0045 | [TLS Certificate Strategy](decisions/0045-tls-certificate-strategy.md) | โœ… accepted | 2026-02-09 | +| 0046 | [Companions Frontend Architecture](decisions/0046-companions-frontend-architecture.md) | โœ… accepted | 2026-02-09 | +| 0047 | [MLflow Experiment Tracking and Model Registry](decisions/0047-mlflow-experiment-tracking.md) | โœ… accepted | 2026-02-09 | +| 0048 | [Entertainment and Media Stack](decisions/0048-entertainment-media-stack.md) | โœ… accepted | 2026-02-09 | +| 0049 | [Self-Hosted 
Productivity Suite](decisions/0049-self-hosted-productivity-suite.md) | โœ… accepted | 2026-02-09 | +| 0050 | [Argo Rollouts Progressive Delivery](decisions/0050-argo-rollouts-progressive-delivery.md) | โœ… accepted | 2026-02-09 | +| 0051 | [KEDA Event-Driven Autoscaling](decisions/0051-keda-event-driven-autoscaling.md) | โœ… accepted | 2026-02-09 | +| 0052 | [Cluster Utilities and Optimization](decisions/0052-cluster-utilities-optimization.md) | โœ… accepted | 2026-02-09 | +| 0053 | [Vaultwarden Password Management](decisions/0053-vaultwarden-password-management.md) | โœ… accepted | 2026-02-09 | ## ๐Ÿ”— Related Repositories diff --git a/decisions/0043-cilium-cni-network-fabric.md b/decisions/0043-cilium-cni-network-fabric.md new file mode 100644 index 0000000..6b8676b --- /dev/null +++ b/decisions/0043-cilium-cni-network-fabric.md @@ -0,0 +1,118 @@ +# Cilium CNI and Network Fabric + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Select and configure the Container Network Interface (CNI) plugin for pod networking, load balancing, and service mesh capabilities on Talos Linux + +## Context and Problem Statement + +A Kubernetes cluster requires a CNI plugin to provide pod-to-pod networking, service load balancing, and network policy enforcement. The homelab runs on Talos Linux (immutable OS) with heterogeneous hardware (amd64 + arm64) and needs L2 LoadBalancer IP advertisement for bare-metal services. + +How do we provide reliable, performant networking with bare-metal LoadBalancer support and observability integration? + +## Decision Drivers + +* Must work on Talos Linux (eBPF-capable, no iptables preference) +* Bare-metal LoadBalancer IP assignment (no cloud provider) +* L2 IP advertisement for LAN services +* kube-proxy replacement for performance +* Future-proof for network policy enforcement +* Active community and CNCF backing + +## Considered Options + +1. **Cilium** โ€” eBPF-based CNI with kube-proxy replacement +2. 
**Calico** โ€” Established CNI with eBPF and BGP support +3. **Flannel + MetalLB** โ€” Simple overlay with separate LB +4. **Antrea** โ€” VMware-backed OVS-based CNI + +## Decision Outcome + +Chosen option: **Cilium**, because it provides an eBPF-native dataplane that replaces kube-proxy, L2 LoadBalancer announcements (eliminating MetalLB), and native Talos support. + +### Positive Consequences + +* Single component handles CNI + kube-proxy + LoadBalancer (replaces 3 tools) +* eBPF dataplane is more efficient than iptables on large clusters +* L2 announcements provide bare-metal LoadBalancer without MetalLB +* Maglev + DSR load balancing for consistent hashing and reduced latency +* Strong Talos Linux integration and testing +* Prometheus metrics and Grafana dashboards included + +### Negative Consequences + +* More complex configuration than simple CNIs +* eBPF requires compatible kernel (Talos provides this) +* `hostLegacyRouting: true` workaround needed for Talos issue #10002 + +## Deployment Configuration + +| | | +|---|---| +| **Chart** | `cilium` from `oci://ghcr.io/home-operations/charts-mirror/cilium` | +| **Version** | 1.18.6 | +| **Namespace** | `kube-system` | + +### Core Networking + +| Setting | Value | Rationale | +|---------|-------|-----------| +| `kubeProxyReplacement` | `true` | Replace kube-proxy entirely โ€” lower latency, fewer components | +| `routingMode` | `native` | Direct routing, no encapsulation overhead | +| `autoDirectNodeRoutes` | `true` | Auto-configure inter-node routes | +| `ipv4NativeRoutingCIDR` | `10.42.0.0/16` | Pod CIDR for native routing | +| `ipam.mode` | `kubernetes` | Use Kubernetes IPAM | +| `endpointRoutes.enabled` | `true` | Per-endpoint routing for better granularity | +| `bpf.masquerade` | `true` | eBPF-based masquerading | +| `bpf.hostLegacyRouting` | `true` | Workaround for Talos issue #10002 | + +### Load Balancing + +| Setting | Value | Rationale | +|---------|-------|-----------| +| `loadBalancer.algorithm` | 
`maglev` | Consistent hashing for stable backend selection | +| `loadBalancer.mode` | `dsr` | Direct Server Return โ€” response bypasses LB for lower latency | +| `socketLB.enabled` | `true` | Socket-level load balancing for host-namespace pods | + +### L2 Announcements + +Cilium replaces MetalLB for bare-metal LoadBalancer IP assignment: + +``` +CiliumLoadBalancerIPPool: 192.168.100.0/24 +CiliumL2AnnouncementPolicy: announces on all Linux nodes +``` + +Key VIPs assigned from this pool: +- `192.168.100.200` โ€” k8s-gateway (internal DNS) +- `192.168.100.201` โ€” envoy-internal gateway +- `192.168.100.210` โ€” envoy-external gateway + +### Multi-Network Support + +| Setting | Value | Rationale | +|---------|-------|-----------| +| `cni.exclusive` | `false` | Paired with Multus CNI for multi-network pods | + +This enables workloads like qbittorrent to use secondary network interfaces (e.g., VPN). + +### Disabled Features + +| Feature | Reason | +|---------|--------| +| Hubble | Not needed โ€” tracing handled by OpenTelemetry stack | +| Gateway API | Offloaded to dedicated Envoy Gateway deployment | +| Envoy (built-in) | Using separate Envoy Gateway for more control | + +## Observability + +- **Prometheus:** ServiceMonitor enabled for both agent and operator +- **Grafana:** Two dashboards via `GrafanaDashboard` CRDs (Cilium agent + operator) + +## Links + +* Related to [ADR-0044](0044-dns-and-external-access.md) (L2 IPs feed gateway VIPs) +* Related to [ADR-0002](0002-use-talos-linux.md) (Talos eBPF compatibility) +* [Cilium Documentation](https://docs.cilium.io/) +* [Talos Cilium Guide](https://www.talos.dev/latest/kubernetes-guides/network/deploying-cilium/) diff --git a/decisions/0044-dns-and-external-access.md b/decisions/0044-dns-and-external-access.md new file mode 100644 index 0000000..e9457ce --- /dev/null +++ b/decisions/0044-dns-and-external-access.md @@ -0,0 +1,129 @@ +# DNS and External Access Architecture + +* Status: accepted +* Date: 2026-02-09 +* 
Deciders: Billy +* Technical Story: Design a multi-layer DNS and ingress architecture providing both public and private access to cluster services + +## Context and Problem Statement + +A homelab behind a residential network needs to expose some services publicly (productivity apps, status pages) while keeping others private (admin UIs, AI inference). DNS resolution must work correctly for both external users and LAN clients, avoiding hairpin NAT issues. + +How do we provide split-horizon DNS, secure external access, and automated DNS management across public and private domains? + +## Decision Drivers + +* No public IP dependency โ€” residential NAT, dynamic IP +* Split-horizon DNS โ€” same domain resolves differently inside vs outside +* Automated DNS record management from Kubernetes resources +* External access without opening router ports +* LAN clients should resolve directly to cluster IPs (no hairpin NAT) +* Separate gateways for public vs internal services + +## Decision Outcome + +A four-component DNS architecture with Cloudflare Tunnel for external access: + +1. **Cloudflare Tunnel** โ€” encrypted tunnel for external traffic (no open ports) +2. **Cloudflare DNS (external-dns)** โ€” syncs public DNS records from HTTPRoutes +3. **k8s-gateway** โ€” internal DNS server for split-horizon resolution +4. **UniFi DNS (external-dns)** โ€” syncs internal DNS records to home network DNS + +## Architecture + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ EXTERNAL ACCESS โ”‚ +โ”‚ โ”‚ +โ”‚ Internet โ†’ Cloudflare CDN/Proxy โ”‚ +โ”‚ โ†“ โ”‚ +โ”‚ Cloudflare Tunnel (QUIC + post-quantum encryption) โ”‚ +โ”‚ โ†“ โ”‚ +โ”‚ envoy-external (192.168.100.210) โ”‚ +โ”‚ *.daviestechlabs.io (Let's Encrypt wildcard) โ”‚ +โ”‚ Services: Affine, Immich, Nextcloud, ntfy, Gatus, etc. 
โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ INTERNAL ACCESS โ”‚ +โ”‚ โ”‚ +โ”‚ LAN client โ†’ UniFi DNS (home router) โ”‚ +โ”‚ โ†“ (*.lab.daviestechlabs.io โ†’ k8s-gateway VIP) โ”‚ +โ”‚ k8s-gateway (192.168.100.200, port 53) โ”‚ +โ”‚ โ†“ (resolves from HTTPRoutes/Services) โ”‚ +โ”‚ envoy-internal (192.168.100.201) โ”‚ +โ”‚ *.lab.daviestechlabs.io (self-signed + LE certs) โ”‚ +โ”‚ Services: Grafana, Prometheus, MLflow, Companions, etc. โ”‚ +โ”‚ OIDC auth via Authentik โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ DNS AUTOMATION โ”‚ +โ”‚ โ”‚ +โ”‚ HTTPRoute created โ†’ external-dns (Cloudflare) syncs public records โ”‚ +โ”‚ โ†’ external-dns (UniFi) syncs LAN records โ”‚ +โ”‚ โ”‚ +โ”‚ Split-horizon: external client resolves to Cloudflare proxy IP โ”‚ +โ”‚ LAN client resolves directly to cluster VIP โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Component Details + +### Cloudflare Tunnel + +| | | +|---|---| +| **Image** | `cloudflare/cloudflared:2026.1.2` | +| **Transport** | QUIC with post-quantum encryption | + 
+Provides secure ingress without opening any router ports. All external traffic enters via Cloudflare's network and is tunneled to `envoy-external` over an encrypted QUIC connection. + +- Ingress: `*.daviestechlabs.io` โ†’ `https://envoy-external.network.svc.cluster.local:443` +- HTTP/2 origin, TLS verified against origin server name +- DNSEndpoint CNAME: `external.daviestechlabs.io` โ†’ `.cfargotunnel.com` +- Resources: 10m CPU, 256Mi memory limit +- Security: non-root (UID 65534), read-only rootfs, capabilities dropped + +### Cloudflare DNS (external-dns) + +| | | +|---|---| +| **Chart** | `external-dns` v1.20.0 | +| **Provider** | Cloudflare | + +Watches `gateway-httproute` and `DNSEndpoint` CRDs, syncs to Cloudflare DNS. Records are Cloudflare-proxied (orange cloud) for DDoS protection. TXT prefix `k8s.` for ownership tracking. Sync policy: full lifecycle management. + +### k8s-gateway (Internal DNS) + +| | | +|---|---| +| **Chart** | `k8s-gateway` v3.4.1 | +| **VIP** | `192.168.100.200` (port 53) | + +CoreDNS-based DNS server that resolves cluster service domains by watching HTTPRoute and Service resources. Provides split-horizon: LAN clients query this server (via UniFi DNS forwarding) and get direct cluster IPs instead of Cloudflare proxy IPs. + +### UniFi DNS (external-dns) + +| | | +|---|---| +| **Chart** | `external-dns` v1.20.0 | +| **Webhook** | `ghcr.io/kashalls/external-dns-unifi-webhook:v0.8.1` | + +Syncs `*.lab.daviestechlabs.io` records to the UniFi controller's DNS server at `192.168.100.254`. This means LAN devices automatically resolve internal services without manual DNS entries. API key from Vault via ExternalSecret. 
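+The two external-dns instances differ mainly in provider and domain filter. A hedged Helm-values sketch of the UniFi instance — the secret name and env wiring are illustrative assumptions; the key layout follows the upstream external-dns chart's webhook provider support:
+
+```yaml
+# Illustrative values overlay for the UniFi external-dns instance.
+# Secret name and env variable wiring are assumptions.
+provider:
+  name: webhook
+  webhook:
+    image:
+      repository: ghcr.io/kashalls/external-dns-unifi-webhook
+      tag: v0.8.1
+    env:
+      - name: UNIFI_HOST
+        value: https://192.168.100.254
+      - name: UNIFI_API_KEY
+        valueFrom:
+          secretKeyRef:
+            name: external-dns-unifi-secret   # synced from Vault via ExternalSecret
+            key: api-key
+domainFilters:
+  - lab.daviestechlabs.io   # internal records only; the Cloudflare instance owns the public zone
+sources:
+  - gateway-httproute
+  - service
+policy: sync                # full lifecycle: create, update, and delete records
+txtPrefix: k8s.             # ownership marker, matching the Cloudflare instance
+```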
+ +## Domain Strategy + +| Domain | Gateway | Access | DNS Provider | TLS | +|--------|---------|--------|-------------|-----| +| `*.daviestechlabs.io` | envoy-external | Public (Cloudflare Tunnel) | Cloudflare | Let's Encrypt wildcard | +| `*.lab.daviestechlabs.io` | envoy-internal | LAN only | UniFi DNS | Self-signed + LE | + +## Links + +* Related to [ADR-0010](0010-use-envoy-gateway.md) (Envoy Gateway configuration) +* Related to [ADR-0043](0043-cilium-cni-network-fabric.md) (L2 VIP assignment) +* Related to [ADR-0045](0045-tls-certificate-strategy.md) (certificate issuance) +* Related to [ADR-0028](0028-authentik-sso-strategy.md) (OIDC on internal gateway) +* [Cloudflare Tunnel Docs](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) +* [k8s-gateway](https://github.com/k8s-gateway/k8s-gateway) diff --git a/decisions/0045-tls-certificate-strategy.md b/decisions/0045-tls-certificate-strategy.md new file mode 100644 index 0000000..a1b0540 --- /dev/null +++ b/decisions/0045-tls-certificate-strategy.md @@ -0,0 +1,77 @@ +# TLS Certificate Strategy + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Automate TLS certificate provisioning for both public and internal services + +## Context and Problem Statement + +Every HTTPS service in the cluster needs a valid TLS certificate. Public services need trusted certificates (Let's Encrypt), while internal services can use self-signed certificates. Manual certificate management doesn't scale across 30+ services. + +How do we automate certificate issuance and renewal for both public and internal domains? 
+ +## Decision Drivers + +* Fully automated certificate lifecycle (issuance, renewal, rotation) +* Wildcard certificates to avoid per-service certificate sprawl +* DNS-01 challenge for wildcard support (HTTP-01 can't do wildcards) +* Internal services need certificates too (browser warnings are unacceptable) +* Zero downtime during renewal + +## Decision Outcome + +Deploy **cert-manager** with two ClusterIssuers: Let's Encrypt (DNS-01 via Cloudflare) for public domains, and a self-signed issuer for internal domains. + +## Deployment Configuration + +| | | +|---|---| +| **Chart** | `cert-manager` from `oci://quay.io/jetstack/charts/cert-manager` | +| **Version** | v1.19.3 | +| **Namespace** | `cert-manager` | +| **Replicas** | 1 | + +## Certificate Issuers + +### letsencrypt-production (Public) + +| | | +|---|---| +| **Type** | ACME (Let's Encrypt) | +| **Challenge** | DNS-01 via Cloudflare API | +| **Nameservers** | `1.1.1.1:443`, `1.0.0.1:443` (DNS-over-HTTPS) | +| **Zone** | `daviestechlabs.io` | + +Uses a Cloudflare API token (SOPS-encrypted) to create DNS-01 challenge TXT records. Recursive nameservers configured to use Cloudflare DoH for faster propagation checks. + +### selfsigned-internal (Private) + +| | | +|---|---| +| **Type** | Self-Signed | +| **Use** | `*.lab.daviestechlabs.io` internal services | + +Used for internal services where browser trust isn't critical (admin UIs accessed by the operator). + +## Certificates + +| Domain | Issuer | Type | Duration | Renewal | +|--------|--------|------|----------|---------| +| `daviestechlabs.io` + `*.daviestechlabs.io` | letsencrypt-production | Wildcard | 90 days (LE default) | Auto | +| `lab.daviestechlabs.io` + `*.lab.daviestechlabs.io` | selfsigned-internal | Wildcard | 1 year | 30 days before expiry | + +Wildcard certificates are used to avoid creating individual certificates per service. Both certificates are referenced by the Envoy Gateway listeners. 
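+The issuance flow above reduces to two cert-manager resources. An illustrative sketch — resource names and the referenced secret names are assumptions; the fields are standard `cert-manager.io/v1`:
+
+```yaml
+# Hedged sketch of the public issuer and wildcard certificate.
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-production
+spec:
+  acme:
+    server: https://acme-v02.api.letsencrypt.org/directory
+    privateKeySecretRef:
+      name: letsencrypt-production-account-key
+    solvers:
+      - dns01:
+          cloudflare:
+            apiTokenSecretRef:
+              name: cloudflare-api-token   # SOPS-encrypted Secret
+              key: api-token
+        selector:
+          dnsZones:
+            - daviestechlabs.io
+---
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: daviestechlabs-io-wildcard
+  namespace: cert-manager
+spec:
+  secretName: daviestechlabs-io-tls   # referenced by Gateway listeners
+  issuerRef:
+    name: letsencrypt-production
+    kind: ClusterIssuer
+  dnsNames:
+    - daviestechlabs.io
+    - "*.daviestechlabs.io"
+```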
+ +## Integration Points + +- **Cloudflare:** API token for DNS-01 challenges (stored as SOPS-encrypted Secret) +- **Envoy Gateway:** References certificates in Gateway listener TLS configuration +- **Flux:** Health check validates ClusterIssuer readiness before dependent resources +- **Prometheus:** ServiceMonitor enabled for cert-manager metrics + +## Links + +* Related to [ADR-0044](0044-dns-and-external-access.md) (DNS architecture) +* Related to [ADR-0010](0010-use-envoy-gateway.md) (Gateway TLS listeners) +* [cert-manager Documentation](https://cert-manager.io/docs/) diff --git a/decisions/0046-companions-frontend-architecture.md b/decisions/0046-companions-frontend-architecture.md new file mode 100644 index 0000000..a4530e9 --- /dev/null +++ b/decisions/0046-companions-frontend-architecture.md @@ -0,0 +1,137 @@ +# Companions Frontend Architecture + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Design the primary user interface for the AI/ML platform, supporting real-time chat, voice, and 3D avatar interactions + +## Context and Problem Statement + +The homelab AI platform needs a web interface for users to interact with chat (RAG + LLM), voice (STT โ†’ LLM โ†’ TTS), and embedding services. The interface must support real-time streaming responses, WebSocket connections for NATS message bus integration, and an engaging visual experience. + +How do we build a performant, maintainable frontend that integrates with the NATS-based backend without a heavy JavaScript framework build step? + +## Decision Drivers + +* Real-time streaming for chat and voice (WebSocket required) +* Direct integration with NATS JetStream (binary MessagePack protocol) +* Minimal client-side JavaScript (~20KB gzipped target) +* No frontend build step (no webpack/vite/node required) +* 3D avatar rendering for immersive experience +* OAuth integration with multiple providers +* Single binary deployment (Go) + +## Considered Options + +1. 
**Go + HTMX + Alpine.js + Three.js** โ€” Server-rendered with minimal JS +2. **Next.js / React SPA** โ€” Full JavaScript framework +3. **SvelteKit** โ€” Compiled JS framework +4. **Go + Templ + raw WebSocket** โ€” Pure Go templates, no JS framework + +## Decision Outcome + +Chosen option: **Option 1 - Go + HTMX + Alpine.js + Three.js**, because it provides a zero-build-step frontend with server-rendered HTML, minimal JavaScript, and rich 3D avatar support, all served from a single Go binary. + +### Positive Consequences + +* Single binary deployment โ€” Go server serves everything +* ~20KB gzipped total JS payload (CDN-served HTMX + Alpine + Three.js) +* No npm, no webpack, no build step โ€” assets served directly +* Server-side rendering via Go templates +* WebSocket handled natively in Go (gorilla/websocket) +* NATS integration with MessagePack in the same binary +* Distroless container image for minimal attack surface + +### Negative Consequences + +* Three.js adds complexity for 3D avatar rendering +* HTMX pattern less familiar to developers expecting React/Vue +* Limited client-side state management (by design) + +## Technology Stack + +| Layer | Technology | Purpose | +|-------|-----------|---------| +| Server | Go 1.25 | HTTP server, WebSocket, NATS client, OAuth | +| Templates | Go `html/template` | Server-side HTML rendering | +| Interactivity | HTMX 2.0 | AJAX, WebSocket, server-sent events | +| Client state | Alpine.js 3 | Lightweight reactive UI for local state | +| 3D Avatars | Three.js + VRM | 3D character rendering with lip-sync | +| Styling | Tailwind CSS 4 + DaisyUI | Utility-first CSS with component library | +| Messaging | NATS JetStream | Real-time pub/sub with MessagePack encoding | +| Auth | golang-jwt/jwt/v5 | JWT token handling for OAuth flows | +| Database | PostgreSQL (lib/pq) + SQLite | Persistent + local session storage | +| Observability | OpenTelemetry SDK | Traces, metrics via OTLP gRPC | + +## Architecture + +``` 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Browser โ”‚ +โ”‚ โ”‚ +โ”‚ HTMX (server-rendered HTML) โ†โ†’ Go Server (WebSocket) โ”‚ +โ”‚ Alpine.js (local UI state) โ”‚ +โ”‚ Three.js (VRM 3D avatars with lip-sync) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ HTTP/WebSocket + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Go Server (single binary) โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Routes โ”‚ โ”‚ OAuth โ”‚ โ”‚WebSocket โ”‚ โ”‚ OTEL โ”‚ โ”‚ +โ”‚ โ”‚ (HTTP) โ”‚ โ”‚ Handlers โ”‚ โ”‚ Hub โ”‚ โ”‚ Tracing โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ NATS Client โ”‚ โ”‚ +โ”‚ โ”‚ (JetStream + โ”‚ โ”‚ +โ”‚ โ”‚ MessagePack) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ–ผ โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ NATS JetStream โ”‚ โ”‚ Ray Serve โ”‚ +โ”‚ ai.chat.* โ”‚ โ”‚ (STT, TTS, LLM, โ”‚ +โ”‚ ai.voice.* โ”‚ โ”‚ Embeddings) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Key Features + +| Feature | Implementation | +|---------|---------------| +| Real-time chat | WebSocket โ†’ NATS pub/sub per-user channels | +| Voice assistant | Streaming STT โ†’ LLM โ†’ TTS via Ray Serve endpoints | +| 3D avatars | VRM models rendered in Three.js with audio-driven lip-sync | +| OAuth login | Google, Discord, GitHub, Twitch + Authentik OIDC | +| RAG search | Milvus vector search for premium users | +| Session state | PostgreSQL (CNPG) for persistent data, SQLite for local cache | + +## Kubernetes Deployment + +| | | +|---|---| +| **Namespace** | `ai-ml` | +| **Replicas** | 1 | +| **Image** | `ghcr.io/billy-davies-2/companions-frontend` (distroless) | +| **Resources** | 50m/128Mi request โ†’ 500m/512Mi limit | + +**OTEL sidecar:** `otel/opentelemetry-collector-contrib:0.145.0` exports traces to ClickStack. + +**Backend routing:** All AI inference requests (STT, TTS, LLM, embeddings, reranking) route to Ray Serve at `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000`. Auxiliary HTTPRoutes in the `auxiliary` kustomization provide direct model endpoint access at `embeddings.lab`, `whisper.lab`, `tts.lab`, `llm.lab`, `reranker.lab`. + +**Access:** `companions-chat.lab.daviestechlabs.io` via envoy-internal with Authentik OIDC proxy auth. 
+ +## Links + +* Related to [ADR-0003](0003-use-nats-for-messaging.md) (NATS messaging) +* Related to [ADR-0004](0004-use-messagepack-for-nats.md) (MessagePack encoding) +* Related to [ADR-0011](0011-kuberay-unified-gpu-backend.md) (Ray Serve backend) +* Related to [ADR-0028](0028-authentik-sso-strategy.md) (OAuth/OIDC) +* [HTMX Documentation](https://htmx.org/docs/) +* [VRM Specification](https://vrm.dev/en/) diff --git a/decisions/0047-mlflow-experiment-tracking.md b/decisions/0047-mlflow-experiment-tracking.md new file mode 100644 index 0000000..466330f --- /dev/null +++ b/decisions/0047-mlflow-experiment-tracking.md @@ -0,0 +1,114 @@ +# MLflow Experiment Tracking and Model Registry + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Provide centralized experiment tracking, model versioning, and artifact storage for ML workflows + +## Context and Problem Statement + +ML training pipelines (Kubeflow, Argo) produce metrics, parameters, and model artifacts that must be tracked across experiments. Without centralized tracking, comparing model performance, reproducing results, and managing model versions becomes ad hoc and error-prone. + +How do we provide experiment tracking and model registry that integrates with both Kubeflow Pipelines and Argo Workflows? + +## Decision Drivers + +* Track metrics, parameters, and artifacts across all training runs +* Compare experiments to select best models +* Version models with metadata for deployment decisions +* Integrate with both Kubeflow and Argo workflow engines +* Python-native API (all ML code is Python) +* Self-hosted with no external dependencies + +## Considered Options + +1. **MLflow** โ€” Open-source experiment tracking and model registry +2. **Weights & Biases (W&B)** โ€” SaaS experiment tracking +3. **Neptune.ai** โ€” SaaS ML metadata store +4. 
**Kubeflow Metadata** โ€” Built-in Kubeflow tracking + +## Decision Outcome + +Chosen option: **MLflow**, because it's open-source, self-hostable, has a mature Python SDK, and provides both experiment tracking and model registry in a single tool. + +### Positive Consequences + +* Self-hosted โ€” no SaaS costs or external dependencies +* Python SDK integrates naturally with training code +* Model registry provides versioning with stage transitions +* REST API enables integration from any workflow engine +* Artifact storage on NFS provides shared access across pods + +### Negative Consequences + +* Another service to maintain (server + database + artifact storage) +* Concurrent access to SQLite/file artifacts can be tricky (mitigated by PostgreSQL backend) +* UI is functional but not as polished as commercial alternatives + +## Deployment Configuration + +| | | +|---|---| +| **Chart** | `mlflow` from `https://community-charts.github.io/helm-charts` | +| **Namespace** | `mlflow` | +| **Server** | uvicorn (gunicorn disabled) | +| **Resources** | 200m/512Mi request โ†’ 1 CPU/2Gi limit | +| **Strategy** | Recreate | + +### Backend Store + +PostgreSQL via **CloudNativePG**: +- 1 instance, `amd64` node affinity +- 10Gi Longhorn storage, `max_connections: 200` +- Credentials from Vault via ExternalSecret + +### Artifact Store + +- 50Gi NFS PVC (`nfs-slow` StorageClass, ReadWriteMany) +- Mounted at `/mlflow/artifacts` +- Proxied artifact storage (clients access via MLflow server, not directly) + +NFS provides ReadWriteMany access so multiple training pods can write artifacts concurrently. 
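+The artifact store described above amounts to a single ReadWriteMany claim. An illustrative sketch — the PVC name is an assumption; StorageClass and size come from the list above:
+
+```yaml
+# Sketch of the shared artifact PVC.
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: mlflow-artifacts
+  namespace: mlflow
+spec:
+  accessModes:
+    - ReadWriteMany          # NFS allows concurrent writers across nodes
+  storageClassName: nfs-slow
+  resources:
+    requests:
+      storage: 50Gi
+```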
+ +## MLflow Utils Library + +The `mlflow/` repository contains `mlflow_utils`, a Python package that wraps the MLflow API for homelab-specific patterns: + +| Module | Purpose | +|--------|---------| +| `client.py` | MLflow client wrapper with homelab defaults | +| `tracker.py` | Experiment tracking with auto-logging | +| `inference_tracker.py` | Async, batched inference metrics logging | +| `model_registry.py` | Model versioning with KServe metadata | +| `kfp_components.py` | Kubeflow Pipeline components for MLflow | +| `experiment_comparison.py` | Compare runs across experiments | +| `cli.py` | CLI for common operations | + +This library is used by `handler-base`, Kubeflow pipelines, and Argo training workflows to provide consistent MLflow integration across the platform. + +## Integration Points + +``` +Kubeflow Pipelines โ”€โ”€โ†’ mlflow_utils.kfp_components โ”€โ”€โ†’ MLflow Server + โ”‚ +Argo Workflows โ”€โ”€โ†’ mlflow_utils.tracker โ”€โ”€โ†’โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค + โ”‚ +handler-base โ”€โ”€โ†’ mlflow_utils.inference_tracker โ”€โ”€โ†’โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค + โ–ผ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ PostgreSQL โ”‚ + โ”‚ (metadata) โ”‚ + โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค + โ”‚ NFS โ”‚ + โ”‚ (artifacts) โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +**Access:** `mlflow.lab.daviestechlabs.io` via envoy-internal gateway. 
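+Training and inference pods discover the server through MLflow's standard single-environment-variable convention. A hedged pod-spec fragment — the in-cluster service DNS name is an assumption; 5000 is MLflow's default port:
+
+```yaml
+# Illustrative container env pointing a workload at the tracking server.
+env:
+  - name: MLFLOW_TRACKING_URI
+    value: http://mlflow.mlflow.svc.cluster.local:5000
+```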
+ +## Links + +* Related to [ADR-0009](0009-dual-workflow-engines.md) (Argo + Kubeflow workflows) +* Related to [ADR-0027](0027-database-strategy.md) (CNPG PostgreSQL) +* Related to [ADR-0026](0026-storage-strategy.md) (NFS artifact storage) +* [MLflow Documentation](https://mlflow.org/docs/latest/) diff --git a/decisions/0048-entertainment-media-stack.md b/decisions/0048-entertainment-media-stack.md new file mode 100644 index 0000000..c9c0e14 --- /dev/null +++ b/decisions/0048-entertainment-media-stack.md @@ -0,0 +1,99 @@ +# Entertainment and Media Stack + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Deploy a self-hosted media automation and streaming platform for the homelab + +## Context and Problem Statement + +Self-hosting media services provides control over content libraries, avoids subscription costs for multiple streaming services, and integrates with the homelab's storage and networking infrastructure. + +How do we deploy a complete media management pipeline โ€” from content acquisition to streaming โ€” with minimal manual intervention? + +## Decision Drivers + +* Automated content discovery, download, and organization +* Single media library shared across all services +* Internal-only access (no public exposure) +* Minimal resource footprint per service +* Integration with NFS storage for large media libraries + +## Decision Outcome + +Deploy the *arr stack (Sonarr, Radarr, Prowlarr, Bazarr) for automated media management plus Jellyfin for streaming, all sharing a single NFS-backed media volume. 
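+The shared-library pattern boils down to every service mounting the same ReadWriteMany claim at `/media`. An illustrative pod-spec fragment — the PVC name is an assumption:
+
+```yaml
+# Sketch of the shared media volume mount (same claim in every *arr
+# pod and Jellyfin, so imports and streaming see one library).
+volumes:
+  - name: media
+    persistentVolumeClaim:
+      claimName: media-library   # assumed PVC name
+containers:
+  - name: sonarr
+    volumeMounts:
+      - name: media
+        mountPath: /media
+```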
+ +## Architecture + +``` +┌──────────────────────────────────────────────────────────────┐ +│                    DISCOVERY & INDEXING                      │ +│                                                              │ +│  Prowlarr (indexer manager)                                  │ +│        ↓ syncs indexers to                                   │ +│  Sonarr (TV) + Radarr (Movies)                               │ +│        ↓ sends requests to                                   │ +│  qbittorrent (download client)  [suspended — pending VPN]    │ +│        ↓ downloads to                                        │ +│  /media (shared NFS volume)                                  │ +│        ↓ organizes into                                      │ +│  Sonarr/Radarr import + rename                               │ +│        ↓ subtitle lookup via                                 │ +│  Bazarr (subtitle manager)                                   │ +└──────────────────────────────────────────────────────────────┘ + +┌──────────────────────────────────────────────────────────────┐ +│                         STREAMING                            │ +│                                                              │ +│  Jellyfin (media server)                                     │ +│    ← reads from /media (shared NFS volume)                   │ +│    ← transcodes to /config/transcodes (emptyDir)             │ +│    → streams to LAN clients (web, apps, DLNA)                │ +└──────────────────────────────────────────────────────────────┘ +``` + +## Components + +| Component | Image | Purpose | Resources (req → limit) | +|-----------|-------|---------|------------------------| +| Jellyfin | `jellyfin/jellyfin:10.11.6` | Media server & streaming | 100m/1Gi → —/4Gi | +| Sonarr | `ghcr.io/onedr0p/sonarr:4.0.14.2938` | TV series management | 50m/256Mi → —/1Gi | +| Radarr | `ghcr.io/onedr0p/radarr:5.20.1.9773` | Movie management | 50m/256Mi → —/1Gi | +| Prowlarr | `ghcr.io/onedr0p/prowlarr:1.32.2.4987` | Indexer aggregation | 50m/128Mi → —/512Mi |
+| Bazarr | `ghcr.io/onedr0p/bazarr:v1.5.1` | Subtitle management | 50m/256Mi → —/1Gi | +| qbittorrent | `ghcr.io/onedr0p/qbittorrent:5.0.4` | Download client | *Suspended* | + +All deployed via bjw-s `app-template` v4.6.2 in the `entertainment` namespace. + +## Storage Architecture + +| Volume | StorageClass | Size | Access | Mounted By | +|--------|-------------|------|--------|------------| +| `jellyfin-media` | `nfs-slow` | 500Gi | RWX | Jellyfin, Sonarr, Radarr, Bazarr | +| Config PVCs | `longhorn` | 2-10Gi | RWO | One per app | +| Transcodes | `emptyDir` | — | — | Jellyfin only | + +The shared `jellyfin-media` NFS volume is the key integration point — the *arr apps write organized media files, and Jellyfin reads and streams them. + +## Network Access + +All services are internal-only via the `envoy-internal` gateway at `*.lab.daviestechlabs.io`: + +| Service | URL | +|---------|-----| +| Jellyfin | `jellyfin.lab.daviestechlabs.io` | +| Sonarr | `sonarr.lab.daviestechlabs.io` | +| Radarr | `radarr.lab.daviestechlabs.io` | +| Prowlarr | `prowlarr.lab.daviestechlabs.io` | +| Bazarr | `bazarr.lab.daviestechlabs.io` | + +qbittorrent has a dedicated Cilium LoadBalancer VIP (`10.0.0.210`) for BitTorrent traffic, separate from the HTTP gateway. + +## Security + +All *arr apps run as non-root (UID 568) with a read-only root filesystem and all Linux capabilities dropped. qbittorrent is suspended pending VPN/Multus integration to ensure download traffic is routed through a VPN tunnel.
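The shared media volume above is a single ReadWriteMany claim on the NFS class. A minimal sketch of the PVC — the name, namespace, class, and size come from the tables above; the manifest shape itself is illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin-media
  namespace: entertainment
spec:
  accessModes:
    - ReadWriteMany          # mounted concurrently by Jellyfin, Sonarr, Radarr, Bazarr
  storageClassName: nfs-slow
  resources:
    requests:
      storage: 500Gi
```

Because the class is NFS-backed, every pod mounting the claim sees the same files, which is what lets the *arr apps write while Jellyfin reads.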
+ +## Links + +* Related to [ADR-0026](0026-storage-strategy.md) (NFS storage for media) +* Related to [ADR-0043](0043-cilium-cni-network-fabric.md) (Multus for qbittorrent VPN) diff --git a/decisions/0049-self-hosted-productivity-suite.md b/decisions/0049-self-hosted-productivity-suite.md new file mode 100644 index 0000000..612f7c5 --- /dev/null +++ b/decisions/0049-self-hosted-productivity-suite.md @@ -0,0 +1,122 @@ +# Self-Hosted Productivity Suite + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Select and deploy self-hosted alternatives to commercial cloud productivity services + +## Context and Problem Statement + +Commercial cloud services (Google Workspace, iCloud, Notion) centralize personal data with third parties and incur ongoing subscription costs. A homelab with sufficient compute and storage can host equivalent services with full data ownership. + +Which self-hosted applications best replace commercial productivity services, and how should they share infrastructure? + +## Decision Drivers + +* Data sovereignty — all personal data stays on-premises +* Feature parity with commercial alternatives where possible +* SSO integration via Authentik for unified login +* Shared infrastructure (database, cache, storage) to reduce overhead +* Public access via Cloudflare Tunnel for mobile/remote use + +## Decision Outcome + +Deploy five productivity applications sharing a common infrastructure layer (CNPG PostgreSQL, Valkey cache, NFS storage), exposed publicly via Cloudflare Tunnel with Authentik SSO where supported.
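The per-application database pattern can be sketched as a CNPG `Cluster` resource. This is a hedged illustration rather than the repo's actual manifest — the resource name and `imageName` are assumptions; the single instance, 10Gi Longhorn storage, and amd64 affinity follow the pattern described in this ADR:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: affine-postgres              # illustrative name
  namespace: productivity
spec:
  instances: 1                       # single-instance cluster
  imageName: ghcr.io/tensorchord/cloudnative-vectorchord:18  # assumed VectorChord-enabled PG 18 image
  affinity:
    nodeSelector:
      kubernetes.io/arch: amd64      # pin to amd64 nodes
  storage:
    size: 10Gi
    storageClass: longhorn
```

Each app gets its own cluster of this shape, so a noisy neighbor cannot degrade another app's database.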
+ +## Components + +| Application | Replaces | Image/Chart | Database | Cache | Storage | +|-------------|----------|-------------|----------|-------|---------| +| **AFFiNE** | Notion | `ghcr.io/toeverything/affine:stable` | CNPG (VectorChord) | Valkey DB 2 | 10Gi Longhorn | +| **Immich** | Google Photos | `immich` chart v0.10.3 | CNPG (VectorChord) | Valkey DB 3 | 10Gi NFS | +| **Nextcloud** | Google Drive | `nextcloud` chart v8.8.1 | CNPG | Valkey DB 1 | 200Gi NFS | +| **Kasm** | — (unique) | `kasm` chart v1.18.1 | CNPG | Valkey | 50Gi Longhorn | +| **Kavita** | Kindle/Calibre | `ghcr.io/kareadita/kavita:latest` | Embedded | — | 30Gi NFS (3 libraries) | + +All deployed in the `productivity` namespace, exposed via `envoy-external` at `*.daviestechlabs.io`. + +## Shared Infrastructure + +### Valkey Cache (Shared Instance) + +A single Valkey instance (`valkey/valkey:9.0.2`) with per-application ACL users and database isolation: + +| User | DB Index | Application | +|------|----------|-------------| +| `nextcloud` | 1 | Nextcloud | +| `affine` | 2 | AFFiNE | +| `immich` | 3 | Immich | +| `kasm` | — | Kasm | + +Default user disabled. Per-user passwords from Vault. 20Gi Longhorn storage. + +### CloudNativePG Databases + +Each application with a relational database gets its own CNPG cluster (single instance, 10Gi Longhorn, amd64 affinity). AFFiNE and Immich use PostgreSQL 18 with the **VectorChord** extension for vector search capabilities. + +## Application Details + +### AFFiNE (Notion Alternative) + +Knowledge base and project management with real-time collaboration. + +- OIDC SSO via **Authentik** (`openid`, `profile`, `email` scopes) +- VectorChord extension enables AI-powered semantic search +- OTEL tracing to OpenTelemetry collector +- Init container runs database migration (`self-host-predeploy.js`) + +### Immich (Google Photos Alternative) + +Photo and video management with ML-powered search and face recognition.
+ +- Built-in ML sidecar for facial recognition and smart search +- VectorChord PostgreSQL extension for similarity search +- OTEL tracing enabled +- Library stored on NFS for large photo collections + +### Nextcloud (Google Drive Alternative) + +File sync, calendar, contacts, and collaboration. + +- Imaginary sidecar for image processing +- Custom reverse-proxy config for trusted proxies (RFC1918 ranges) +- CalDAV/CardDAV `.well-known` URL redirects via HTTPRoute +- PHP cron job for background tasks +- Chart pinned to v8.8.1 (v8.9.0 has timeout issues) + +### Kasm Workspaces (Browser Isolation) + +Remote browser isolation and desktop streaming. + +- Small deployment (10-15 concurrent sessions) +- WebSocket support via custom `BackendTrafficPolicy` (no request timeout, 1h idle, TCP keepalive) +- `applySecurity: false` for Talos compatibility +- Dedicated Let's Encrypt certificate for `*.kasm.lab.daviestechlabs.io` + +### Kavita (Digital Library) + +Ebook, manga, and comic reader. + +- Simplest deployment — no external database, no cache, no SSO +- Three NFS-backed content libraries: manga (10Gi), comics (10Gi), books (10Gi) +- Embedded database in config PVC + +## Network Access + +All productivity apps are publicly accessible via Cloudflare Tunnel: + +| Service | URL | +|---------|-----| +| AFFiNE | `affine.daviestechlabs.io` | +| Immich | `immich.daviestechlabs.io` | +| Nextcloud | `nextcloud.daviestechlabs.io` | +| Kasm | `kasm.daviestechlabs.io` | +| Kavita | `kavita.daviestechlabs.io` | + +## Links + +* Related to [ADR-0027](0027-database-strategy.md) (CNPG databases) +* Related to [ADR-0023](0023-valkey-ml-caching.md) (Valkey caching) +* Related to [ADR-0026](0026-storage-strategy.md) (NFS + Longhorn storage) +* Related to [ADR-0028](0028-authentik-sso-strategy.md) (SSO integration) +* Related to [ADR-0044](0044-dns-and-external-access.md) (Cloudflare Tunnel access) diff --git a/decisions/0050-argo-rollouts-progressive-delivery.md
b/decisions/0050-argo-rollouts-progressive-delivery.md new file mode 100644 index 0000000..6d55821 --- /dev/null +++ b/decisions/0050-argo-rollouts-progressive-delivery.md @@ -0,0 +1,71 @@ +# Argo Rollouts Progressive Delivery + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Enable progressive delivery (canary, blue-green) for safer deployments alongside existing Argo Workflows + +## Context and Problem Statement + +Standard Kubernetes Deployments gate rollouts only on pod readiness: once a new pod reports Ready, it receives full production traffic, with no gradual traffic shifting and no automated metric analysis. For critical services this creates risk, since a bad deployment can affect all traffic before a problem is detected. Progressive delivery allows gradual traffic shifting with automated rollback on failure. + +How do we add progressive delivery capabilities without duplicating the existing Argo Workflows infrastructure? + +## Decision Drivers + +* Reduce blast radius of bad deployments +* Automated rollback on failure metrics +* Complement (not replace) existing GitOps deployment via Flux +* Reuse Argo ecosystem already deployed for workflows +* Dashboard for deployment visibility + +## Considered Options + +1. **Argo Rollouts** — Progressive delivery controller from Argo project +2. **Flagger** — Flux-native progressive delivery +3. **Istio traffic management** — Service mesh canary routing +4. **Manual canary via Flux** — Separate canary Deployments managed by Flux + +## Decision Outcome + +Chosen option: **Argo Rollouts**, because it complements the existing Argo Workflows deployment, provides native canary and blue-green strategies, and includes a dashboard for deployment visibility.
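A canary Rollout for an inference endpoint might look like the following sketch. The workload name, namespace, image tag, and step timings are illustrative assumptions, not taken from the repo:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-handler            # illustrative name
  namespace: ai-ml                   # assumed namespace
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 25              # shift 25% of traffic to the new version
        - pause: { duration: 5m }    # hold and observe metrics
        - setWeight: 50
        - pause: { duration: 5m }    # a failed AnalysisRun here would auto-rollback
  selector:
    matchLabels:
      app: inference-handler
  template:
    metadata:
      labels:
        app: inference-handler
    spec:
      containers:
        - name: handler
          image: git.daviestechlabs.io/lab/inference-handler:v2  # hypothetical image
```

Flux manages the Rollout manifest like any other resource; the Rollouts controller takes over pod replacement and traffic weighting from there.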
+ +### Positive Consequences + +* Canary and blue-green deployment strategies with automated analysis +* Integrates with Envoy Gateway for traffic splitting +* Dashboard for real-time deployment progress +* Same Argo ecosystem as existing Workflows (shared expertise) +* CRD-based — works with GitOps (Flux manages Rollout resources) + +### Negative Consequences + +* Another CRD set to manage alongside standard Deployments +* Not all workloads need progressive delivery (overhead for simple services) +* Dashboard currently available only via port-forward (no ingress) + +## Deployment Configuration + +| | | +|---|---| +| **Chart** | `argo-rollouts` from Argo HelmRepository | +| **Namespace** | `ci-cd` | +| **Replicas** | 1 | +| **Dashboard** | Enabled | +| **CRDs** | `CreateReplace` on install and upgrade | + +Managed by Flux Kustomization with `wait: true` to ensure the controller is ready before dependent Rollout resources are applied. + +## Use Cases + +| Strategy | When to Use | Example | +|----------|-------------|---------| +| Canary | Gradual traffic shift with metric analysis | AI inference endpoint updates | +| Blue-Green | Zero-downtime full cutover with instant rollback | Companions frontend releases | +| Rolling (standard) | Low-risk config changes | Most infrastructure services | + +## Links + +* Related to [ADR-0009](0009-dual-workflow-engines.md) (Argo ecosystem) +* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (CI/CD pipeline) +* [Argo Rollouts Documentation](https://argoproj.github.io/rollouts/) diff --git a/decisions/0051-keda-event-driven-autoscaling.md b/decisions/0051-keda-event-driven-autoscaling.md new file mode 100644 index 0000000..6c53762 --- /dev/null +++ b/decisions/0051-keda-event-driven-autoscaling.md @@ -0,0 +1,68 @@ +# KEDA Event-Driven Autoscaling + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Scale workloads based on external event sources rather than only CPU/memory metrics + +## Context and
Problem Statement + +Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory, but many homelab workloads have scaling signals from external systems — Envoy Gateway request queues, NATS queue depth, or GPU utilization. Scaling on the right signal reduces latency and avoids over-provisioning. + +How do we autoscale workloads based on external metrics like message queues, HTTP request rates, and custom Prometheus queries? + +## Decision Drivers + +* Scale on NATS queue depth for inference pipelines +* Scale on Envoy Gateway metrics for HTTP workloads +* Prometheus integration for arbitrary custom metrics +* CRD-based scalers compatible with Flux GitOps +* Low resource overhead for the scaler controller itself + +## Considered Options + +1. **KEDA** — Kubernetes Event-Driven Autoscaling +2. **Custom HPA with Prometheus Adapter** — HPA + external-metrics API +3. **Knative Serving** — Serverless autoscaler with scale-to-zero + +## Decision Outcome + +Chosen option: **KEDA**, because it provides a large catalog of built-in scalers (Prometheus, NATS, HTTP), supports scale-to-zero, and integrates cleanly with existing HelmRelease/Kustomization GitOps.
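A ScaledObject for the NATS-driven voice pipeline could be sketched as follows. The namespace, stream, consumer, endpoint, and threshold values are assumptions for illustration; the trigger metadata keys follow KEDA's NATS JetStream scaler:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: voice-pipeline
  namespace: ai-ml                   # assumed namespace
spec:
  scaleTargetRef:
    name: voice-pipeline             # Deployment to scale
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 2
  triggers:
    - type: nats-jetstream
      metadata:
        natsServerMonitoringEndpoint: "nats.nats.svc.cluster.local:8222"  # assumed
        account: "$G"                # default NATS account
        stream: "voice"              # assumed stream name
        consumer: "voice-worker"     # assumed consumer name
        lagThreshold: "10"           # messages of lag per replica
```

KEDA creates and owns the underlying HPA for the target, which is why manually defined HPAs on the same Deployment would conflict.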
+ +### Positive Consequences + +* 60+ built-in scalers covering all homelab event sources +* ScaledObject CRDs fit naturally in GitOps workflow +* Scale-to-zero for bursty workloads (saves GPU resources) +* ServiceMonitors for self-monitoring via Prometheus +* Grafana dashboard included for visibility + +### Negative Consequences + +* Additional CRDs and controller pods +* ScaledObject/TriggerAuthentication learning curve +* Potential conflict with manually defined HPAs + +## Deployment Configuration + +| | | +|---|---| +| **Chart** | `keda` OCI chart v2.19.0 | +| **Namespace** | `keda` | +| **Monitoring** | ServiceMonitor enabled, Grafana dashboard provisioned | +| **Webhooks** | Enabled | + +## Scaling Use Cases + +| Workload | Scaler | Signal | Target | +|----------|--------|--------|--------| +| Ray Serve inference | Prometheus | Pending request queue depth | 1-4 replicas | +| Envoy Gateway | Prometheus | Active connections per gateway | KEDA manages envoy proxy fleet | +| Voice pipeline | NATS | Message queue length | 0-2 replicas | +| Batch inference | Prometheus | Job queue size | 0-N GPU pods | + +## Links + +* Related to [ADR-0010](0010-scalable-inference-platform.md) (inference scaling) +* Related to [ADR-0038](0038-infrastructure-metrics-collection.md) (Prometheus metrics) +* [KEDA Documentation](https://keda.sh/docs/) diff --git a/decisions/0052-cluster-utilities-optimization.md b/decisions/0052-cluster-utilities-optimization.md new file mode 100644 index 0000000..65927c7 --- /dev/null +++ b/decisions/0052-cluster-utilities-optimization.md @@ -0,0 +1,104 @@ +# Cluster Utilities and Optimization + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Deploy supporting utilities that improve cluster efficiency, reliability, and operational overhead + +## Context and Problem Statement + +A Kubernetes cluster running diverse workloads benefits from several operational utilities — image caching to reduce pull times, workload
rebalancing for efficiency, automatic secret/configmap reloading, and shared storage provisioning. Each is small on its own, but collectively they significantly improve cluster operations. + +How do we manage these cross-cutting cluster utilities consistently? + +## Decision Drivers + +* Reduce container image pull latency across nodes +* Automatically rebalance workloads for even resource utilization +* Eliminate manual pod restarts when secrets/configmaps change +* Provide shared NFS storage class for ReadWriteMany workloads +* Minimal resource overhead per utility + +## Decision Outcome + +Deploy four cluster utilities — Spegel (image cache), Descheduler (pod rebalancing), Reloader (config reload), and CSI-NFS (NFS StorageClass) — each solving a distinct operational concern with minimal footprint. + +## Components + +### Spegel — Peer-to-Peer Image Registry Mirror + +Spegel distributes container images between nodes, so pulling an image already present on _any_ node avoids hitting the external registry. + +| | | +|---|---| +| **Chart** | `spegel` OCI chart v0.3.0 | +| **Namespace** | `spegel` | +| **Port** | 29999 | +| **Mode** | P2P mirror (DaemonSet, one pod per node) | + +**Mirrored Registries:** +- `docker.io`, `ghcr.io`, `quay.io`, `gcr.io` +- `registry.k8s.io`, `mcr.microsoft.com` +- `git.daviestechlabs.io` (Gitea), `public.ecr.aws` + +Spegel registers as a containerd mirror, intercepting pulls before they reach the internet. Especially valuable for large ML model images (5-20GB) that would otherwise be pulled repeatedly. + +### Descheduler — Workload Rebalancing + +The descheduler evicts pods to allow the scheduler to redistribute them more optimally.
+ +| | | +|---|---| +| **Chart** | `descheduler` v0.33.0 | +| **Namespace** | `descheduler` | +| **Mode** | Deployment (continuous) | +| **Strategy** | `LowNodeUtilization` | + +**Excluded Namespaces:** `ai-ml`, `kuberay`, `gitea` + +AI/ML and Gitea namespaces are excluded because GPU workloads and git repositories should not be disrupted by rebalancing. + +### Reloader — Automatic Config Reload + +Reloader watches for Secret and ConfigMap changes and triggers rolling restarts on Deployments/StatefulSets that reference them. + +| | | +|---|---| +| **Chart** | `reloader` v2.2.7 | +| **Namespace** | `reloader` | +| **Monitoring** | PodMonitor enabled | +| **Security** | Read-only root filesystem | + +Eliminates manual `kubectl rollout restart` after Vault secret rotations or config changes. + +### CSI-NFS — NFS StorageClass + +Provides a Kubernetes StorageClass backed by the NAS (candlekeep) NFS export. + +| | | +|---|---| +| **Chart** | `csi-driver-nfs` v4.13.0 | +| **Namespace** | `csi-nfs` | +| **StorageClass** | `nfs-slow` | +| **NFS Server** | `candlekeep` → `/kubernetes` | +| **NFS Version** | 4.1, `nconnect=16` | + +`nfs-slow` provides ReadWriteMany access for workloads that need shared storage (media library, ML artifacts, photo libraries). Named "slow" relative to Longhorn SSDs, not in absolute terms. The `nconnect=16` option enables 16 parallel NFS connections per mount for improved throughput. + +## Resource Overhead + +| Utility | Pods | CPU Request | Memory Request | +|---------|------|-------------|----------------| +| Spegel | 1 per node (DaemonSet) | — | — | +| Descheduler | 1 | — | — | +| Reloader | 1 | — | — | +| CSI-NFS | 1 controller + DaemonSet | — | — | +| **Total** | ~8-12 pods | Minimal | Minimal | + +All four utilities are lightweight and designed to run alongside workloads with negligible resource impact.
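The `nfs-slow` class described above roughly corresponds to a StorageClass like the following. Server, share, and mount options come from the table; the reclaim policy and binding mode are assumptions, and the parameter names follow csi-driver-nfs conventions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-slow
provisioner: nfs.csi.k8s.io
parameters:
  server: candlekeep       # NAS hostname
  share: /kubernetes       # exported path
reclaimPolicy: Retain      # assumed; keeps data when a PVC is deleted
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1
  - nconnect=16            # 16 parallel NFS connections per mount
```

Any PVC that requests `storageClassName: nfs-slow` with `ReadWriteMany` gets a subdirectory of the export, which is how the media and photo libraries share one NAS.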
+ +## Links + +* Related to [ADR-0026](0026-storage-strategy.md) (Longhorn + NFS storage strategy) +* Related to [ADR-0003](0003-bare-metal-kubernetes.md) (Talos container runtime / containerd) +* [Spegel](https://github.com/spegel-org/spegel) · [Descheduler](https://sigs.k8s.io/descheduler) · [Reloader](https://github.com/stakater/Reloader) · [CSI-NFS](https://github.com/kubernetes-csi/csi-driver-nfs) diff --git a/decisions/0053-vaultwarden-password-management.md b/decisions/0053-vaultwarden-password-management.md new file mode 100644 index 0000000..995f1e1 --- /dev/null +++ b/decisions/0053-vaultwarden-password-management.md @@ -0,0 +1,90 @@ +# Vaultwarden Password Management + +* Status: accepted +* Date: 2026-02-09 +* Deciders: Billy +* Technical Story: Self-host a Bitwarden-compatible password manager for personal and family credential management + +## Context and Problem Statement + +Password management is essential for security, and commercial Bitwarden plans charge per-user fees for family/team features. Vaultwarden provides a lightweight, Bitwarden-compatible server that runs all premium features without licensing costs. + +How do we self-host password management with the reliability and accessibility requirements of a critical personal service? + +## Decision Drivers + +* Bitwarden client compatibility (browser extensions, mobile apps, CLI) +* All premium features (TOTP, file attachments, organizations) without licensing +* High availability relative to importance (password manager is critical infrastructure) +* Public access for mobile/remote use +* Minimal attack surface + +## Considered Options + +1. **Vaultwarden** — Rust reimplementation of Bitwarden server API +2. **Bitwarden (official)** — Official self-hosted Bitwarden +3. **KeePass/KeePassXC** — File-based password manager with sync +4.
**1Password** — Commercial SaaS + +## Decision Outcome + +Chosen option: **Vaultwarden**, because it provides full Bitwarden client compatibility in a single lightweight container, supports all premium features, and uses PostgreSQL for reliable storage. + +### Positive Consequences + +* All Bitwarden clients work natively (browser, mobile, desktop, CLI) +* All premium features unlocked (TOTP, attachments, emergency access, organizations) +* Single container (~50MB RAM) instead of Bitwarden's 6+ containers +* PostgreSQL backend via CNPG for reliable, backed-up storage +* Existing Bitwarden vaults can be migrated via import + +### Negative Consequences + +* Third-party reimplementation — may lag behind official Bitwarden features +* Self-hosted means self-responsible for backups and availability +* Public-facing service increases attack surface + +## Deployment Configuration + +| | | +|---|---| +| **Image** | `vaultwarden/server:1.35.2` | +| **Namespace** | `productivity` | +| **Chart** | bjw-s `app-template` | +| **Signups** | Disabled (`SIGNUPS_ALLOWED=false`) | +| **Admin panel** | Disabled | +| **Storage** | 10Gi Longhorn PVC (attachments/icons) | + +### Database + +PostgreSQL via **CloudNativePG**: +- 1 instance, `amd64` node affinity +- 10Gi Longhorn storage +- Credentials from Vault via ExternalSecret + +### Network Access + +| | | +|---|---| +| **Gateway** | `envoy-external` | +| **URL** | `vaultwarden.daviestechlabs.io` | +| **TLS** | Let's Encrypt wildcard (DNS-01 via Cloudflare) | + +Publicly accessible via Cloudflare Tunnel so mobile apps and browser extensions work from anywhere. + +## Security Hardening + +* New user signups disabled — accounts provisioned manually +* Admin panel disabled — reduces attack surface +* Vault credentials from HashiCorp Vault (not inline) +* WebSocket support for real-time sync between clients +* All Bitwarden data encrypted client-side (server never sees plaintext) + +Vaultwarden serves only encrypted blobs.
The encryption key never leaves the client, so even a full server compromise does not expose plaintext passwords. + +## Links + +* Related to [ADR-0027](0027-database-strategy.md) (CNPG PostgreSQL) +* Related to [ADR-0044](0044-dns-and-external-access.md) (Cloudflare Tunnel access) +* Related to [ADR-0045](0045-tls-certificate-strategy.md) (Let's Encrypt TLS) +* [Vaultwarden](https://github.com/dani-garcia/vaultwarden)