docs: add ADRs 0043-0053 covering remaining architecture gaps

New ADRs:
- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
2026-02-09 18:36:39 -05:00
parent 49ce970780
commit 5846d0dc16
12 changed files with 1141 additions and 1 deletions


@@ -0,0 +1,77 @@
# TLS Certificate Strategy
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Automate TLS certificate provisioning for both public and internal services
## Context and Problem Statement
Every HTTPS service in the cluster needs a valid TLS certificate. Public services need trusted certificates (Let's Encrypt), while internal services can use self-signed certificates. Manual certificate management doesn't scale across 30+ services.
How do we automate certificate issuance and renewal for both public and internal domains?
## Decision Drivers
* Fully automated certificate lifecycle (issuance, renewal, rotation)
* Wildcard certificates to avoid per-service certificate sprawl
* DNS-01 challenge for wildcard support (HTTP-01 can't do wildcards)
* Internal services need certificates too (browser warnings are unacceptable)
* Zero downtime during renewal
## Decision Outcome
Deploy **cert-manager** with two ClusterIssuers: Let's Encrypt (DNS-01 via Cloudflare) for public domains, and a self-signed issuer for internal domains.
## Deployment Configuration
| | |
|---|---|
| **Chart** | `cert-manager` from `oci://quay.io/jetstack/charts/cert-manager` |
| **Version** | v1.19.3 |
| **Namespace** | `cert-manager` |
| **Replicas** | 1 |
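
The table above, together with the DoH nameservers and ServiceMonitor mentioned elsewhere in this ADR, could translate into Helm values roughly like the following. This is a minimal sketch, not the repository's actual values file: the key names come from the upstream cert-manager chart, while the `https://…/dns-query` URL form for the nameservers and the choice to install CRDs through the chart are assumptions.

```yaml
# Hypothetical values for the cert-manager Helm chart (v1.19.3).
crds:
  enabled: true            # assumption: CRDs are installed by the chart
replicaCount: 1            # matches the "Replicas" row above
# Assumption: the DoH resolvers are passed as full URLs; the ADR lists
# them only as 1.1.1.1:443 and 1.0.0.1:443.
dns01RecursiveNameservers: "https://1.1.1.1:443/dns-query,https://1.0.0.1:443/dns-query"
dns01RecursiveNameserversOnly: true
prometheus:
  enabled: true
  servicemonitor:
    enabled: true          # the ServiceMonitor listed under Integration Points
```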
## Certificate Issuers
### letsencrypt-production (Public)
| | |
|---|---|
| **Type** | ACME (Let's Encrypt) |
| **Challenge** | DNS-01 via Cloudflare API |
| **Nameservers** | `1.1.1.1:443`, `1.0.0.1:443` (DNS-over-HTTPS) |
| **Zone** | `daviestechlabs.io` |

The issuer uses a Cloudflare API token (SOPS-encrypted) to create the TXT records for DNS-01 challenges. cert-manager's recursive nameservers are set to Cloudflare's DoH endpoints so that propagation checks complete faster.
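
A ClusterIssuer matching this description might look like the sketch below. The issuer name and zone come from this ADR; the contact email, the account-key Secret name, and the token Secret name and key are placeholders, not values taken from the repository.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-production-account   # hypothetical Secret for the ACME account key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token     # hypothetical name of the SOPS-encrypted Secret
              key: api-token                 # hypothetical key within that Secret
        selector:
          dnsZones:
            - daviestechlabs.io
```

For a ClusterIssuer, the token Secret is looked up in cert-manager's cluster resource namespace, which defaults to the namespace cert-manager runs in.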
### selfsigned-internal (Private)
| | |
|---|---|
| **Type** | Self-Signed |
| **Use** | `*.lab.daviestechlabs.io` internal services |

Used for internal services where browser trust isn't critical (admin UIs accessed by the operator).
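
For comparison, a self-signed ClusterIssuer needs no external configuration at all. The following is a minimal sketch using the issuer name from this ADR:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-internal
spec:
  selfSigned: {}   # each certificate is signed with its own private key
```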
## Certificates
| Domain | Issuer | Type | Duration | Renewal |
|--------|--------|------|----------|---------|
| `daviestechlabs.io` + `*.daviestechlabs.io` | letsencrypt-production | Wildcard | 90 days (LE default) | Auto |
| `lab.daviestechlabs.io` + `*.lab.daviestechlabs.io` | selfsigned-internal | Wildcard | 1 year | 30 days before expiry |

Wildcard certificates are used to avoid creating individual certificates per service. Both certificates are referenced by the Envoy Gateway listeners.
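
The two rows above could be expressed as Certificate resources along these lines. This is a sketch under stated assumptions: the resource names, the `network` namespace, and the Secret names are hypothetical. The durations are the table's values converted to hours (1 year = 8760h, 30 days = 720h).

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: daviestechlabs-io          # hypothetical name
  namespace: network               # hypothetical namespace (where the Gateway lives)
spec:
  secretName: daviestechlabs-io-tls   # hypothetical Secret name
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-production
  dnsNames:
    - daviestechlabs.io
    - "*.daviestechlabs.io"
  # No duration set: Let's Encrypt decides the lifetime (90 days).
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: lab-daviestechlabs-io      # hypothetical name
  namespace: network               # hypothetical namespace
spec:
  secretName: lab-daviestechlabs-io-tls   # hypothetical Secret name
  issuerRef:
    kind: ClusterIssuer
    name: selfsigned-internal
  dnsNames:
    - lab.daviestechlabs.io
    - "*.lab.daviestechlabs.io"
  duration: 8760h      # 1 year
  renewBefore: 720h    # renew 30 days before expiry
```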
## Integration Points
- **Cloudflare:** API token for DNS-01 challenges (stored as SOPS-encrypted Secret)
- **Envoy Gateway:** References certificates in Gateway listener TLS configuration
- **Flux:** Health check validates ClusterIssuer readiness before dependent resources
- **Prometheus:** ServiceMonitor enabled for cert-manager metrics
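
Two of these integration points can be illustrated with short manifests. The sketch below is not the repository's configuration: the Gateway name, namespace, GatewayClass name, Secret name, Flux Kustomization name, path, and source are all hypothetical. It shows a Gateway listener that terminates TLS with a cert-manager-issued Secret, and a Flux Kustomization that waits for the ClusterIssuer to become ready.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external                   # hypothetical Gateway name
  namespace: network               # hypothetical namespace
spec:
  gatewayClassName: envoy          # hypothetical GatewayClass name
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.daviestechlabs.io"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: daviestechlabs-io-tls   # hypothetical Secret written by cert-manager
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager-issuers       # hypothetical name
  namespace: flux-system
spec:
  interval: 1h
  prune: true
  path: ./kubernetes/apps/cert-manager/issuers   # hypothetical path
  sourceRef:
    kind: GitRepository
    name: flux-system              # hypothetical source name
  healthChecks:
    - apiVersion: cert-manager.io/v1
      kind: ClusterIssuer
      name: letsencrypt-production
```

Resources that need the issuer can then list this Kustomization under `dependsOn`. If the certificate Secret lives in a different namespace from the Gateway, the Gateway API additionally requires a ReferenceGrant in the Secret's namespace.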
## Links
* Related to [ADR-0044](0044-dns-and-external-access.md) (DNS architecture)
* Related to [ADR-0010](0010-use-envoy-gateway.md) (Gateway TLS listeners)
* [cert-manager Documentation](https://cert-manager.io/docs/)