diff --git a/decisions/0060-internal-pki-vault.md b/decisions/0060-internal-pki-vault.md
new file mode 100644
index 0000000..180cb92
--- /dev/null
+++ b/decisions/0060-internal-pki-vault.md
@@ -0,0 +1,192 @@
+# Internal PKI with Vault and cert-manager
+
+* Status: accepted
+* Date: 2026-02-16
+* Deciders: Billy
+* Technical Story: Replace self-signed internal certificates with a proper CA chain using Vault PKI
+
+## Context and Problem Statement
+
+Internal services on `*.lab.daviestechlabs.io` use a `selfsigned-internal` ClusterIssuer. Each cert-manager Certificate gets its own unique self-signed root; there is no shared CA. This causes problems:
+
+- Off-cluster devices (gravenhollow, candlekeep, waterdeep) have no way to obtain trusted certs
+- Clients cannot verify server certs because there's no CA to trust
+- RustFS on gravenhollow has a TLS cert only valid for `localhost`, breaking S3 clients
+- No certificate chain means no ability to distribute a single CA bundle across the fleet
+
+The homelab already runs HashiCorp Vault in HA mode (3 replicas, Raft storage) and cert-manager with Let's Encrypt for public certs. How do we issue trusted internal certificates for both in-cluster and off-cluster services?
+
+## Decision Drivers
+
+* Vault is already deployed and used for secrets management
+* cert-manager is already deployed with ClusterIssuer support
+* Off-cluster devices (NAS, Mac Mini) need valid TLS certs
+* Single CA root to trust across all machines
+* Automated renewal for in-cluster certs via cert-manager
+* Must not disrupt existing Let's Encrypt public certs
+
+## Considered Options
+
+1. **Vault PKI secrets engine + cert-manager Vault ClusterIssuer**
+2. **step-ca (Smallstep) as standalone internal CA**
+3. **Keep self-signed, distribute individual certs manually**
+
+## Decision Outcome
+
+Chosen option: **Option 1 (Vault PKI + cert-manager Vault ClusterIssuer)**, because it builds on existing infrastructure (Vault and cert-manager), provides a proper two-tier CA chain, and supports both in-cluster automated renewal and off-cluster cert issuance via the Vault API.
+
+### Positive Consequences
+
+* Single root CA: one trust anchor for the entire homelab
+* cert-manager automatically renews in-cluster certs via Vault
+* Off-cluster devices request certs via `vault write` CLI
+* Two-tier CA (root → intermediate) follows PKI best practices
+* Root CA key never leaves Vault
+* Existing Let's Encrypt public certs are unaffected
+
+### Negative Consequences
+
+* Vault becomes a dependency for internal TLS issuance
+* Off-cluster cert renewal requires manual or scripted `vault write` (no ACME)
+* CA root cert must be distributed to trust stores on all machines
+* Vault PKI engine adds operational complexity
+
+## Architecture
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                      Vault PKI (security namespace)                      │
+│                                                                          │
+│   ┌──────────────────────┐       ┌──────────────────────────┐            │
+│   │  pki/ (Root CA)      │       │  pki_int/ (Intermediate) │            │
+│   │                      │ signs │                          │            │
+│   │  Homelab Root CA     │──────▶│  Homelab Intermediate CA │            │
+│   │  TTL: 10 years       │       │  TTL: 5 years            │            │
+│   │                      │       │                          │            │
+│   │  (only signs         │       │  Role: lab-internal      │            │
+│   │   intermediates)     │       │  *.lab.daviestechlabs.io │            │
+│   └──────────────────────┘       │  TTL: 90 days (default)  │            │
+│                                  │  Key: EC P-256           │            │
+│                                  └─────────┬────────────────┘            │
+│                                            │                             │
+│                            ┌───────────────┼──────────────┐              │
+│                            │               │              │              │
+│                            ▼               ▼              ▼              │
+│                    ┌──────────────┐ ┌─────────────┐ ┌──────────┐         │
+│                    │ cert-manager │ │ vault write │ │  Future  │         │
+│                    │ ClusterIssuer│ │    (CLI)    │ │   ACME   │         │
+│                    │vault-internal│ │             │ │          │         │
+│                    └───────┬──────┘ └──────┬──────┘ └──────────┘         │
+│                            │               │                             │
+└────────────────────────────┼───────────────┼─────────────────────────────┘
+                             │               │
+               ┌─────────────▼───┐    ┌──────▼────────────────────┐
+               │ In-Cluster      │    │ Off-Cluster               │
+               │                 │    │                           │
+               │ *.lab.dav...    │    │ gravenhollow (RustFS)     │
+               │ envoy-internal  │    │ candlekeep (QNAP)         │
+               │ auto-renewed    │    │ waterdeep (Mac Mini)      │
+               │ by cert-manager │    │ manual/scripted renewal   │
+               └─────────────────┘    └───────────────────────────┘
+```
+
+## Implementation
+
+### Vault PKI Configuration (Phases 1–4, completed)
+
+```bash
+# Phase 1: Root CA
+vault secrets enable -path=pki pki
+vault secrets tune -max-lease-ttl=87600h pki
+vault write pki/root/generate/internal \
+  common_name="Homelab Root CA" issuer_name="homelab-root" ttl=87600h
+vault write pki/config/urls \
+  issuing_certificates="http://vault.security.svc:8200/v1/pki/ca" \
+  crl_distribution_points="http://vault.security.svc:8200/v1/pki/crl"
+
+# Phase 2: Intermediate CA
+vault secrets enable -path=pki_int pki
+vault secrets tune -max-lease-ttl=43800h pki_int
+vault write -field=csr pki_int/intermediate/generate/internal \
+  common_name="Homelab Intermediate CA" issuer_name="homelab-intermediate" \
+  > /tmp/intermediate.csr
+vault write -field=certificate pki/root/sign-intermediate \
+  issuer_ref="homelab-root" csr=@/tmp/intermediate.csr \
+  format=pem_bundle ttl=43800h > /tmp/intermediate.crt
+vault write pki_int/intermediate/set-signed certificate=@/tmp/intermediate.crt
+
+# Phase 3: PKI Role
+vault write pki_int/roles/lab-internal \
+  allowed_domains="lab.daviestechlabs.io" \
+  allow_subdomains=true allow_bare_domains=true \
+  max_ttl=8760h ttl=2160h key_type=ec key_bits=256
+
+# Phase 4: Policy and Kubernetes Auth Role
+vault policy write cert-manager-pki - <