# Internal PKI with Vault and cert-manager

* Status: accepted
* Date: 2026-02-16
* Deciders: Billy
* Technical Story: Replace self-signed internal certificates with a proper CA chain using Vault PKI

## Context and Problem Statement

Internal services on `*.lab.daviestechlabs.io` use a `selfsigned-internal` ClusterIssuer. Each cert-manager Certificate gets its own unique self-signed root, so there is no shared CA. This causes problems:

- Off-cluster devices (gravenhollow, candlekeep, waterdeep) have no way to obtain trusted certs
- Clients cannot verify server certs because there's no CA to trust
- RustFS on gravenhollow has a TLS cert only valid for `localhost`, breaking S3 clients
- No certificate chain means no ability to distribute a single CA bundle across the fleet

The homelab already runs HashiCorp Vault in HA mode (3 replicas, Raft storage) and cert-manager with Let's Encrypt for public certs. How do we issue trusted internal certificates for both in-cluster and off-cluster services?

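
The missing shared CA can be observed on any certificate issued by `selfsigned-internal`: its subject and issuer are the same name. The helper below is a minimal sketch, assuming `openssl` is installed and the certificate has been exported to a local PEM file. `is_self_signed` is a hypothetical name, not part of cert-manager or Vault.

```bash
# Sketch: succeed (exit 0) when a PEM certificate is self-signed, judged by
# comparing its subject and issuer distinguished names.
# is_self_signed is a hypothetical helper; the file path is supplied by the caller.
is_self_signed() {
  cert_file="$1"
  subject="$(openssl x509 -in "$cert_file" -noout -subject -nameopt RFC2253 | sed 's/^subject=//')"
  issuer="$(openssl x509 -in "$cert_file" -noout -issuer -nameopt RFC2253 | sed 's/^issuer=//')"
  [ "$subject" = "$issuer" ]
}
```

One way to use it is to export a secret with `kubectl get secret <name> -o jsonpath='{.data.tls\.crt}' | base64 -d > cert.pem` and then run `is_self_signed cert.pem && echo "self-signed"`. After the migration described in this ADR, the same check should fail for leaf certificates.
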
## Decision Drivers

* Vault is already deployed and used for secrets management
* cert-manager is already deployed with ClusterIssuer support
* Off-cluster devices (NAS, Mac Mini) need valid TLS certs
* Single CA root to trust across all machines
* Automated renewal for in-cluster certs via cert-manager
* Must not disrupt existing Let's Encrypt public certs

## Considered Options

1. **Vault PKI secrets engine + cert-manager Vault ClusterIssuer**
2. **step-ca (Smallstep) as standalone internal CA**
3. **Keep self-signed, distribute individual certs manually**

## Decision Outcome

Chosen option: **Option 1: Vault PKI + cert-manager Vault ClusterIssuer**, because it builds on existing infrastructure (Vault and cert-manager), provides a proper two-tier CA chain, and supports both in-cluster automated renewal and off-cluster cert issuance via the Vault API.

### Positive Consequences

* Single root CA: one trust anchor for the entire homelab
* cert-manager automatically renews in-cluster certs via Vault
* Off-cluster devices request certs via `vault write` CLI
* Two-tier CA (root → intermediate) follows PKI best practices
* Root CA key never leaves Vault
* Existing Let's Encrypt public certs are unaffected

### Negative Consequences

* Vault becomes a dependency for internal TLS issuance
* Off-cluster cert renewal requires manual or scripted `vault write` (no ACME)
* CA root cert must be distributed to trust stores on all machines
* Vault PKI engine adds operational complexity

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                      Vault PKI (security namespace)                      │
│                                                                          │
│  ┌──────────────────────┐       ┌──────────────────────────┐             │
│  │  pki/ (Root CA)      │       │  pki_int/ (Intermediate) │             │
│  │                      │ signs │                          │             │
│  │  Homelab Root CA     │──────▶│  Homelab Intermediate CA │             │
│  │  TTL: 10 years       │       │  TTL: 5 years            │             │
│  │                      │       │                          │             │
│  │  (only signs         │       │  Role: lab-internal      │             │
│  │   intermediates)     │       │  *.lab.daviestechlabs.io │             │
│  └──────────────────────┘       │  TTL: 90 days (default)  │             │
│                                 │  Key: EC P-256           │             │
│                                 └─────────┬────────────────┘             │
│                                           │                              │
│                          ┌────────────────┼──────────────┐               │
│                          │                │              │               │
│                          ▼                ▼              ▼               │
│                 ┌────────────────┐ ┌─────────────┐ ┌──────────┐          │
│                 │ cert-manager   │ │ vault write │ │  Future  │          │
│                 │ ClusterIssuer  │ │    (CLI)    │ │   ACME   │          │
│                 │ vault-internal │ │             │ │          │          │
│                 └────────┬───────┘ └──────┬──────┘ └──────────┘          │
│                          │                │                              │
└──────────────────────────┼────────────────┼──────────────────────────────┘
                           │                │
            ┌──────────────▼──┐  ┌──────────▼──────────────┐
            │ In-Cluster      │  │ Off-Cluster             │
            │                 │  │                         │
            │ *.lab.dav...    │  │ gravenhollow (RustFS)   │
            │ envoy-internal  │  │ candlekeep (QNAP)       │
            │ auto-renewed    │  │ waterdeep (Mac Mini)    │
            │ by cert-manager │  │ manual/scripted renewal │
            └─────────────────┘  └─────────────────────────┘
```

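
The root → intermediate → leaf relationship in the diagram can be checked from any machine with `openssl`. This is a minimal sketch, assuming the three certificates have been saved as local PEM files. `verify_chain` and the file names in the usage note are hypothetical.

```bash
# Sketch: verify that a leaf certificate chains to the root via the intermediate.
# All three arguments are paths to local PEM files chosen by the caller.
verify_chain() {
  root_crt="$1"          # trust anchor, e.g. an export of the Homelab Root CA
  intermediate_crt="$2"  # untrusted helper cert, e.g. the Homelab Intermediate CA
  leaf_crt="$3"          # certificate under test
  # Only the root is trusted; the intermediate is offered as a chain-building aid.
  openssl verify -CAfile "$root_crt" -untrusted "$intermediate_crt" "$leaf_crt"
}
```

A call such as `verify_chain homelab-root-ca.crt homelab-intermediate-ca.crt gravenhollow.crt` prints `gravenhollow.crt: OK` when the chain is intact and exits non-zero otherwise. Trusting only the root means a future intermediate rotation does not require touching client trust stores.
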
## Implementation

### Vault PKI Configuration (Phases 1–4, completed)

```bash
# Phase 1: Root CA
vault secrets enable -path=pki pki
vault secrets tune -max-lease-ttl=87600h pki
vault write pki/root/generate/internal \
  common_name="Homelab Root CA" issuer_name="homelab-root" ttl=87600h
vault write pki/config/urls \
  issuing_certificates="http://vault.security.svc:8200/v1/pki/ca" \
  crl_distribution_points="http://vault.security.svc:8200/v1/pki/crl"

# Phase 2: Intermediate CA
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int
vault write -field=csr pki_int/intermediate/generate/internal \
  common_name="Homelab Intermediate CA" issuer_name="homelab-intermediate" \
  > /tmp/intermediate.csr
vault write -field=certificate pki/root/sign-intermediate \
  issuer_ref="homelab-root" csr=@/tmp/intermediate.csr \
  format=pem_bundle ttl=43800h > /tmp/intermediate.crt
vault write pki_int/intermediate/set-signed certificate=@/tmp/intermediate.crt

# Phase 3: PKI Role
vault write pki_int/roles/lab-internal \
  allowed_domains="lab.daviestechlabs.io" \
  allow_subdomains=true allow_bare_domains=true \
  max_ttl=8760h ttl=2160h key_type=ec key_bits=256

# Phase 4: Policy and Kubernetes Auth Role
vault policy write cert-manager-pki - <<EOF
path "pki_int/sign/lab-internal" { capabilities = ["create", "update"] }
path "pki_int/issue/lab-internal" { capabilities = ["create"] }
EOF
vault write auth/kubernetes/role/cert-manager \
  bound_service_account_names=cert-manager \
  bound_service_account_namespaces=cert-manager \
  audience="https://192.168.100.20:6443" \
  policies=cert-manager-pki ttl=1h
```

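
The hour-based TTLs in Phases 1 to 3 correspond to the durations shown in the architecture diagram. The conversion can be sketched as follows; `ttl_hours_to_days` is a hypothetical helper, and years are counted as 365 days.

```bash
# Sketch: convert a Vault-style hour TTL such as "2160h" to whole days.
ttl_hours_to_days() {
  hours="${1%h}"            # strip the trailing "h" suffix
  echo $(( hours / 24 ))
}

# TTLs used above: root CA, intermediate CA, role max_ttl, role default ttl
for ttl in 87600h 43800h 8760h 2160h; do
  printf '%s = %s days\n' "$ttl" "$(ttl_hours_to_days "$ttl")"
done
# 87600h = 3650 days  (10 years, root CA)
# 43800h = 1825 days  (5 years, intermediate CA)
# 8760h = 365 days    (1 year, role max_ttl)
# 2160h = 90 days     (role default ttl)
```
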
### Phase 5: Kubernetes Manifests (GitOps)

New `vault-internal` ClusterIssuer replaces `selfsigned-internal`:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-internal
spec:
  vault:
    server: http://vault.security.svc:8200
    path: pki_int/sign/lab-internal
    auth:
      kubernetes:
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: cert-manager
```

Envoy internal wildcard certificate updated to use `vault-internal`.

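
The Certificate change itself is not shown in this ADR. As a minimal sketch of what a Certificate pointing at `vault-internal` might look like, the helper below prints a manifest that can be piped to `kubectl apply -f -` or committed to the GitOps repo. `render_certificate`, the resource name, the namespace, and the `<name>-tls` secret naming are all assumptions for illustration; only the issuer name and the wildcard DNS name come from this document.

```bash
# Sketch: print a cert-manager Certificate that requests a cert from the
# vault-internal ClusterIssuer. Name, namespace and secretName are hypothetical.
render_certificate() {
  name="$1"
  namespace="$2"
  dns_name="$3"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  secretName: ${name}-tls
  dnsNames:
    - "${dns_name}"
  privateKey:
    algorithm: ECDSA   # matches key_type=ec on the lab-internal role
    size: 256          # matches key_bits=256 (P-256)
  issuerRef:
    name: vault-internal
    kind: ClusterIssuer
    group: cert-manager.io
EOF
}
```

For example, `render_certificate envoy-internal network '*.lab.daviestechlabs.io'` prints a wildcard Certificate. The `privateKey` block is set explicitly because the `lab-internal` role is configured with `key_type=ec key_bits=256`, so a CSR with cert-manager's default RSA key would likely be rejected by Vault.
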
### Phase 6: Off-Cluster Cert Issuance

```bash
vault write -format=json pki_int/issue/lab-internal \
  common_name="gravenhollow.lab.daviestechlabs.io" \
  alt_names="gravenhollow.lab.daviestechlabs.io" \
  ttl=8760h
# Extract certificate, private_key, and ca_chain from the "data" object of the JSON output
```

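
The extraction step can be sketched as below, assuming `jq` is installed and the JSON from the `vault write` call has been saved to a file. `split_issue_json` and the output file names are hypothetical. The private key is returned only once at issuance, so the JSON should be saved rather than just printed.

```bash
# Sketch: split the saved JSON from `vault write -format=json pki_int/issue/...`
# into PEM files. Requires jq. Output file names are assumptions.
# The body runs in a subshell so umask and set -e do not leak to the caller.
split_issue_json() (
  set -e
  json_file="$1"   # saved output of the vault write command
  out_dir="$2"     # existing directory to write the PEM files into
  host="$3"        # base name for the output files
  umask 077        # keep the private key readable by the owner only
  jq -r '.data.certificate' "$json_file" > "$out_dir/$host.crt"
  jq -r '.data.private_key' "$json_file" > "$out_dir/$host.key"
  jq -r '.data.ca_chain[]'  "$json_file" > "$out_dir/$host.chain.crt"
  # Leaf first, then the CA chain: the order most TLS servers expect
  cat "$out_dir/$host.crt" "$out_dir/$host.chain.crt" > "$out_dir/$host.fullchain.crt"
)
```

A typical call would be `split_issue_json issue.json /etc/ssl/lab gravenhollow`, after redirecting the `vault write` output to `issue.json`. The JSON file contains the private key and should be deleted once the files are in place.
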
### Phase 7: CA Distribution

The root CA cert is distributed to:
- **Kubernetes pods**: via ConfigMap `homelab-ca-bundle` or cert-manager `cainjector`
- **waterdeep (macOS)**: `sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain homelab-root-ca.crt`
- **gravenhollow / candlekeep**: installed into OS trust store

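
Before a copy of the root certificate is added to a trust store, its fingerprint can be compared against the one Vault serves. This is a minimal sketch, assuming `openssl` is available on the target machine. `cert_fingerprint` is a hypothetical helper.

```bash
# Sketch: print the SHA-256 fingerprint of a PEM certificate so that the copy
# on each machine can be compared with the certificate served by Vault.
cert_fingerprint() {
  # Output looks like "sha256 Fingerprint=AB:CD:..."; keep only the hex part.
  openssl x509 -in "$1" -noout -fingerprint -sha256 | cut -d= -f2
}

# Possible usage (assumes the Vault address from Phase 1 is reachable):
#   curl -s http://vault.security.svc:8200/v1/pki/ca/pem > from-vault.crt
#   [ "$(cert_fingerprint from-vault.crt)" = "$(cert_fingerprint homelab-root-ca.crt)" ]
```
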
## Certificate Inventory

| Service | Issuer | Renewal | Notes |
|---------|--------|---------|-------|
| `*.daviestechlabs.io` | `letsencrypt-production` | Auto (cert-manager) | Public, unchanged |
| `*.lab.daviestechlabs.io` | `vault-internal` | Auto (cert-manager) | Envoy internal gateway |
| `gravenhollow.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | RustFS S3, NFS |
| `candlekeep.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | QNAP NAS |
| `waterdeep.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | Mac Mini dev workstation |

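
For the Manual/cron rows, a scheduled job needs a way to decide when a certificate is due for re-issuance. The check below is a minimal sketch using `openssl x509 -checkend`. `needs_renewal` and the 30-day default window are assumptions, not part of the documented setup.

```bash
# Sketch: succeed (exit 0) when the certificate expires within the given number
# of days, so a cron job knows to re-run the `vault write` issuance.
# A missing or unreadable file also counts as needing renewal.
needs_renewal() {
  cert_file="$1"
  days="${2:-30}"   # renewal window in days; 30 is an assumed default
  # -checkend exits 0 when the cert is still valid N seconds from now
  ! openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" >/dev/null
}
```

A cron wrapper could then run something like `needs_renewal /etc/ssl/lab/gravenhollow.crt 30 && ./reissue.sh`, where `reissue.sh` is a hypothetical script wrapping the Phase 6 command.
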
## Links

* [Vault PKI Secrets Engine](https://developer.hashicorp.com/vault/docs/secrets/pki)
* [cert-manager Vault Issuer](https://cert-manager.io/docs/configuration/vault/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage strategy (gravenhollow S3)
* Related: [ADR-0037](0037-node-naming-conventions.md) - Node naming conventions
* Related: [ADR-0059](0059-mac-mini-ray-worker.md) - Mac Mini Ray worker