# Internal PKI with Vault and cert-manager
* Status: accepted
* Date: 2026-02-16
* Deciders: Billy
* Technical Story: Replace self-signed internal certificates with a proper CA chain using Vault PKI
## Context and Problem Statement
Internal services on `*.lab.daviestechlabs.io` use a `selfsigned-internal` ClusterIssuer. Each cert-manager Certificate gets its own unique self-signed root — there is no shared CA. This causes problems:
- Off-cluster devices (gravenhollow, candlekeep, waterdeep) have no way to obtain trusted certs
- Clients cannot verify server certs because there's no CA to trust
- RustFS on gravenhollow has a TLS cert only valid for `localhost`, breaking S3 clients
- No certificate chain means no ability to distribute a single CA bundle across the fleet
The homelab already runs HashiCorp Vault in HA mode (3 replicas, Raft storage) and cert-manager with Let's Encrypt for public certs. How do we issue trusted internal certificates for both in-cluster and off-cluster services?
## Decision Drivers
* Vault is already deployed and used for secrets management
* cert-manager is already deployed with ClusterIssuer support
* Off-cluster devices (NAS, Mac Mini) need valid TLS certs
* Single CA root to trust across all machines
* Automated renewal for in-cluster certs via cert-manager
* Must not disrupt existing Let's Encrypt public certs
## Considered Options
1. **Vault PKI secrets engine + cert-manager Vault ClusterIssuer**
2. **step-ca (Smallstep) as standalone internal CA**
3. **Keep self-signed, distribute individual certs manually**
## Decision Outcome
Chosen option: **Option 1 — Vault PKI + cert-manager Vault ClusterIssuer**, because it builds on existing infrastructure (Vault and cert-manager), provides a proper two-tier CA chain, and supports both in-cluster automated renewal and off-cluster cert issuance via the Vault API.
### Positive Consequences
* Single root CA — one trust anchor for the entire homelab
* cert-manager automatically renews in-cluster certs via Vault
* Off-cluster devices request certs via `vault write` CLI
* Two-tier CA (root → intermediate) follows PKI best practices
* Root CA key never leaves Vault
* Existing Let's Encrypt public certs are unaffected
### Negative Consequences
* Vault becomes a dependency for internal TLS issuance
* Off-cluster cert renewal requires manual or scripted `vault write` (no ACME)
* CA root cert must be distributed to trust stores on all machines
* Vault PKI engine adds operational complexity
## Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│                      Vault PKI (security namespace)                      │
│                                                                          │
│  ┌──────────────────────┐       ┌──────────────────────────┐             │
│  │  pki/ (Root CA)      │       │  pki_int/ (Intermediate) │             │
│  │                      │ signs │                          │             │
│  │  Homelab Root CA     │──────▶│  Homelab Intermediate CA │             │
│  │  TTL: 10 years       │       │  TTL: 5 years            │             │
│  │                      │       │                          │             │
│  │  (only signs         │       │  Role: lab-internal      │             │
│  │   intermediates)     │       │  *.lab.daviestechlabs.io │             │
│  └──────────────────────┘       │  TTL: 90 days (default)  │             │
│                                 │  Key: EC P-256           │             │
│                                 └────────────┬─────────────┘             │
│                                              │                           │
│                            ┌─────────────────┼─────────────┐             │
│                            │                 │             │             │
│                            ▼                 ▼             ▼             │
│                    ┌────────────────┐ ┌─────────────┐ ┌──────────┐       │
│                    │ cert-manager   │ │ vault write │ │  Future  │       │
│                    │ ClusterIssuer  │ │    (CLI)    │ │   ACME   │       │
│                    │ vault-internal │ │             │ │          │       │
│                    └───────┬────────┘ └──────┬──────┘ └──────────┘       │
│                            │                 │                           │
└────────────────────────────┼─────────────────┼───────────────────────────┘
                             │                 │
                ┌────────────▼─────┐  ┌────────▼──────────────────┐
                │  In-Cluster      │  │  Off-Cluster              │
                │                  │  │                           │
                │  *.lab.dav...    │  │  gravenhollow (RustFS)    │
                │  envoy-internal  │  │  candlekeep (QNAP)        │
                │  auto-renewed    │  │  waterdeep (Mac Mini)     │
                │  by cert-manager │  │  manual/scripted renewal  │
                └──────────────────┘  └───────────────────────────┘
```
## Implementation
### Vault PKI Configuration (Phases 1-4, completed)
```bash
# Phase 1: Root CA
vault secrets enable -path=pki pki
vault secrets tune -max-lease-ttl=87600h pki
vault write pki/root/generate/internal \
  common_name="Homelab Root CA" issuer_name="homelab-root" ttl=87600h
vault write pki/config/urls \
  issuing_certificates="http://vault.security.svc:8200/v1/pki/ca" \
  crl_distribution_points="http://vault.security.svc:8200/v1/pki/crl"

# Phase 2: Intermediate CA
vault secrets enable -path=pki_int pki
vault secrets tune -max-lease-ttl=43800h pki_int
vault write -field=csr pki_int/intermediate/generate/internal \
  common_name="Homelab Intermediate CA" issuer_name="homelab-intermediate" \
  > /tmp/intermediate.csr
vault write -field=certificate pki/root/sign-intermediate \
  issuer_ref="homelab-root" csr=@/tmp/intermediate.csr \
  format=pem_bundle ttl=43800h > /tmp/intermediate.crt
vault write pki_int/intermediate/set-signed certificate=@/tmp/intermediate.crt

# Phase 3: PKI Role
vault write pki_int/roles/lab-internal \
  allowed_domains="lab.daviestechlabs.io" \
  allow_subdomains=true allow_bare_domains=true \
  max_ttl=8760h ttl=2160h key_type=ec key_bits=256

# Phase 4: Policy and Kubernetes Auth Role
vault policy write cert-manager-pki - <<EOF
path "pki_int/sign/lab-internal" { capabilities = ["create", "update"] }
path "pki_int/issue/lab-internal" { capabilities = ["create"] }
EOF
vault write auth/kubernetes/role/cert-manager \
  bound_service_account_names=cert-manager \
  bound_service_account_namespaces=cert-manager \
  audience="https://192.168.100.20:6443" \
  policies=cert-manager-pki ttl=1h
```
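The `lab-internal` role above only issues for `lab.daviestechlabs.io` and its subdomains. As a rough local pre-check before calling Vault, the same rule can be approximated in shell. This is a sketch only: `is_allowed_name` is a hypothetical helper, not part of Vault, and Vault itself remains the authority on what the role accepts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximates the lab-internal role's name rules
# (allowed_domains + allow_bare_domains + allow_subdomains). Not a Vault API.
ALLOWED_DOMAIN="lab.daviestechlabs.io"

is_allowed_name() {
  local name="$1"
  case "$name" in
    "$ALLOWED_DOMAIN")     return 0 ;;  # the bare domain itself
    ?*".$ALLOWED_DOMAIN")  return 0 ;;  # anything ending in .lab.daviestechlabs.io
    *)                     return 1 ;;  # everything else is rejected
  esac
}

# Example:
#   is_allowed_name "gravenhollow.lab.daviestechlabs.io" && echo "ok to request"
```

The check is purely a string match on the suffix, so it catches typos such as requesting a name under `daviestechlabs.io` directly, but it does not validate hostname syntax.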
### Phase 5: Kubernetes Manifests (GitOps)
New `vault-internal` ClusterIssuer replaces `selfsigned-internal`:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-internal
spec:
  vault:
    server: http://vault.security.svc:8200
    path: pki_int/sign/lab-internal
    auth:
      kubernetes:
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: cert-manager
```
The Envoy internal wildcard certificate is updated to use `vault-internal` as its issuer.
### Phase 6: Off-Cluster Cert Issuance
```bash
vault write -format=json pki_int/issue/lab-internal \
  common_name="gravenhollow.lab.daviestechlabs.io" \
  alt_names="gravenhollow.lab.daviestechlabs.io" \
  ttl=8760h
# The JSON response carries the cert, key, and chain under .data
# (fields: certificate, private_key, ca_chain); extract them from there.
```
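The extraction step can be sketched with `jq`. This assumes `jq` is installed and that the JSON from `vault write -format=json` was saved to a file; `split_issue_json` and the output file names (`tls.crt`, `tls.key`, `ca-chain.crt`, `fullchain.crt`) are illustrative choices, not something this ADR prescribes.

```bash
#!/usr/bin/env bash
# split_issue_json JSON_FILE OUT_DIR   (hypothetical helper)
# Splits a saved `vault write -format=json pki_int/issue/...` response
# into separate PEM files. Fails if any expected field is missing.
split_issue_json() {
  local json="$1" out="$2"
  (
    umask 077   # keep the private key readable by the owner only
    mkdir -p "$out" &&
    jq -er '.data.private_key' "$json" > "$out/tls.key" &&
    jq -er '.data.certificate' "$json" > "$out/tls.crt" &&
    jq -er '.data.ca_chain[]'  "$json" > "$out/ca-chain.crt" &&
    # Leaf first, then the CA chain: the order most servers expect.
    cat "$out/tls.crt" "$out/ca-chain.crt" > "$out/fullchain.crt"
  )
}

# Example (paths are placeholders):
#   vault write -format=json pki_int/issue/lab-internal \
#     common_name="gravenhollow.lab.daviestechlabs.io" ttl=8760h > /tmp/issue.json
#   split_issue_json /tmp/issue.json /tmp/gravenhollow-tls
```

Saving the response once and splitting it afterwards matters because each `vault write` to the `issue` endpoint mints a new certificate and key; calling it once per field would produce mismatched files.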
### Phase 7: CA Distribution
The root CA cert is distributed to:
- **Kubernetes pods**: via ConfigMap `homelab-ca-bundle` or cert-manager `cainjector`
- **waterdeep (macOS)**: `sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain homelab-root-ca.crt`
- **gravenhollow / candlekeep**: installed into OS trust store
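
Before relying on a distributed root, it can help to confirm that an issued certificate actually chains to it. A minimal sketch, assuming `openssl` is available on the machine doing the check; `verify_against_root` is a hypothetical helper, and every file name other than `homelab-root-ca.crt` is a placeholder.

```bash
#!/usr/bin/env bash
# verify_against_root ROOT_PEM LEAF_PEM [CHAIN_PEM]   (hypothetical helper)
# Succeeds only if LEAF_PEM validates against ROOT_PEM, optionally using
# CHAIN_PEM as untrusted intermediates to build the path.
verify_against_root() {
  local root="$1" leaf="$2" chain="${3:-}"
  if [ -n "$chain" ]; then
    openssl verify -CAfile "$root" -untrusted "$chain" "$leaf" >/dev/null 2>&1
  else
    openssl verify -CAfile "$root" "$leaf" >/dev/null 2>&1
  fi
}

# Example (assumes a Vault token that can read the pki mount):
#   vault read -field=certificate pki/cert/ca > homelab-root-ca.crt
#   verify_against_root homelab-root-ca.crt tls.crt ca-chain.crt \
#     && echo "chain OK" || echo "chain does NOT verify"
```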
## Certificate Inventory
| Service | Issuer | Renewal | Notes |
|---------|--------|---------|-------|
| `*.daviestechlabs.io` | `letsencrypt-production` | Auto (cert-manager) | Public, unchanged |
| `*.lab.daviestechlabs.io` | `vault-internal` | Auto (cert-manager) | Envoy internal gateway |
| `gravenhollow.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | RustFS S3, NFS |
| `candlekeep.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | QNAP NAS |
| `waterdeep.lab.daviestechlabs.io` | `vault-internal` (via CLI) | Manual/cron | Mac Mini dev workstation |
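
For the `Manual/cron` rows, a scheduled job could decide whether to reissue by checking the remaining lifetime first. This is a sketch under assumptions: `openssl` is installed, the 30-day threshold and the certificate path are illustrative, and `needs_renewal` is a hypothetical helper rather than anything Vault or cert-manager provides.

```bash
#!/usr/bin/env bash
# needs_renewal CERT_FILE [DAYS]   (hypothetical helper)
# Succeeds (exit 0) when the certificate is missing, unreadable,
# or expires within DAYS days (default 30).
needs_renewal() {
  local cert="$1" days="${2:-30}"
  [ -f "$cert" ] || return 0
  # -checkend exits 0 only if the cert is still valid N seconds from now,
  # so a non-zero exit means "renew".
  ! openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1
}

# Example cron-driven use (path is a placeholder):
#   if needs_renewal /etc/ssl/lab/tls.crt 30; then
#     vault write -format=json pki_int/issue/lab-internal \
#       common_name="gravenhollow.lab.daviestechlabs.io" ttl=8760h > /tmp/issue.json
#   fi
```

Checking before issuing keeps a daily cron job from minting a fresh certificate on every run, which would otherwise fill the intermediate's certificate store with unused entries.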
## Links
* [Vault PKI Secrets Engine](https://developer.hashicorp.com/vault/docs/secrets/pki)
* [cert-manager Vault Issuer](https://cert-manager.io/docs/configuration/vault/)
* Related: [ADR-0026](0026-storage-strategy.md) — Storage strategy (gravenhollow S3)
* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions
* Related: [ADR-0059](0059-mac-mini-ray-worker.md) — Mac Mini Ray worker