From a128c265e483510875951eb05475204054b2018b Mon Sep 17 00:00:00 2001
From: "Billy D."
Date: Wed, 4 Feb 2026 08:45:47 -0500
Subject: [PATCH] docs: Add ADRs for secrets management and security policy

- 0017: Secrets Management Strategy (SOPS + Vault + External Secrets)
- 0018: Security Policy Enforcement (Gatekeeper + Trivy)
---
 decisions/0017-secrets-management-strategy.md | 197 +++++++++++++++
 decisions/0018-security-policy-enforcement.md | 239 ++++++++++++++++++
 2 files changed, 436 insertions(+)
 create mode 100644 decisions/0017-secrets-management-strategy.md
 create mode 100644 decisions/0018-security-policy-enforcement.md

diff --git a/decisions/0017-secrets-management-strategy.md b/decisions/0017-secrets-management-strategy.md
new file mode 100644
index 0000000..3822600
--- /dev/null
+++ b/decisions/0017-secrets-management-strategy.md
@@ -0,0 +1,197 @@
+# Secrets Management Strategy
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Establish a secure, GitOps-compatible secrets management approach for the homelab
+
+## Context and Problem Statement
+
+Managing secrets in a Kubernetes environment presents challenges: secrets must be available to applications, versionable in Git for GitOps, yet never exposed in plain text in repositories. The homelab needs a solution that balances security with operational simplicity.
+
+How do we manage secrets securely while maintaining GitOps principles and enabling applications to access credentials at runtime?
+
+## Decision Drivers
+
+* GitOps compatibility - secrets must be manageable through Git workflows
+* Security - no plain text secrets in repositories or logs
+* Operational simplicity - minimize manual secret rotation burden
+* Application integration - secrets must be consumable by workloads
+* Disaster recovery - ability to restore secrets from backups
+
+## Considered Options
+
+1. **SOPS + Age for bootstrap, Vault + External Secrets for runtime**
+2. **Sealed Secrets only**
+3. **Vault only (with Vault Agent Injector)**
+4. **SOPS only for everything**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - SOPS + Age for bootstrap, Vault + External Secrets for runtime**
+
+This hybrid approach uses SOPS with Age encryption for bootstrap secrets that must exist before the cluster is fully operational, and HashiCorp Vault with External Secrets Operator for runtime secrets that applications consume.
+
+### Positive Consequences
+
+* Bootstrap secrets can be committed to Git safely (encrypted with Age)
+* Vault provides centralized secret management with audit logging
+* External Secrets Operator enables declarative secret sync from Vault
+* Clear separation between infrastructure secrets (SOPS) and application secrets (Vault)
+* Secrets are automatically synced and refreshed
+
+### Negative Consequences
+
+* Two systems to understand and maintain
+* Initial Vault setup requires manual unsealing (or auto-unseal configuration)
+* Age key must be securely backed up outside the cluster
+
+## Pros and Cons of the Options
+
+### Option 1: SOPS + Age for Bootstrap, Vault + External Secrets for Runtime (Chosen)
+
+**Architecture:**
+```
+Bootstrap Secrets (Git-encrypted):
+  .sops.yaml ──► age encryption ──► *.sops.yaml files
+                                          │
+                                          ▼
+                                Flux SOPS decryption
+                                          │
+                                          ▼
+                                 Kubernetes Secrets
+
+Runtime Secrets (Vault-managed):
+  Vault KV Store ◄── Manual/API ──► ExternalSecret CR
+                                          │
+                                          ▼
+                              External Secrets Operator
+                                          │
+                                          ▼
+                                 Kubernetes Secrets
+```
+
+* Good, because bootstrap secrets (Flux, cert-manager, Cloudflare) are encrypted in Git
+* Good, because Vault provides audit trail and dynamic secret generation
+* Good, because External Secrets syncs secrets declaratively (GitOps-friendly)
+* Good, because secrets can be rotated in Vault without Git commits
+* Bad, because two systems add operational complexity
+* Bad, because Vault requires storage (Raft) and HA consideration
+
+### Option 2: Sealed Secrets Only
+
+* Good, because single tool to manage
+* Good, because native Kubernetes integration
+* Bad, because secrets are cluster-specific (can't reuse across clusters)
+* Bad, because no central secret management or audit logging
+* Bad, because no support for dynamic secrets
+
+### Option 3: Vault Only with Agent Injector
+
+* Good, because single source of truth
+* Good, because supports dynamic secrets and leases
+* Bad, because requires sidecar injection (resource overhead)
+* Bad, because of the bootstrap problem: how does Vault authenticate before secrets exist?
+* Bad, because more complex application integration
+
+### Option 4: SOPS Only
+
+* Good, because simple - everything encrypted in Git
+* Good, because no external dependencies at runtime
+* Bad, because keeping all secrets in Git (even encrypted) is risky for large secrets
+* Bad, because secret rotation requires Git commits
+* Bad, because no audit logging
+
+## Implementation Details
+
+### SOPS Configuration
+
+`.sops.yaml` at repository root:
+```yaml
+creation_rules:
+  - path_regex: talos/.*\.sops\.ya?ml
+    age: age1... # Talos-specific key
+  - path_regex: (bootstrap|kubernetes)/.*\.sops\.ya?ml
+    age: age1... # Cluster key
+```
+
+**Bootstrap secrets encrypted with SOPS:**
+- `bootstrap/sops-age.sops.yaml` - Age private key for Flux
+- `bootstrap/github-deploy-key.sops.yaml` - Git repository access
+- `talos/talsecret.sops.yaml` - Talos machine secrets
+
+### Vault Configuration
+
+**Deployment:** HA mode with 3 replicas, Raft storage on Longhorn
+
+```yaml
+# HelmRelease values
+server:
+  ha:
+    enabled: true
+    replicas: 3
+    raft:
+      enabled: true
+  dataStorage:
+    storageClass: longhorn
+    size: 2Gi
+```
+
+**Kubernetes Auth:** External Secrets authenticates via ServiceAccount
+
+```yaml
+# ClusterSecretStore
+spec:
+  provider:
+    vault:
+      server: "http://vault.security.svc:8200"
+      path: "kv"
+      version: "v2"
+      auth:
+        kubernetes:
+          mountPath: "kubernetes"
+          role: "external-secrets"
+```
+
+### External Secrets Usage Pattern
+
+```yaml
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: app-credentials
+spec:
+  refreshInterval: 1h
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: vault
+  target:
+    name: app-credentials
+  data:
+    - secretKey: password
+      remoteRef:
+        key: kv/data/myapp
+        property: password
+```
+
+### Secret Categories
+
+| Category | Storage | Examples |
+|----------|---------|----------|
+| Bootstrap | SOPS + Age | Age keys, deploy keys, Talos secrets |
+| Infrastructure | Vault | Database credentials, API tokens |
+| Application | Vault | Service accounts, OAuth secrets |
+| Certificates | cert-manager | TLS certs (auto-generated) |
+
+## Disaster Recovery
+
+1. **Age private key** - Stored securely outside cluster (password manager, hardware key)
+2. **Vault data** - Backed up via Longhorn snapshots
+3. **Unseal keys** - Stored securely outside cluster (Shamir shares distributed)
+
+## References
+
+* [SOPS Documentation](https://github.com/getsops/sops)
+* [Age Encryption](https://github.com/FiloSottile/age)
+* [External Secrets Operator](https://external-secrets.io/)
+* [HashiCorp Vault](https://www.vaultproject.io/)
diff --git a/decisions/0018-security-policy-enforcement.md b/decisions/0018-security-policy-enforcement.md
new file mode 100644
index 0000000..7bfecdd
--- /dev/null
+++ b/decisions/0018-security-policy-enforcement.md
@@ -0,0 +1,239 @@
+# Security Policy Enforcement
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster
+
+## Context and Problem Statement
+
+A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.
+
+How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?
+
+## Decision Drivers
+
+* Defense in depth - multiple layers of security controls
+* Visibility - understand security posture across all workloads
+* Progressive enforcement - warn before blocking to avoid disruption
+* Automation - minimize manual security auditing
+* Talos compatibility - policies must work with immutable OS constraints
+
+## Considered Options
+
+1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
+2. **Kyverno for policy + Trivy for scanning**
+3. **Pod Security Standards (PSS) only**
+4. **No enforcement, manual auditing**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**
+
+Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.
+
+### Positive Consequences
+
+* Policies are defined as code and version-controlled
+* Violations are visible in Grafana dashboards
+* Trivy provides continuous vulnerability scanning without CI/CD integration
+* Gatekeeper's warn mode allows gradual policy rollout
+* Both tools provide Prometheus metrics for alerting
+
+### Negative Consequences
+
+* Rego learning curve for custom policies
+* Must maintain exclusion lists for system namespaces
+* Trivy node-collector is disabled on Talos (which lacks systemd paths)
+
+## Pros and Cons of the Options
+
+### Option 1: Gatekeeper + Trivy (Chosen)
+
+**Architecture:**
+```
+                   ┌─────────────────┐
+                   │   Gatekeeper    │
+                   │   (Admission)   │
+                   └────────┬────────┘
+                            │ Validates
+                            ▼
+┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
+│   kubectl   │───►│   API Server    │───►│  Workloads  │
+│    Flux     │    └─────────────────┘    └──────┬──────┘
+└─────────────┘                                  │
+                                                 │ Scans
+                   ┌─────────────────┐           │
+                   │ Trivy Operator  │◄──────────┘
+                   └────────┬────────┘
+                            │
+                            ▼
+                   ┌─────────────────┐
+                   │  Vulnerability  │
+                   │ Reports (CRDs)  │
+                   └─────────────────┘
+```
+
+* Good, because Gatekeeper is part of the CNCF-graduated OPA project and widely adopted
+* Good, because Rego allows complex policy logic
+* Good, because Trivy scans images, configs, RBAC, and secrets
+* Good, because both provide Prometheus metrics and Grafana dashboards
+* Bad, because Rego has a learning curve
+* Bad, because Trivy node-collector is incompatible with Talos
+
+### Option 2: Kyverno + Trivy
+
+* Good, because Kyverno policies are YAML-based (easier to write)
+* Good, because Kyverno can mutate resources (auto-fix)
+* Bad, because Kyverno is less mature than Gatekeeper
+* Bad, because mutation can cause unexpected behavior
+
+### Option 3: Pod Security Standards Only
+
+* Good, because built into Kubernetes (no additional components)
+* Good, because simple namespace-level enforcement
+* Bad, because limited to pod security only
+* Bad, because no vulnerability scanning
+* Bad, because no custom policy support
+
+### Option 4: No Enforcement
+
+* Good, because no operational overhead
+* Bad, because no protection against misconfigurations
+* Bad, because no visibility into security posture
+* Bad, because it is poor practice even for homelabs
+
+## Implementation Details
+
+### Gatekeeper Policies
+
+**Constraint Templates (Rego-based):**
+
+| Template | Purpose |
+|----------|---------|
+| `K8sPSPPrivilegedContainer` | Block privileged containers |
+| `K8sRequiredLabels` | Require app.kubernetes.io labels |
+| `K8sContainerLimits` | Require resource limits |
+
+**Constraints (Policy Instances):**
+
+```yaml
+# Deny privileged containers (warn mode)
+apiVersion: constraints.gatekeeper.sh/v1beta1
+kind: K8sPSPPrivilegedContainer
+metadata:
+  name: deny-privileged-containers
+spec:
+  enforcementAction: warn # Start with warn, move to deny
+  match:
+    kinds:
+      - apiGroups: [""]
+        kinds: ["Pod"]
+    excludedNamespaces:
+      - kube-system
+      - gatekeeper-system
+      - cilium-secrets
+      - longhorn-system
+      - observability
+      - gpu-operator
+  parameters:
+    exemptImages:
+      - "quay.io/cilium/*"
+      - "ghcr.io/longhorn/*"
+      - "nvcr.io/nvidia/*"
+```
+
+**Enforcement Progression:**
+1. `dryrun` - Record violations in audit results only; nothing is blocked and clients see no message
+2. `warn` - Admit the resource but return a warning to the client (initial rollout)
+3. `deny` - Block non-compliant resources
+
+### Trivy Operator Configuration
+
+```yaml
+operator:
+  vulnerabilityScannerEnabled: true
+  configAuditScannerEnabled: true
+  rbacAssessmentScannerEnabled: true
+  exposedSecretScannerEnabled: true
+  clusterComplianceEnabled: true
+
+  # Disabled for Talos (no systemd, no /var/lib/kubelet)
+  infraAssessmentScannerEnabled: false
+
+# Metrics for Prometheus
+metricsFindingsEnabled: true
+metricsConfigAuditInfo: true
+```
+
+**Scan Reports (CRDs):**
+- `VulnerabilityReport` - CVEs in container images
+- `ConfigAuditReport` - Kubernetes misconfigurations
+- `RbacAssessmentReport` - RBAC privilege issues
+- `ExposedSecretReport` - Secrets (keys, tokens, passwords) found in container image files
+
+### Grafana Dashboards
+
+| Dashboard | Source |
+|-----------|--------|
+| Gatekeeper Overview | Grafana ID 15763 |
+| Gatekeeper Violations | Grafana ID 14828 |
+| Trivy Vulnerabilities | Grafana ID 17813 |
+| Trivy Image Scan | Custom |
+
+### Namespace Exclusions
+
+System namespaces excluded from strict policies:
+
+| Namespace | Reason |
+|-----------|--------|
+| `kube-system` | Core Kubernetes components |
+| `gatekeeper-system` | Gatekeeper itself |
+| `longhorn-system` | Storage requires privileges |
+| `gpu-operator` | GPU drivers require privileges |
+| `cilium-secrets` | CNI requires host networking |
+| `observability` | Some collectors need host access |
+
+### Talos-Specific Considerations
+
+Trivy's `node-collector` is disabled because Talos:
+- Has no `/etc/systemd` (uses custom init)
+- Has no standard `/var/lib/kubelet` path
+- Is immutable (read-only root filesystem)
+
+This is acceptable because Talos itself is security-hardened by design.
+
+## Alerting Strategy
+
+**Prometheus Alerts:**
+```yaml
+- alert: HighSeverityVulnerability
+  expr: trivy_image_vulnerabilities{severity="Critical"} > 0 # per-image count exposed by metricsFindingsEnabled
+  for: 1h
+  labels:
+    severity: warning
+  annotations:
+    summary: "Critical vulnerability detected"
+
+- alert: GatekeeperViolation
+  expr: delta(gatekeeper_violations[1h]) > 0 # gauge, so delta() rather than increase()
+  for: 5m
+  labels:
+    severity: info
+  annotations:
+    summary: "Policy violation detected"
+```
+
+## Future Enhancements
+
+1. **Move to `deny` enforcement** once baseline violations are resolved
+2. **Add network policies** via Cilium for workload isolation
+3. **Integrate Falco** for runtime threat detection
+4. **Add SBOM generation** with Trivy for supply chain visibility
+
+## References
+
+* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
+* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
+* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)