docs: Add ADRs for secrets management and security policy

- 0017: Secrets Management Strategy (SOPS + Vault + External Secrets)
- 0018: Security Policy Enforcement (Gatekeeper + Trivy)
2026-02-04 08:45:47 -05:00
parent 8f4df84657
commit a128c265e4
2 changed files with 436 additions and 0 deletions
--- a/decisions/0018-security-policy-enforcement.md
+++ b/decisions/0018-security-policy-enforcement.md
@@ -0,0 +1,239 @@
+# Security Policy Enforcement
+
+* Status: accepted
+* Date: 2026-02-04
+* Deciders: Billy
+* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster
+
+## Context and Problem Statement
+
+A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.
+
+How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?
+
+## Decision Drivers
+
+* Defense in depth - multiple layers of security controls
+* Visibility - understand security posture across all workloads
+* Progressive enforcement - warn before blocking to avoid disruption
+* Automation - minimize manual security auditing
+* Talos compatibility - policies must work with immutable OS constraints
+
+## Considered Options
+
+1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
+2. **Kyverno for policy + Trivy for scanning**
+3. **Pod Security Standards (PSS) only**
+4. **No enforcement, manual auditing**
+
+## Decision Outcome
+
+Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**
+
+Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.
+
+### Positive Consequences
+
+* Policies are defined as code and version-controlled
+* Violations are visible in Grafana dashboards
+* Trivy provides continuous vulnerability scanning without CI/CD integration
+* Gatekeeper's warn mode allows gradual policy rollout
+* Both tools provide Prometheus metrics for alerting
+
+### Negative Consequences
+
+* Rego learning curve for custom policies
+* Must maintain exclusion lists for system namespaces
+* Trivy node-collector disabled on Talos (lacks systemd paths)
+
+## Pros and Cons of the Options
+
+### Option 1: Gatekeeper + Trivy (Chosen)
+
+**Architecture:**
+```
+                    ┌─────────────────┐
+                    │  Gatekeeper     │
+                    │  (Admission)    │
+                    └────────┬────────┘
+                             │ Validates
+                             ▼
+┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
+│ kubectl     │───►│  API Server     │───►│ Workloads   │
+│ Flux        │    └─────────────────┘    └──────┬──────┘
+└─────────────┘                                  │
+                                                 │ Scans
+                    ┌─────────────────┐          │
+                    │ Trivy Operator  │◄─────────┘
+                    └────────┬────────┘
+                             │
+                             ▼
+                    ┌─────────────────┐
+                    │ Vulnerability   │
+                    │ Reports (CRDs)  │
+                    └─────────────────┘
+```
+
+* Good, because Gatekeeper is part of Open Policy Agent, a CNCF graduated project, and is widely adopted
+* Good, because Rego allows complex policy logic
+* Good, because Trivy scans images, configs, RBAC, and secrets
+* Good, because both provide Prometheus metrics and Grafana dashboards
+* Bad, because Rego has a learning curve
+* Bad, because Trivy node-collector is incompatible with Talos
+
+### Option 2: Kyverno + Trivy
+
+* Good, because Kyverno policies are YAML-based (easier to write)
+* Good, because Kyverno can mutate resources (auto-fix)
+* Bad, because Kyverno is less mature than Gatekeeper
+* Bad, because mutation can cause unexpected behavior
+
+### Option 3: Pod Security Standards Only
+
+* Good, because built into Kubernetes (no additional components)
+* Good, because simple namespace-level enforcement
+* Bad, because limited to pod security only
+* Bad, because no vulnerability scanning
+* Bad, because no custom policy support
+
+### Option 4: No Enforcement
+
+* Good, because no operational overhead
+* Bad, because no protection against misconfigurations
+* Bad, because no visibility into security posture
+* Bad, because it is poor practice even for a homelab
+
+## Implementation Details
+
+### Gatekeeper Policies
+
+**Constraint Templates (Rego-based):**
+
+| Template | Purpose |
+|----------|---------|
+| `K8sPSPPrivilegedContainer` | Block privileged containers |
+| `K8sRequiredLabels` | Require app.kubernetes.io labels |
+| `K8sContainerLimits` | Require resource limits |
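+
+For orientation, here is a minimal sketch of what a label-checking template can look like, modelled on the introductory example in the Gatekeeper documentation. The `labels` parameter (a plain list of label keys) and the Rego body are assumptions for illustration; the template deployed in this cluster may differ, for example if it comes from the gatekeeper-library.
+
+```yaml
+# Sketch only: parameter schema and Rego are illustrative assumptions.
+apiVersion: templates.gatekeeper.sh/v1
+kind: ConstraintTemplate
+metadata:
+  name: k8srequiredlabels
+spec:
+  crd:
+    spec:
+      names:
+        kind: K8sRequiredLabels
+      validation:
+        openAPIV3Schema:
+          type: object
+          properties:
+            labels:            # assumed parameter: list of required label keys
+              type: array
+              items:
+                type: string
+  targets:
+    - target: admission.k8s.gatekeeper.sh
+      rego: |
+        package k8srequiredlabels
+
+        # Report a violation when any required label key is absent.
+        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
+          provided := {label | input.review.object.metadata.labels[label]}
+          required := {label | label := input.parameters.labels[_]}
+          missing := required - provided
+          count(missing) > 0
+          msg := sprintf("you must provide labels: %v", [missing])
+        }
+```
+
+The template only defines the rule; nothing is checked until a constraint of kind `K8sRequiredLabels` is created.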
+
+**Constraints (Policy Instances):**
+
+```yaml
+# Deny privileged containers (warn mode)
+apiVersion: constraints.gatekeeper.sh/v1beta1
+kind: K8sPSPPrivilegedContainer
+metadata:
+  name: deny-privileged-containers
+spec:
+  enforcementAction: warn  # Start with warn, move to deny
+  match:
+    kinds:
+      - apiGroups: [""]
+        kinds: ["Pod"]
+    excludedNamespaces:
+      - kube-system
+      - gatekeeper-system
+      - cilium-secrets
+      - longhorn-system
+      - observability
+      - gpu-operator
+  parameters:
+    exemptImages:
+      - "quay.io/cilium/*"
+      - "ghcr.io/longhorn/*"
+      - "nvcr.io/nvidia/*"
+```
+
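+**Enforcement Progression (least to most strict):**
+1. `dryrun` - Record violations in audit results only; nothing is surfaced at admission time
+2. `warn` - Admit the resource but return a warning to the client (the setting used for initial rollout here)
+3. `deny` - Reject non-compliant resources at admission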
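+
+As an illustration of how a constraint moves through these stages, the sketch below introduces a label requirement in audit-only mode. The constraint name, the target kinds, and the `labels` parameter shape are hypothetical and assume a `K8sRequiredLabels` template that accepts a plain list of label keys.
+
+```yaml
+# Sketch only: name, kinds, and parameter shape are assumptions.
+apiVersion: constraints.gatekeeper.sh/v1beta1
+kind: K8sRequiredLabels
+metadata:
+  name: require-app-name-label   # hypothetical name
+spec:
+  enforcementAction: dryrun      # change to warn, then deny, as violations are cleaned up
+  match:
+    kinds:
+      - apiGroups: ["apps"]
+        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
+    excludedNamespaces:
+      - kube-system
+      - gatekeeper-system
+  parameters:
+    labels:
+      - app.kubernetes.io/name
+```
+
+Promoting the policy is then a one-line change to `enforcementAction`. Violations found by the audit controller are recorded under the constraint's `status.violations` field regardless of the enforcement action.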
+
+### Trivy Operator Configuration
+
+```yaml
+operator:
+  vulnerabilityScannerEnabled: true
+  configAuditScannerEnabled: true
+  rbacAssessmentScannerEnabled: true
+  exposedSecretScannerEnabled: true
+  clusterComplianceEnabled: true
+
+  # Disabled for Talos (no systemd, no /var/lib/kubelet)
+  infraAssessmentScannerEnabled: false
+
+  # Metrics for Prometheus (the chart reads these keys under `operator:` as well)
+  metricsFindingsEnabled: true
+  metricsConfigAuditInfo: true
+```
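+
+Since the cluster is reconciled by Flux, values like these would typically be delivered through a `HelmRelease`. The following is a minimal sketch under that assumption; the release name, the namespaces, the interval, and the `HelmRepository` name are hypothetical and not taken from this repository.
+
+```yaml
+# Sketch only: metadata, interval, and sourceRef names are assumptions.
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: trivy-operator
+  namespace: trivy-system        # hypothetical namespace
+spec:
+  interval: 30m
+  chart:
+    spec:
+      chart: trivy-operator
+      sourceRef:
+        kind: HelmRepository
+        name: aqua               # hypothetical repository object
+        namespace: flux-system
+  values:
+    operator:
+      # Talos has no systemd and no standard kubelet path
+      infraAssessmentScannerEnabled: false
+```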
+
+**Scan Reports (CRDs):**
+- `VulnerabilityReport` - CVEs in container images
+- `ConfigAuditReport` - Kubernetes misconfigurations
+- `RbacAssessmentReport` - RBAC privilege issues
+- `ExposedSecretReport` - Secrets embedded in container images
+
+### Grafana Dashboards
+
+| Dashboard | Source |
+|-----------|--------|
+| Gatekeeper Overview | Grafana ID 15763 |
+| Gatekeeper Violations | Grafana ID 14828 |
+| Trivy Vulnerabilities | Grafana ID 17813 |
+| Trivy Image Scan | Custom |
+
+### Namespace Exclusions
+
+System namespaces excluded from strict policies:
+
+| Namespace | Reason |
+|-----------|--------|
+| `kube-system` | Core Kubernetes components |
+| `gatekeeper-system` | Gatekeeper itself |
+| `longhorn-system` | Storage requires privileges |
+| `gpu-operator` | GPU drivers require privileges |
+| `cilium-secrets` | CNI requires host networking |
+| `observability` | Some collectors need host access |
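+
+Repeating `excludedNamespaces` on every constraint is one option. Another is Gatekeeper's cluster-wide `Config` resource, sketched below. This ADR does not state which mechanism is used, so treat the namespace list and the choice of `processes` as assumptions.
+
+```yaml
+# Sketch only: whether this cluster uses a global Config is an assumption.
+apiVersion: config.gatekeeper.sh/v1alpha1
+kind: Config
+metadata:
+  name: config                   # Gatekeeper only reads a Config with this name
+  namespace: gatekeeper-system
+spec:
+  match:
+    - excludedNamespaces:
+        - kube-system
+        - gatekeeper-system
+        - longhorn-system
+        - gpu-operator
+      processes: ["*"]           # exempt from audit, webhook, and sync alike
+```
+
+A global exemption is coarser than per-constraint exclusions: it switches off every policy for the listed namespaces, so per-constraint lists remain the better fit where only some policies need an exception.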
+
+### Talos-Specific Considerations
+
+Trivy's `node-collector` is disabled because Talos:
+- Has no `/etc/systemd` (uses custom init)
+- Has no standard `/var/lib/kubelet` path
+- Is immutable (read-only root filesystem)
+
+This is acceptable because Talos itself is security-hardened by design.
+
+## Alerting Strategy
+
+**Prometheus Alerts:**
+```yaml
+- alert: HighSeverityVulnerability
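+  # Uses the findings metric enabled by metricsFindingsEnabled; severity values are capitalized ("Critical")
+  expr: trivy_image_vulnerabilities{severity="Critical"} > 0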
+  for: 1h
+  labels:
+    severity: warning
+  annotations:
+    summary: "Critical vulnerability detected"
+
+- alert: GatekeeperViolation
+  # gatekeeper_violations is a gauge, so use delta() rather than increase()
+  expr: delta(gatekeeper_violations[1h]) > 0
+  for: 5m
+  labels:
+    severity: info
+  annotations:
+    summary: "Policy violation detected"
+```
+
+## Future Enhancements
+
+1. **Move to `deny` enforcement** once baseline violations are resolved
+2. **Add network policies** via Cilium for workload isolation
+3. **Integrate Falco** for runtime threat detection
+4. **Add SBOM generation** with Trivy for supply chain visibility
+
+## References
+
+* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
+* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
+* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)