# Security Policy Enforcement

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster

## Context and Problem Statement

A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?

## Decision Drivers

* Defense in depth - multiple layers of security controls
* Visibility - understand security posture across all workloads
* Progressive enforcement - warn before blocking to avoid disruption
* Automation - minimize manual security auditing
* Talos compatibility - policies must work with immutable OS constraints

## Considered Options

1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
2. **Kyverno for policy + Trivy for scanning**
3. **Pod Security Standards (PSS) only**
4. **No enforcement, manual auditing**

## Decision Outcome

Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**

Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.

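
To make "policy-as-code using Rego" concrete, the following is a minimal sketch of a Gatekeeper `ConstraintTemplate`, modeled on the required-labels example in the Gatekeeper documentation. The kind `K8sRequiredLabels` matches a template named in this ADR, but the Rego body and the `labels` parameter schema shown here are assumptions for illustration, not the template actually deployed in the cluster.

```yaml
# Illustrative sketch only: the schema and Rego below are assumed,
# not copied from the cluster's real template.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Parameters a constraint may pass in (assumed shape)
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Report a violation when any required label is absent
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
```

A constraint of kind `K8sRequiredLabels` would then supply the concrete label list through `spec.parameters.labels`, which keeps the policy logic (template) separate from where and how strictly it is applied (constraint).
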
### Positive Consequences

* Policies are defined as code and version-controlled
* Violations are visible in Grafana dashboards
* Trivy provides continuous vulnerability scanning without CI/CD integration
* Gatekeeper's warn mode allows gradual policy rollout
* Both tools provide Prometheus metrics for alerting

### Negative Consequences

* Rego learning curve for custom policies
* Must maintain exclusion lists for system namespaces
* Trivy node-collector disabled on Talos (lacks systemd paths)

## Pros and Cons of the Options

### Option 1: Gatekeeper + Trivy (Chosen)

**Architecture:**

```
                   ┌─────────────────┐
                   │   Gatekeeper    │
                   │   (Admission)   │
                   └────────┬────────┘
                            │ Validates
                            ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   kubectl   │───►│   API Server    │───►│  Workloads  │
│    Flux     │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                   ┌─────────────────┐           │
                   │ Trivy Operator  │◄──────────┘
                   └────────┬────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │  Vulnerability  │
                   │ Reports (CRDs)  │
                   └─────────────────┘
```

* Good, because Gatekeeper is part of the CNCF-graduated Open Policy Agent project and is widely adopted
* Good, because Rego allows complex policy logic
* Good, because Trivy scans images, configs, RBAC, and secrets
* Good, because both provide Prometheus metrics and Grafana dashboards
* Bad, because Rego has a learning curve
* Bad, because Trivy node-collector is incompatible with Talos

### Option 2: Kyverno + Trivy

* Good, because Kyverno policies are YAML-based (easier to write)
* Good, because Kyverno can mutate resources (auto-fix)
* Bad, because Kyverno is less mature than Gatekeeper
* Bad, because mutation can cause unexpected behavior

### Option 3: Pod Security Standards Only

* Good, because built into Kubernetes (no additional components)
* Good, because simple namespace-level enforcement
* Bad, because limited to pod security only
* Bad, because no vulnerability scanning
* Bad, because no custom policy support

### Option 4: No Enforcement

* Good, because no operational overhead
* Bad, because no protection
against misconfigurations
* Bad, because no visibility into security posture
* Bad, because bad practice even for homelabs

## Implementation Details

### Gatekeeper Policies

**Constraint Templates (Rego-based):**

| Template | Purpose |
|----------|---------|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require app.kubernetes.io labels |
| `K8sContainerLimits` | Require resource limits |

**Constraints (Policy Instances):**

```yaml
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```

**Enforcement Progression:**

Gatekeeper's `enforcementAction` values, from least to most strict:

1. `dryrun` - Record violations in audit results only; nothing is blocked and the client sees no message
2. `warn` - Admit the resource but return a warning to the client; used for the initial rollout
3. `deny` - Block non-compliant resources

### Trivy Operator Configuration

```yaml
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true

  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false

  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
```

**Scan Reports (CRDs):**

- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets exposed in container images

### Grafana Dashboards

| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |

### Namespace Exclusions

System namespaces excluded from strict policies:

| Namespace | Reason |
|-----------|--------|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |

### Talos-Specific Considerations

Trivy's `node-collector` is disabled because Talos:

- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)

This is acceptable because Talos itself is security-hardened by design.

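
The namespace exclusion table above is currently maintained per constraint through `excludedNamespaces`. As a hedged alternative sketch, the same list could be declared once through Gatekeeper's cluster-wide `Config` resource. This is not what the cluster runs today; the choice of `processes: ["*"]` is an assumption that exempts these namespaces from every Gatekeeper process (audit, webhook, sync), which is broader than a per-constraint exclusion.

```yaml
# Illustrative alternative, not the deployed configuration.
# Gatekeeper expects this resource to be named "config" in its own namespace.
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - longhorn-system
        - gpu-operator
        - cilium-secrets
        - observability
      # Assumption: exempt from all processes, not only admission
      processes: ["*"]
```

The trade-off is granularity: a single `Config` removes duplication across constraints, while per-constraint lists allow a namespace to be exempt from one policy (for example, privileged containers) but still checked by another (for example, required labels).
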
## Alerting Strategy

**Prometheus Alerts:**

```yaml
- alert: HighSeverityVulnerability
  # trivy_image_vulnerabilities is the metric exposed by metricsFindingsEnabled;
  # trivy_vulnerability_id would additionally require metricsVulnIdEnabled.
  # Trivy Operator reports severity in title case ("Critical").
  expr: sum(trivy_image_vulnerabilities{severity="Critical"}) > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"

- alert: GatekeeperViolation
  # gatekeeper_violations is a gauge, so use delta() rather than increase()
  expr: delta(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
```

## Future Enhancements

1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. **Integrate Falco** for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility

## References

* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)