homelab-design/decisions/0018-security-policy-enforcement.md
Billy D. 1bc602b726
docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy
- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
2026-02-09 18:20:13 -05:00


Security Policy Enforcement

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster

Context and Problem Statement

A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab, security best practices guard against accidental mistakes and offer hands-on practice with production-grade security tooling.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?

Decision Drivers

  • Defense in depth - multiple layers of security controls
  • Visibility - understand security posture across all workloads
  • Progressive enforcement - warn before blocking to avoid disruption
  • Automation - minimize manual security auditing
  • Talos compatibility - policies must work with immutable OS constraints

Considered Options

  1. Gatekeeper (OPA) for policy + Trivy Operator for scanning
  2. Kyverno for policy + Trivy for scanning
  3. Pod Security Standards (PSS) only
  4. No enforcement, manual auditing

Decision Outcome

Chosen option: Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning

Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.

Positive Consequences

  • Policies are defined as code and version-controlled
  • Violations are visible in Grafana dashboards
  • Trivy provides continuous vulnerability scanning without CI/CD integration
  • Gatekeeper's warn mode allows gradual policy rollout
  • Both tools provide Prometheus metrics for alerting

Negative Consequences

  • Rego learning curve for custom policies
  • Must maintain exclusion lists for system namespaces
  • Trivy node-collector disabled on Talos (lacks systemd paths)

Pros and Cons of the Options

Option 1: Gatekeeper + Trivy (Chosen)

Architecture:

                    ┌─────────────────┐
                    │  Gatekeeper     │
                    │  (Admission)    │
                    └────────┬────────┘
                             │ Validates
                             ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│ kubectl     │───►│  API Server     │───►│ Workloads   │
│ Flux        │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                    ┌─────────────────┐          │
                    │ Trivy Operator  │◄─────────┘
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ Vulnerability   │
                    │ Reports (CRDs)  │
                    └─────────────────┘
  • Good, because Gatekeeper builds on OPA, a CNCF graduated project, and is widely adopted
  • Good, because Rego allows complex policy logic
  • Good, because Trivy scans images, configs, RBAC, and secrets
  • Good, because both provide Prometheus metrics and Grafana dashboards
  • Bad, because Rego has a learning curve
  • Bad, because the Trivy node-collector is incompatible with Talos

Option 2: Kyverno + Trivy

  • Good, because Kyverno policies are YAML-based (easier to write)
  • Good, because Kyverno can mutate resources (auto-fix)
  • Bad, because Kyverno is less mature than Gatekeeper
  • Bad, because mutation can cause unexpected behavior
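
For comparison with the Rego-based approach, a Kyverno policy that flags privileged containers could look roughly like the sketch below. It follows the pattern style of the public Kyverno policy library and assumes a Kyverno release that still accepts `spec.validationFailureAction`; it is illustrative only and was not deployed as part of this decision.

```yaml
# Sketch only: Kyverno equivalent of a "no privileged containers" rule.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Audit   # report violations without blocking
  background: true                 # also evaluate existing resources
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged mode is disallowed."
        pattern:
          spec:
            # =(key) means "if the key is present, it must match"
            =(ephemeralContainers):
              - =(securityContext):
                  =(privileged): "false"
            =(initContainers):
              - =(securityContext):
                  =(privileged): "false"
            containers:
              - =(securityContext):
                  =(privileged): "false"
```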

Option 3: Pod Security Standards Only

  • Good, because built into Kubernetes (no additional components)
  • Good, because simple namespace-level enforcement
  • Bad, because limited to pod security only
  • Bad, because no vulnerability scanning
  • Bad, because no custom policy support
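
As an illustration of the namespace-level enforcement this option offers, Pod Security Standards are applied through labels on the namespace. The namespace name `example-apps` is hypothetical, and the chosen levels are assumptions for the sketch rather than settings used in this cluster.

```yaml
# Sketch only: Pod Security Admission labels on a hypothetical namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: example-apps
  labels:
    # Reject Pods that violate the "baseline" profile
    pod-security.kubernetes.io/enforce: baseline
    # Warn clients and record audit annotations against the stricter profile
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The labels cover Pod-level settings only, which is why this option cannot express rules such as required labels or resource limits.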

Option 4: No Enforcement

  • Good, because no operational overhead
  • Bad, because no protection against misconfigurations
  • Bad, because no visibility into security posture
  • Bad, because skipping enforcement entirely is poor practice even in a homelab

Implementation Details

Gatekeeper Policies

Constraint Templates (Rego-based):

| Template | Purpose |
|----------|---------|
| K8sPSPPrivilegedContainer | Block privileged containers |
| K8sRequiredLabels | Require app.kubernetes.io labels |
| K8sContainerLimits | Require resource limits |
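
To show what a Rego-based template looks like, here is a minimal sketch of a K8sRequiredLabels ConstraintTemplate. It follows the simplified example from the Gatekeeper documentation, where `parameters.labels` is a plain list of label keys; the template actually deployed here may use a richer schema (see ADR-0040).

```yaml
# Sketch only: simplified required-labels template, not necessarily the deployed one.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels   # must be the lowercase form of the kind below
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Emit a violation when any required label key is missing.
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

A constraint of kind K8sRequiredLabels would then pass the required keys, for example `app.kubernetes.io/name`, under `spec.parameters.labels`.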

Constraints (Policy Instances):

# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"

Enforcement Progression:

  1. warn - Admit the resource but return a warning to the client (initial rollout)
  2. dryrun - Admit silently; violations appear only in audit results and metrics
  3. deny - Reject non-compliant resources at admission
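
As a sketch of the final step, the constraint shown earlier would change only in its `enforcementAction`; the match rules and exemptions stay the same.

```yaml
# Sketch only: the deny-privileged-containers constraint at the final stage.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: deny  # was "warn"; violating Pods are now rejected
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```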

Trivy Operator Configuration

operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true

  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false

  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true

Scan Reports (CRDs):

  • VulnerabilityReport - CVEs in container images
  • ConfigAuditReport - Kubernetes misconfigurations
  • RbacAssessmentReport - RBAC privilege issues
  • ExposedSecretReport - Secrets (keys, tokens, passwords) found inside container images

Grafana Dashboards

| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |

Namespace Exclusions

System namespaces excluded from strict policies:

| Namespace | Reason |
|-----------|--------|
| kube-system | Core Kubernetes components |
| gatekeeper-system | Gatekeeper itself |
| longhorn-system | Storage requires privileges |
| gpu-operator | GPU drivers require privileges |
| cilium-secrets | CNI requires host networking |
| observability | Some collectors need host access |
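
The constraint above excludes these namespaces per policy via `excludedNamespaces`. Gatekeeper also supports a cluster-wide exemption through its `Config` resource; the sketch below shows that alternative under the assumption that a global exemption is wanted, which this ADR does not state.

```yaml
# Sketch only: cluster-wide namespace exemption (an alternative to per-constraint lists).
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config                 # Gatekeeper only reads a Config with this name
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - longhorn-system
        - gpu-operator
        - cilium-secrets
        - observability
      processes: ["*"]         # exempt from audit, webhook, and sync
```

Per-constraint lists keep each exemption visible next to the policy it weakens, while the global form is shorter but applies to every constraint.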

Talos-Specific Considerations

Trivy's node-collector is disabled because Talos:

  • Has no /etc/systemd (uses custom init)
  • Has no standard /var/lib/kubelet path
  • Is immutable (read-only root filesystem)

This is acceptable because Talos itself is security-hardened by design.

Alerting Strategy

Prometheus Alerts:

- alert: HighSeverityVulnerability
  expr: trivy_vulnerability_id{severity="CRITICAL"} > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"

- alert: GatekeeperViolation
  expr: delta(gatekeeper_violations[1h]) > 0  # gauge metric, so measure change with delta()
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
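
If the cluster runs the Prometheus Operator, rules like these are typically delivered as a PrometheusRule resource. The sketch below wraps the first alert above unchanged; the resource name, namespace, group name, and the need for any selector label are assumptions about the monitoring stack.

```yaml
# Sketch only: packaging an alert for the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts   # hypothetical name
  namespace: observability       # assumed to be where Prometheus runs
  # Some installations only load rules carrying a specific label
  # (a ruleSelector on the Prometheus resource); add it here if required.
spec:
  groups:
    - name: security-policy      # hypothetical group name
      rules:
        - alert: HighSeverityVulnerability
          expr: trivy_vulnerability_id{severity="CRITICAL"} > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Critical vulnerability detected"
```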

Future Enhancements

  1. Move to deny enforcement once baseline violations are resolved
  2. Add network policies via Cilium for workload isolation
  3. Falco runtime threat detection: implemented, see ADR-0041
  4. Add SBOM generation with Trivy for supply chain visibility

Detailed Component ADRs

| Component | ADR | Purpose |
|-----------|-----|---------|
| Gatekeeper | ADR-0040 | Policy templates, constraints, enforcement progression |
| Falco | ADR-0041 | Runtime threat detection, eBPF driver, Falcosidekick |
| Trivy Operator | ADR-0042 | Vulnerability scanning, compliance reports, Talos adaptations |

References