Security Policy Enforcement
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster
Context and Problem Statement
A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.
How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?
Decision Drivers
- Defense in depth - multiple layers of security controls
- Visibility - understand security posture across all workloads
- Progressive enforcement - warn before blocking to avoid disruption
- Automation - minimize manual security auditing
- Talos compatibility - policies must work with immutable OS constraints
Considered Options
- Gatekeeper (OPA) for policy + Trivy Operator for scanning
- Kyverno for policy + Trivy for scanning
- Pod Security Standards (PSS) only
- No enforcement, manual auditing
Decision Outcome
Chosen option: Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning
Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.
Positive Consequences
- Policies are defined as code and version-controlled
- Violations are visible in Grafana dashboards
- Trivy provides continuous vulnerability scanning without CI/CD integration
- Gatekeeper's warn mode allows gradual policy rollout
- Both tools provide Prometheus metrics for alerting
Negative Consequences
- Rego learning curve for custom policies
- Must maintain exclusion lists for system namespaces
- Trivy node-collector disabled on Talos (lacks systemd paths)
Pros and Cons of the Options
Option 1: Gatekeeper + Trivy (Chosen)
Architecture:
                   ┌─────────────────┐
                   │   Gatekeeper    │
                   │   (Admission)   │
                   └────────┬────────┘
                            │ Validates
                            ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   kubectl   │───►│   API Server    │───►│  Workloads  │
│    Flux     │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                   ┌─────────────────┐           │
                   │ Trivy Operator  │◄──────────┘
                   └────────┬────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │  Vulnerability  │
                   │ Reports (CRDs)  │
                   └─────────────────┘
- Good, because Gatekeeper builds on OPA, a CNCF graduated project, and is widely adopted
- Good, because Rego allows complex policy logic
- Good, because Trivy scans images, configs, RBAC, and secrets
- Good, because both provide Prometheus metrics and Grafana dashboards
- Bad, because Rego has a learning curve
- Bad, because Trivy's node-collector is incompatible with Talos
Option 2: Kyverno + Trivy
- Good, because Kyverno policies are YAML-based (easier to write)
- Good, because Kyverno can mutate resources (auto-fix)
- Bad, because Kyverno is less mature than Gatekeeper
- Bad, because mutation can cause unexpected behavior
Option 3: Pod Security Standards Only
- Good, because built into Kubernetes (no additional components)
- Good, because simple namespace-level enforcement
- Bad, because limited to pod security only
- Bad, because no vulnerability scanning
- Bad, because no custom policy support
Option 4: No Enforcement
- Good, because no operational overhead
- Bad, because no protection against misconfigurations
- Bad, because no visibility into security posture
- Bad, because skipping enforcement is poor practice even in a homelab
Implementation Details
Gatekeeper Policies
Constraint Templates (Rego-based):
| Template | Purpose |
|---|---|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require `app.kubernetes.io` labels |
| `K8sContainerLimits` | Require resource limits |
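
To make the template/constraint split concrete, here is a minimal sketch of what a `K8sRequiredLabels` template can look like. It follows the well-known example from the Gatekeeper documentation and is not necessarily the template deployed in this cluster; the `labels` parameter name and the violation message are assumptions.

```yaml
# Sketch only: modeled on the upstream Gatekeeper example, not this repo's template.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Schema for the constraint's `parameters` block (assumed shape)
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Report a violation when any required label is missing from the object
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

The template only defines the rule and its parameter schema. Nothing is enforced until a constraint of kind `K8sRequiredLabels` is created.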
Constraints (Policy Instances):
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
Enforcement Progression:
1. `warn` - Log violations, don't block (initial rollout)
2. `dryrun` - Audit mode, visible in reports
3. `deny` - Block non-compliant resources
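
Moving a policy between stages is a one-line change to `enforcementAction`. The sketch below shows a label constraint in the `dryrun` stage. The constraint name, the matched kinds, and the `labels` parameter are hypothetical and assume a `K8sRequiredLabels` template that accepts a list of label keys.

```yaml
# Sketch only: names and parameters are illustrative assumptions.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels        # hypothetical name
spec:
  enforcementAction: dryrun       # switch to warn or deny to change the stage
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels:                       # assumed parameter of the template
      - app.kubernetes.io/name
```

Because the stage lives in the constraint rather than the template, each policy can advance on its own schedule.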
Trivy Operator Configuration
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true
  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false
  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
Scan Reports (CRDs):
- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets exposed in container images
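
Since the cluster is managed with Flux, the operator values above would typically be passed through a `HelmRelease`. The following is a minimal sketch under that assumption. The namespace, the `HelmRepository` name, and the interval are hypothetical, and only a subset of the values is repeated.

```yaml
# Sketch only: namespace, source name, and interval are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: trivy-operator
  namespace: trivy-system          # hypothetical namespace
spec:
  interval: 1h
  chart:
    spec:
      chart: trivy-operator
      sourceRef:
        kind: HelmRepository
        name: aqua                 # hypothetical HelmRepository for the Aqua chart repo
        namespace: flux-system
  values:
    operator:
      vulnerabilityScannerEnabled: true
      # Disabled on Talos, as described in this ADR
      infraAssessmentScannerEnabled: false
      metricsFindingsEnabled: true
```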
Grafana Dashboards
| Dashboard | Source |
|---|---|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |
Namespace Exclusions
System namespaces excluded from strict policies:
| Namespace | Reason |
|---|---|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |
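
The exclusions in this ADR are declared per constraint through `excludedNamespaces`. As an alternative sketch, Gatekeeper also supports a cluster-wide exemption through its `Config` resource, which avoids repeating the list in every constraint. Whether this cluster uses it is an assumption, and the namespaces and processes shown are illustrative.

```yaml
# Sketch only: a global exemption, not necessarily what this cluster deploys.
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config                     # Gatekeeper expects this exact name
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
      processes: ["*"]             # exempt from audit, webhook, and sync alike
```

The trade-off is granularity: a global exemption hides these namespaces from every policy, while per-constraint lists let a namespace be exempt from one policy and still subject to others.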
Talos-Specific Considerations
Trivy's node-collector is disabled because Talos:
- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)
This is acceptable because Talos is hardened by design: it is minimal, immutable, and managed through an API rather than SSH or a shell.
Alerting Strategy
Prometheus Alerts:
- alert: HighSeverityVulnerability
  # Uses the findings metric enabled by metricsFindingsEnabled; severity values are capitalized
  expr: trivy_image_vulnerabilities{severity="Critical"} > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"
- alert: GatekeeperViolation
  # gatekeeper_violations is a gauge, so use delta() rather than increase()
  expr: delta(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
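
If the cluster runs the Prometheus Operator, rules like these are usually shipped as a `PrometheusRule` resource. The sketch below assumes that setup. The resource name, the `release` label used for rule discovery, and the alert itself are hypothetical; the alert relies on the `enforcement_action` label that Gatekeeper attaches to `gatekeeper_violations`.

```yaml
# Sketch only: assumes the Prometheus Operator; names and labels are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts          # hypothetical name
  namespace: observability
  labels:
    release: kube-prometheus-stack      # hypothetical rule-selector label
spec:
  groups:
    - name: gatekeeper
      rules:
        - alert: GatekeeperDenyViolations   # hypothetical alert
          # Fires when audit finds existing resources that violate deny-mode constraints
          expr: sum(gatekeeper_violations{enforcement_action="deny"}) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Audit found resources violating deny-mode constraints"
```

Filtering on `enforcement_action` keeps the noise from `warn` and `dryrun` constraints out of higher-severity alerts during the rollout.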
Future Enhancements
- Move to `deny` enforcement once baseline violations are resolved
- Add network policies via Cilium for workload isolation
- ✅ Falco integrated — see ADR-0041 for runtime threat detection
- Add SBOM generation with Trivy for supply chain visibility
Detailed Component ADRs
| Component | ADR | Purpose |
|---|---|---|
| Gatekeeper | ADR-0040 | Policy templates, constraints, enforcement progression |
| Falco | ADR-0041 | Runtime threat detection, eBPF driver, Falcosidekick |
| Trivy Operator | ADR-0042 | Vulnerability scanning, compliance reports, Talos adaptations |