- ADR-0040: OPA Gatekeeper policy framework (constraint templates, progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos, Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled, ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
OPA Gatekeeper Policy Framework
- Status: accepted
- Date: 2026-02-09
- Deciders: Billy
- Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy
Context and Problem Statement
Out of the box, Kubernetes enforces no organizational policies beyond basic Pod Security Standards. Without additional admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits, creating operational and security risks.
How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?
Decision Drivers
- Prevent privilege escalation from misconfigured pods
- Enforce consistent labelling for observability and ownership
- Require resource limits to prevent noisy-neighbor issues
- Progressive rollout — observe violations before blocking
- System namespaces and infrastructure components must be exempted
Decision Outcome
Deploy OPA Gatekeeper with all constraints initially in warn mode, using a three-stage Flux dependency chain to ensure correct resource ordering.
Architecture
┌───────────────────────────────────────────────────────────┐
│  Flux Dependency Chain                                    │
│                                                           │
│  Stage 1: gatekeeper (controller)                         │
│      ↓ depends-on + healthChecks on CRDs                  │
│  Stage 2: constraint-templates (Rego policies)            │
│      ↓ depends-on                                         │
│  Stage 3: constraints (policy instances)                  │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│  Admission Flow                                           │
│                                                           │
│  kubectl/Flux → API Server → Gatekeeper Webhook           │
│                                      │                    │
│                              ┌───────┴───────┐            │
│                              │   Evaluate    │            │
│                              │  Constraints  │            │
│                              └───────┬───────┘            │
│                                      │                    │
│                        ┌─────────────┼─────────────┐      │
│                        ▼             ▼             ▼      │
│                      warn         dryrun         deny     │
│                   (log only)   (audit only)    (reject)   │
└───────────────────────────────────────────────────────────┘
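The three-stage chain above could be expressed as Flux Kustomizations along the following lines. This is a minimal sketch, not the repository's actual manifests: the `path` values, the `interval`, and the `flux-system` GitRepository name are assumptions, and the health check is placed on stage 1 as one plausible reading of "healthChecks on CRDs".

```yaml
# Stage 1: Gatekeeper controller. Reported Ready only once the
# ConstraintTemplate CRD exists, so stage 2 cannot start too early.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper
  namespace: flux-system
spec:
  interval: 10m                          # assumed
  path: ./path/to/gatekeeper             # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                    # assumed source name
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
# Stage 2: constraint templates (Rego policies).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  dependsOn:
    - name: gatekeeper
  interval: 10m                          # assumed
  path: ./path/to/constraint-templates   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
# Stage 3: constraints (policy instances).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraints
  namespace: flux-system
spec:
  dependsOn:
    - name: constraint-templates
  interval: 10m                          # assumed
  path: ./path/to/constraints            # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

The ordering matters because each constraint kind only exists as a CRD after Gatekeeper has processed its template; applying constraints earlier fails with an unknown-kind error.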
Deployment Configuration
| Setting | Value |
|---|---|
| Chart | gatekeeper from https://open-policy-agent.github.io/gatekeeper/charts |
| Namespace | gatekeeper-system |
| Replicas | 2 |
| Audit interval | 60 seconds |
| Webhook failure policy | Ignore (fail-open) |
| Log denies | true |
| Metrics backend | Prometheus |
The webhook uses the Ignore failure policy so that workloads keep deploying if Gatekeeper itself is unavailable; in a homelab, availability takes priority over enforcement.
Resources
| Component | CPU Request/Limit | Memory Request/Limit |
|---|---|---|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |
The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.
Exempt Namespaces (Webhook)
kube-system, gatekeeper-system, flux-system
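For orientation, the settings in this section might map onto Helm values roughly as follows. The key names are assumed from the Gatekeeper chart's documented values and should be checked against the chart version in use; the numbers are the ones stated above.

```yaml
# Sketch of Helm values for the gatekeeper chart (key names assumed).
replicas: 2
auditInterval: 60                        # seconds
validatingWebhookFailurePolicy: Ignore   # fail-open
logDenies: true
metricsBackends:
  - prometheus
controllerManager:
  exemptNamespaces:
    - kube-system
    - gatekeeper-system
    - flux-system
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 512Mi
audit:
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 4Gi
```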
Constraint Templates
Three Rego-based constraint templates define the policy vocabulary:
K8sPSPPrivilegedContainer
Blocks containers with securityContext.privileged: true. Checks all container types (containers, initContainers, ephemeralContainers). Supports exemptImages with wildcard prefix matching.
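A template with this behaviour could look roughly like the sketch below. It is an illustration written for this document, not the template deployed in the cluster: the Rego package name, the violation message, and the helper rule names are assumptions, and the Rego uses the classic (v0) syntax.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer        # lowercase form of the kind
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
      validation:
        openAPIV3Schema:
          type: object
          properties:
            exemptImages:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{"msg": msg}] {
          c := input_containers[_]
          not is_exempt(c)
          c.securityContext.privileged
          msg := sprintf("Privileged container is not allowed: %v", [c.name])
        }

        # Collect every container type in the pod spec.
        input_containers[c] { c := input.review.object.spec.containers[_] }
        input_containers[c] { c := input.review.object.spec.initContainers[_] }
        input_containers[c] { c := input.review.object.spec.ephemeralContainers[_] }

        # An entry ending in "*" exempts any image with that prefix.
        is_exempt(c) {
          exemption := input.parameters.exemptImages[_]
          endswith(exemption, "*")
          startswith(c.image, trim_suffix(exemption, "*"))
        }

        # An entry without "*" must match the image exactly.
        is_exempt(c) {
          exemption := input.parameters.exemptImages[_]
          not endswith(exemption, "*")
          c.image == exemption
        }
```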
K8sRequiredLabels
Requires specified labels on resources, with optional regex validation on values. Used to enforce the app.kubernetes.io/name convention.
K8sContainerLimits
Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.
Constraints
All three constraints use enforcementAction: warn. Violations are returned to the client as admission warnings, logged, and surfaced in metrics, but nothing is blocked.
deny-privileged-containers
| Field | Value |
|---|---|
| Template | K8sPSPPrivilegedContainer |
| Targets | Pods |
| Action | warn |
Excluded namespaces: kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator
Exempt images:
- quay.io/cilium/*: CNI requires privileged access
- ghcr.io/longhorn/*: Storage driver needs host access
- docker.io/falcosecurity/*: eBPF probe requires elevated privileges
- registry.k8s.io/*: Core Kubernetes components
- nvcr.io/nvidia/*: GPU operator/drivers
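Put together, the constraint described above might be declared as follows. This is a sketch assembled from the values in this section; the `exemptImages` parameter name comes from the template description, and the `v1beta1` constraint API version is an assumption.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "docker.io/falcosecurity/*"
      - "registry.k8s.io/*"
      - "nvcr.io/nvidia/*"
```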
require-app-labels
| Field | Value |
|---|---|
| Template | K8sRequiredLabels |
| Targets | Deployments, StatefulSets, DaemonSets |
| Action | warn |
Requires app.kubernetes.io/name label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).
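As a sketch, the constraint could be written like this. The `labels` parameter shape (a list of objects with a `key`, optionally an `allowedRegex`) is assumed from the Gatekeeper policy library's required-labels template and may differ from the template actually deployed.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - flux-system
      - cilium-secrets
      - cnpg-system
  parameters:
    labels:
      - key: app.kubernetes.io/name      # parameter shape assumed
```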
require-container-limits
| Field | Value |
|---|---|
| Template | K8sContainerLimits |
| Targets | Pods |
| Action | warn |
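Requires memory limits (requireMemory: true) but not CPU limits (requireCPU: false). CPU limits are intentionally not required because they can throttle a workload even when the node has spare CPU, whereas memory limits stop a runaway container from exhausting node memory and triggering OOM kills of other pods.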
Exempt images: registry.k8s.io/*, quay.io/cilium/*, docker.io/library/*
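A minimal sketch of this constraint, using the parameter names given above; `exemptImages` as the name of the image exemption list is an assumption carried over from the privileged-container template.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true
    requireCPU: false
    exemptImages:                        # parameter name assumed
      - "registry.k8s.io/*"
      - "quay.io/cilium/*"
      - "docker.io/library/*"
```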
Enforcement Progression
| Phase | Action | Purpose |
|---|---|---|
| Current | warn | Establish baseline: understand existing violations |
| Next | dryrun | Audit-only mode visible in compliance reports |
| Target | deny | Block non-compliant resources at admission |
The move to deny is gated on resolving the baseline violations surfaced in the warn phase.
Observability
ServiceMonitor: Scrapes Gatekeeper pods (label gatekeeper.sh/system: "yes"), port metrics, 30s interval.
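The scrape configuration described above might look like the following. It is a sketch under two assumptions: the resource name `gatekeeper` is hypothetical, and because a ServiceMonitor selects Services rather than pods, it assumes a Service in gatekeeper-system carries the same label and exposes a port named `metrics`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper                       # hypothetical name
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"
  endpoints:
    - port: metrics
      interval: 30s
```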
Grafana dashboards:
| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |
Links
- Implements ADR-0018 (Gatekeeper component)
- OPA Gatekeeper Documentation
- Gatekeeper Policy Library