
OPA Gatekeeper Policy Framework

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy

Context and Problem Statement

Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits — creating operational and security risks.

How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?

Decision Drivers

  • Prevent privilege escalation from misconfigured pods
  • Enforce consistent labelling for observability and ownership
  • Require resource limits to prevent noisy-neighbor issues
  • Progressive rollout — observe violations before blocking
  • System namespaces and infrastructure components must be exempted

Decision Outcome

Deploy OPA Gatekeeper with all constraints initially in warn mode, using a three-stage Flux dependency chain to ensure correct resource ordering.

Architecture

┌───────────────────────────────────────────────────────────┐
│                   Flux Dependency Chain                     │
│                                                           │
│  Stage 1: gatekeeper (controller)                         │
│      ↓ depends-on + healthChecks on CRDs                  │
│  Stage 2: constraint-templates (Rego policies)            │
│      ↓ depends-on                                         │
│  Stage 3: constraints (policy instances)                  │
└───────────────────────────────────────────────────────────┘
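The three-stage chain above can be sketched as Flux Kustomizations. This is a sketch, not the repository's actual layout — the `path` values and resource names are illustrative; the `dependsOn` and `healthChecks` fields are the real Flux mechanisms the chain relies on:

```yaml
# Stage 1: the Gatekeeper controller, with a health check on the
# ConstraintTemplate CRD so dependents wait until CRDs are established.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
# Stage 2: Rego constraint templates, applied only after stage 1 is healthy.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper-templates   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper
---
# Stage 3: constraint instances, applied only after the templates exist.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraints
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper-constraints   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: constraint-templates
```

Without the `healthChecks` on the CRD, stage 2 could race the controller install and fail with "no matches for kind ConstraintTemplate"; `dependsOn` alone only orders reconciliation, it does not wait for health.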

┌───────────────────────────────────────────────────────────┐
│                   Admission Flow                           │
│                                                           │
│  kubectl/Flux → API Server → Gatekeeper Webhook           │
│                                  │                        │
│                          ┌───────┴───────┐                │
│                          │  Evaluate     │                │
│                          │  Constraints  │                │
│                          └───────┬───────┘                │
│                                  │                        │
│                    ┌─────────────┼──────────────┐         │
│                    ▼             ▼              ▼         │
│                 warn          dryrun          deny        │
│              (log only)    (audit only)    (reject)       │
└───────────────────────────────────────────────────────────┘

Deployment Configuration

Chart:                  gatekeeper (https://open-policy-agent.github.io/gatekeeper/charts)
Namespace:              gatekeeper-system
Replicas:               2
Audit interval:         60 seconds
Webhook failure policy: Ignore (fail-open)
Log denies:             true
Metrics backend:        Prometheus

The webhook uses the Ignore failure policy so that admission requests still succeed if Gatekeeper itself is down or unreachable — in a homelab, availability takes priority over enforcement.
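Expressed as Helm values, the configuration above might look like the following sketch. The value keys (`replicas`, `auditInterval`, `logDenies`, `validatingWebhookFailurePolicy`, `metricsBackends`) follow the upstream gatekeeper chart; the HelmRelease wrapper is an assumption, not taken from the repo:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  interval: 1h
  chart:
    spec:
      chart: gatekeeper
      sourceRef:
        kind: HelmRepository
        name: gatekeeper
  values:
    replicas: 2
    auditInterval: 60                        # seconds
    logDenies: true                          # log every denied/warned request
    validatingWebhookFailurePolicy: Ignore   # fail-open if Gatekeeper is down
    metricsBackends: ["prometheus"]
```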

Resources

Component          CPU (request / limit)   Memory (request / limit)
Controller         100m / 1000m            256Mi / 512Mi
Audit Controller   100m / 1000m            1Gi / 4Gi

The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.

Exempt Namespaces (Webhook)

kube-system, gatekeeper-system, flux-system
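One way to express this exemption is Gatekeeper's `Config` resource, sketched below (the chart's `exemptNamespaces` value plus the `admission.gatekeeper.sh/ignore` namespace label is an alternative mechanism; which one the repo actually uses is an assumption here):

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - processes: ["*"]   # exempt from webhook, audit, and sync
      excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - flux-system
```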

Constraint Templates

Three Rego-based constraint templates define the policy vocabulary:

K8sPSPPrivilegedContainer

Blocks containers with securityContext.privileged: true. Checks all container types (containers, initContainers, ephemeralContainers). Supports exemptImages with wildcard prefix matching.
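A minimal sketch of what such a template looks like — the Rego below is illustrative of the described behavior (all three container types, `exemptImages` with wildcard prefix matching), not the repo's exact policy:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
      validation:
        openAPIV3Schema:
          type: object
          properties:
            exemptImages:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{"msg": msg}] {
          c := input_containers[_]
          not is_exempt(c)
          c.securityContext.privileged
          msg := sprintf("Privileged container is not allowed: %v", [c.name])
        }

        # Gather every container type on the Pod.
        input_containers[c] { c := input.review.object.spec.containers[_] }
        input_containers[c] { c := input.review.object.spec.initContainers[_] }
        input_containers[c] { c := input.review.object.spec.ephemeralContainers[_] }

        # Wildcard prefix match: "quay.io/cilium/*" exempts any image
        # whose name starts with "quay.io/cilium/".
        is_exempt(c) {
          exempt := input.parameters.exemptImages[_]
          prefix := trim_suffix(exempt, "*")
          startswith(c.image, prefix)
        }
```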

K8sRequiredLabels

Requires specified labels on resources, with optional regex validation on values. Used to enforce the app.kubernetes.io/name convention.

K8sContainerLimits

Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.

Constraints

All three constraints use enforcementAction: warn — violations are logged and surfaced in metrics but nothing is blocked.

deny-privileged-containers

Template: K8sPSPPrivilegedContainer
Targets:  Pods
Action:   warn

Excluded namespaces: kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator

Exempt images:

  • quay.io/cilium/* — CNI requires privileged access
  • ghcr.io/longhorn/* — Storage driver needs host access
  • docker.io/falcosecurity/* — eBPF probe requires elevated privileges
  • registry.k8s.io/* — Core Kubernetes components
  • nvcr.io/nvidia/* — GPU operator/drivers
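Putting the fields above together, the constraint instance looks roughly like this (a sketch grounded in the excluded namespaces and exempt images listed above):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn   # log violations, do not block
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:
      - quay.io/cilium/*          # CNI requires privileged access
      - ghcr.io/longhorn/*        # storage driver needs host access
      - docker.io/falcosecurity/* # eBPF probe needs elevated privileges
      - registry.k8s.io/*         # core Kubernetes components
      - nvcr.io/nvidia/*          # GPU operator/drivers
```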

require-app-labels

Template: K8sRequiredLabels
Targets:  Deployments, StatefulSets, DaemonSets
Action:   warn

Requires app.kubernetes.io/name label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).

require-container-limits

Template: K8sContainerLimits
Targets:  Pods
Action:   warn

Requires memory limits (requireMemory: true) but not CPU limits (requireCPU: false). CPU limits are deliberately not required because they can throttle workloads even when the node has spare capacity, whereas memory limits protect nodes from OOM kills.

Exempt images: registry.k8s.io/*, quay.io/cilium/*, docker.io/library/*
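As a sketch, using the parameter names described above (`requireMemory`, `requireCPU`, `exemptImages` — these come from the document's own template description):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # memory limits protect nodes from OOM
    requireCPU: false      # CPU limits would cause throttling
    exemptImages:
      - registry.k8s.io/*
      - quay.io/cilium/*
      - docker.io/library/*
```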

Enforcement Progression

Phase     Action   Purpose
Current   warn     Establish baseline — understand existing violations
Next      dryrun   Audit-only mode, visible in compliance reports
Target    deny     Block non-compliant resources at admission

The move to deny is gated on resolving the baseline violations surfaced in the warn phase.

Observability

ServiceMonitor: scrapes the Gatekeeper metrics endpoints (selected by the gatekeeper.sh/system: "yes" label) on port metrics at a 30s interval.
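A sketch of that ServiceMonitor — the selector label, port name, and interval come from the description above; the resource name is assumed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"   # label on the Gatekeeper services
  endpoints:
    - port: metrics
      interval: 30s
```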

Grafana dashboards:

Dashboard               Grafana ID   Purpose
Gatekeeper Overview     #15763       Policy status, constraint health
Gatekeeper Violations   #14828       Violation trends and details