
OPA Gatekeeper Policy Framework

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy

Context and Problem Statement

Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits — creating operational and security risks.

How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?

Decision Drivers

  • Prevent privilege escalation from misconfigured pods
  • Enforce consistent labelling for observability and ownership
  • Require resource limits to prevent noisy-neighbor issues
  • Progressive rollout — observe violations before blocking
  • System namespaces and infrastructure components must be exempted

Decision Outcome

Deploy OPA Gatekeeper with all constraints initially in warn mode, using a three-stage Flux dependency chain to ensure correct resource ordering.

Architecture

┌───────────────────────────────────────────────────────────┐
│                   Flux Dependency Chain                     │
│                                                           │
│  Stage 1: gatekeeper (controller)                         │
│      ↓ depends-on + healthChecks on CRDs                  │
│  Stage 2: constraint-templates (Rego policies)            │
│      ↓ depends-on                                         │
│  Stage 3: constraints (policy instances)                  │
└───────────────────────────────────────────────────────────┘
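The three-stage chain above can be sketched as Flux Kustomizations. This is a sketch, not the repository's actual layout — the `path` values and resource names are illustrative; the `dependsOn` and `healthChecks` fields are the real Flux mechanisms the chain relies on:

```yaml
# Stage 1: the Gatekeeper controller, with a health check on the
# ConstraintTemplate CRD so dependents wait until CRDs are established.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
# Stage 2: Rego constraint templates, applied only after stage 1 is healthy.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper-templates   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper
---
# Stage 3: constraint instances, applied only after the templates exist.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraints
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper-constraints   # illustrative path
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: constraint-templates
```

Without the `healthChecks` on the CRD, stage 2 could race the controller install and fail with "no matches for kind ConstraintTemplate"; `dependsOn` alone only orders reconciliation, it does not wait for health.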

┌───────────────────────────────────────────────────────────┐
│                   Admission Flow                           │
│                                                           │
│  kubectl/Flux → API Server → Gatekeeper Webhook           │
│                                  │                        │
│                          ┌───────┴───────┐                │
│                          │  Evaluate     │                │
│                          │  Constraints  │                │
│                          └───────┬───────┘                │
│                                  │                        │
│                    ┌─────────────┼──────────────┐         │
│                    ▼             ▼              ▼         │
│                 warn          dryrun          deny        │
│              (log only)    (audit only)    (reject)       │
└───────────────────────────────────────────────────────────┘

Deployment Configuration

Chart:                  gatekeeper (https://open-policy-agent.github.io/gatekeeper/charts)
Namespace:              gatekeeper-system
Replicas:               2
Audit interval:         60 seconds
Webhook failure policy: Ignore (fail-open)
Log denies:             true
Metrics backend:        Prometheus

The webhook uses the Ignore failure policy so that admission requests still succeed if Gatekeeper itself is down or unreachable — in a homelab, availability takes priority over enforcement.
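Expressed as Helm values, the configuration above might look like the following sketch. The value keys (`replicas`, `auditInterval`, `logDenies`, `validatingWebhookFailurePolicy`, `metricsBackends`) follow the upstream gatekeeper chart; the HelmRelease wrapper is an assumption, not taken from the repo:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  interval: 1h
  chart:
    spec:
      chart: gatekeeper
      sourceRef:
        kind: HelmRepository
        name: gatekeeper
  values:
    replicas: 2
    auditInterval: 60                        # seconds
    logDenies: true                          # log every denied/warned request
    validatingWebhookFailurePolicy: Ignore   # fail-open if Gatekeeper is down
    metricsBackends: ["prometheus"]
```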

Resources

Component          CPU (request / limit)   Memory (request / limit)
Controller         100m / 1000m            256Mi / 512Mi
Audit Controller   100m / 1000m            1Gi / 4Gi

The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.

Exempt Namespaces (Webhook)

kube-system, gatekeeper-system, flux-system
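One way to express this exemption is Gatekeeper's `Config` resource, sketched below (the chart's `exemptNamespaces` value plus the `admission.gatekeeper.sh/ignore` namespace label is an alternative mechanism; which one the repo actually uses is an assumption here):

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - processes: ["*"]   # exempt from webhook, audit, and sync
      excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - flux-system
```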

Constraint Templates

Three Rego-based constraint templates define the policy vocabulary:

K8sPSPPrivilegedContainer

Blocks containers with securityContext.privileged: true. Checks all container types (containers, initContainers, ephemeralContainers). Supports exemptImages with wildcard prefix matching.
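A minimal sketch of what such a template looks like — the Rego below is illustrative of the described behavior (all three container types, `exemptImages` with wildcard prefix matching), not the repo's exact policy:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
      validation:
        openAPIV3Schema:
          type: object
          properties:
            exemptImages:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{"msg": msg}] {
          c := input_containers[_]
          not is_exempt(c)
          c.securityContext.privileged
          msg := sprintf("Privileged container is not allowed: %v", [c.name])
        }

        # Gather every container type on the Pod.
        input_containers[c] { c := input.review.object.spec.containers[_] }
        input_containers[c] { c := input.review.object.spec.initContainers[_] }
        input_containers[c] { c := input.review.object.spec.ephemeralContainers[_] }

        # Wildcard prefix match: "quay.io/cilium/*" exempts any image
        # whose name starts with "quay.io/cilium/".
        is_exempt(c) {
          exempt := input.parameters.exemptImages[_]
          prefix := trim_suffix(exempt, "*")
          startswith(c.image, prefix)
        }
```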

K8sRequiredLabels

Requires specified labels on resources, with optional regex validation on values. Used to enforce the app.kubernetes.io/name convention.

K8sContainerLimits

Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.

Constraints

All three constraints use enforcementAction: warn — violations are logged and surfaced in metrics but nothing is blocked.

deny-privileged-containers

Template: K8sPSPPrivilegedContainer
Targets:  Pods
Action:   warn

Excluded namespaces: kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator

Exempt images:

  • quay.io/cilium/* — CNI requires privileged access
  • ghcr.io/longhorn/* — Storage driver needs host access
  • docker.io/falcosecurity/* — eBPF probe requires elevated privileges
  • registry.k8s.io/* — Core Kubernetes components
  • nvcr.io/nvidia/* — GPU operator/drivers
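Putting the fields above together, the constraint instance looks roughly like this (a sketch grounded in the excluded namespaces and exempt images listed above):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn   # log violations, do not block
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:
      - quay.io/cilium/*          # CNI requires privileged access
      - ghcr.io/longhorn/*        # storage driver needs host access
      - docker.io/falcosecurity/* # eBPF probe needs elevated privileges
      - registry.k8s.io/*         # core Kubernetes components
      - nvcr.io/nvidia/*          # GPU operator/drivers
```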

require-app-labels

Template: K8sRequiredLabels
Targets:  Deployments, StatefulSets, DaemonSets
Action:   warn

Requires app.kubernetes.io/name label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).

require-container-limits

Template: K8sContainerLimits
Targets:  Pods
Action:   warn

Requires memory limits (requireMemory: true) but not CPU limits (requireCPU: false). CPU limits are deliberately not required because they can throttle workloads even when the node has spare capacity, whereas memory limits protect nodes from OOM kills.

Exempt images: registry.k8s.io/*, quay.io/cilium/*, docker.io/library/*
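As a sketch, using the parameter names described above (`requireMemory`, `requireCPU`, `exemptImages` — these come from the document's own template description):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # memory limits protect nodes from OOM
    requireCPU: false      # CPU limits would cause throttling
    exemptImages:
      - registry.k8s.io/*
      - quay.io/cilium/*
      - docker.io/library/*
```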

Enforcement Progression

Phase     Action   Purpose
Current   warn     Establish baseline — understand existing violations
Next      dryrun   Audit-only mode, visible in compliance reports
Target    deny     Block non-compliant resources at admission

The move to deny is gated on resolving the baseline violations surfaced in the warn phase.

Observability

ServiceMonitor: scrapes the Gatekeeper metrics endpoints (selected by the gatekeeper.sh/system: "yes" label) on port metrics at a 30s interval.
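A sketch of that ServiceMonitor — the selector label, port name, and interval come from the description above; the resource name is assumed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"   # label on the Gatekeeper services
  endpoints:
    - port: metrics
      interval: 30s
```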

Grafana dashboards:

Dashboard               Grafana ID   Purpose
Gatekeeper Overview     #15763       Policy status, constraint health
Gatekeeper Violations   #14828       Violation trends and details