# OPA Gatekeeper Policy Framework

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy

## Context and Problem Statement

Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits, creating operational and security risks.

How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?

## Decision Drivers

* Prevent privilege escalation from misconfigured pods
* Enforce consistent labelling for observability and ownership
* Require resource limits to prevent noisy-neighbor issues
* Progressive rollout: observe violations before blocking
* System namespaces and infrastructure components must be exempted

## Decision Outcome

Deploy **OPA Gatekeeper** with all constraints initially in **warn** mode, using a three-stage Flux dependency chain to ensure correct resource ordering.

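A minimal sketch of how the three-stage dependency chain could be expressed as Flux `Kustomization` resources. The stage names follow this document; the `path` values, the `interval`, and the `flux-system` `GitRepository` source are assumptions for illustration, not the repository's actual layout.

```yaml
# Stage 1: the Gatekeeper controller. The health check makes Flux wait
# until the ConstraintTemplate CRD exists before marking this stage ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper
  namespace: flux-system
spec:
  interval: 10m                                 # assumed value
  path: ./infrastructure/gatekeeper/controller  # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                           # assumed source name
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
# Stage 2: constraint templates (Rego policies), applied only after stage 1.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper/templates   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper
---
# Stage 3: constraints (policy instances), applied only after stage 2.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraints
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper/constraints # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: constraint-templates
```

The ordering matters because each stage creates the CRDs that the next stage's resources are instances of: the controller installs the `ConstraintTemplate` CRD, and each template in turn generates the CRD for its constraint kind.
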
## Architecture

```
┌───────────────────────────────────────────────────────────┐
│                   Flux Dependency Chain                   │
│                                                           │
│  Stage 1: gatekeeper (controller)                         │
│       ↓ depends-on + healthChecks on CRDs                 │
│  Stage 2: constraint-templates (Rego policies)            │
│       ↓ depends-on                                        │
│  Stage 3: constraints (policy instances)                  │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                      Admission Flow                       │
│                                                           │
│  kubectl/Flux → API Server → Gatekeeper Webhook           │
│                                       │                   │
│                               ┌───────┴───────┐           │
│                               │   Evaluate    │           │
│                               │  Constraints  │           │
│                               └───────┬───────┘           │
│                                       │                   │
│                         ┌─────────────┼──────────────┐    │
│                         ▼             ▼              ▼    │
│                        warn        dryrun           deny  │
│                    (log only)   (audit only)     (reject) │
└───────────────────────────────────────────────────────────┘
```

## Deployment Configuration

| | |
|---|---|
| **Chart** | `gatekeeper` from `https://open-policy-agent.github.io/gatekeeper/charts` |
| **Namespace** | `gatekeeper-system` |
| **Replicas** | 2 |
| **Audit interval** | 60 seconds |
| **Webhook failure policy** | `Ignore` (fail-open) |
| **Log denies** | `true` |
| **Metrics backend** | Prometheus |

The webhook uses the `Ignore` failure policy so that workloads are not blocked if Gatekeeper itself is unavailable; availability takes priority over enforcement in a homelab.

### Resources

| Component | CPU Request/Limit | Memory Request/Limit |
|-----------|-------------------|----------------------|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |

The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.

### Exempt Namespaces (Webhook)

`kube-system`, `gatekeeper-system`, `flux-system`

## Constraint Templates

Three Rego-based constraint templates define the policy vocabulary:

### K8sPSPPrivilegedContainer

Blocks containers with `securityContext.privileged: true`. Checks all container types (containers, initContainers, ephemeralContainers).
Supports `exemptImages` with wildcard prefix matching.

### K8sRequiredLabels

Requires specified labels on resources, with optional regex validation on values. Used to enforce the `app.kubernetes.io/name` convention.

### K8sContainerLimits

Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.

## Constraints

All three constraints use **`enforcementAction: warn`**: violations are returned to the client as admission warnings, recorded by the audit controller, and surfaced in metrics, but nothing is blocked.

### deny-privileged-containers

| | |
|---|---|
| **Template** | `K8sPSPPrivilegedContainer` |
| **Targets** | Pods |
| **Action** | warn |

**Excluded namespaces:** `kube-system`, `kube-public`, `kube-node-lease`, `gatekeeper-system`, `cilium-secrets`, `longhorn-system`, `observability`, `trivy-system`, `security`, `gpu-operator`

**Exempt images:**

- `quay.io/cilium/*`: CNI requires privileged access
- `ghcr.io/longhorn/*`: Storage driver needs host access
- `docker.io/falcosecurity/*`: eBPF probe requires elevated privileges
- `registry.k8s.io/*`: Core Kubernetes components
- `nvcr.io/nvidia/*`: GPU operator/drivers

### require-app-labels

| | |
|---|---|
| **Template** | `K8sRequiredLabels` |
| **Targets** | Deployments, StatefulSets, DaemonSets |
| **Action** | warn |

Requires the `app.kubernetes.io/name` label. System and infrastructure namespaces are excluded: `kube-system`, `kube-public`, `kube-node-lease`, `gatekeeper-system`, `flux-system`, `cilium-secrets`, `cnpg-system`.

### require-container-limits

| | |
|---|---|
| **Template** | `K8sContainerLimits` |
| **Targets** | Pods |
| **Action** | warn |

Requires memory limits (`requireMemory: true`) but not CPU limits (`requireCPU: false`). CPU limits are intentionally not required because they can throttle a container even when the node has idle CPU, while memory limits ensure a runaway container is OOM-killed before it exhausts node memory.

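As a rough illustration, the `require-container-limits` constraint might be written as follows. The constraint name, template kind, `warn` action, Pod target, and the `requireMemory`/`requireCPU` values come from this document; the `apiVersion`, the `match` block, and the layout of `parameters` are assumptions, since the template's schema is not reproduced here.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1   # assumed API version
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  # warn: admit the request but report the violation.
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # every container must set a memory limit
    requireCPU: false      # CPU limits stay optional to avoid throttling
    # Image exemptions would also be passed here; the parameter name
    # depends on the template's schema and is not shown in this sketch.
```

Moving this constraint to a later enforcement phase would then be a one-line change to `enforcementAction`.
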
**Exempt images:** `registry.k8s.io/*`, `quay.io/cilium/*`, `docker.io/library/*`

## Enforcement Progression

| Phase | Action | Purpose |
|-------|--------|---------|
| Current | `warn` | Establish baseline: understand existing violations |
| Next | `dryrun` | Audit-only mode visible in compliance reports |
| Target | `deny` | Block non-compliant resources at admission |

The move to `deny` is gated on resolving the baseline violations surfaced in the warn phase.

## Observability

**ServiceMonitor:** Scrapes Gatekeeper pods (label `gatekeeper.sh/system: "yes"`), port `metrics`, 30s interval.

**Grafana dashboards:**

| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |

## Links

* Implements [ADR-0018](0018-security-policy-enforcement.md) (Gatekeeper component)
* [OPA Gatekeeper Documentation](https://open-policy-agent.github.io/gatekeeper/)
* [Gatekeeper Policy Library](https://open-policy-agent.github.io/gatekeeper-library/)