# OPA Gatekeeper Policy Framework

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy

## Context and Problem Statement

Beyond Pod Security Standards, Kubernetes does not enforce organizational policy by default. Without additional admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits, which creates operational and security risks.

How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?

## Decision Drivers

* Prevent privilege escalation from misconfigured pods
* Enforce consistent labelling for observability and ownership
* Require resource limits to prevent noisy-neighbor issues
* Progressive rollout — observe violations before blocking
* System namespaces and infrastructure components must be exempted

## Decision Outcome

Deploy **OPA Gatekeeper** with all constraints initially in **warn** mode, using a three-stage Flux dependency chain to ensure correct resource ordering.

## Architecture

```
┌───────────────────────────────────────────────────────────┐
│                   Flux Dependency Chain                   │
│                                                           │
│  Stage 1: gatekeeper (controller)                         │
│      ↓ depends-on + healthChecks on CRDs                  │
│  Stage 2: constraint-templates (Rego policies)            │
│      ↓ depends-on                                         │
│  Stage 3: constraints (policy instances)                  │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                   Admission Flow                          │
│                                                           │
│  kubectl/Flux → API Server → Gatekeeper Webhook           │
│                                  │                        │
│                          ┌───────┴───────┐                │
│                          │  Evaluate     │                │
│                          │  Constraints  │                │
│                          └───────┬───────┘                │
│                                  │                        │
│                    ┌─────────────┼──────────────┐         │
│                    ▼             ▼              ▼         │
│                 warn          dryrun          deny        │
│              (log only)    (audit only)    (reject)       │
└───────────────────────────────────────────────────────────┘
```
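The dependency chain in the diagram could be expressed as three Flux `Kustomization` objects, sketched below. The object names, `path` values, `interval`, and the `flux-system` `GitRepository` source are assumptions for illustration and are not taken from the repository; only the ordering via `dependsOn` and the CRD health check reflect the diagram.

```yaml
# Stage 1: Gatekeeper controller.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/gatekeeper/controller    # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                             # assumed source name
  # Only report Ready once the ConstraintTemplate CRD is established.
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
# Stage 2: constraint templates (Rego policies).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper-constraint-templates           # hypothetical name
  namespace: flux-system
spec:
  dependsOn:
    - name: gatekeeper
  interval: 10m
  path: ./infrastructure/gatekeeper/templates     # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
# Stage 3: constraints (policy instances).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gatekeeper-constraints                    # hypothetical name
  namespace: flux-system
spec:
  dependsOn:
    - name: gatekeeper-constraint-templates
  interval: 10m
  path: ./infrastructure/gatekeeper/constraints   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Splitting the stages matters because a constraint cannot be applied until its template has registered the corresponding constraint kind, and a template cannot be applied until the Gatekeeper CRDs exist.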

## Deployment Configuration

| | |
|---|---|
| **Chart** | `gatekeeper` from `https://open-policy-agent.github.io/gatekeeper/charts` |
| **Namespace** | `gatekeeper-system` |
| **Replicas** | 2 |
| **Audit interval** | 60 seconds |
| **Webhook failure policy** | `Ignore` (fail-open) |
| **Log denies** | `true` |
| **Metrics backend** | Prometheus |

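The webhook uses the `Ignore` failure policy so that workloads are not blocked if Gatekeeper itself is unavailable; in a homelab, availability takes priority over enforcement. The trade-off is that requests are admitted unevaluated while the webhook is down, so violations introduced during an outage are only caught by the next audit run.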
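As a sketch, the settings in the table map onto the chart's Helm values roughly as follows. The key names are those of the upstream `gatekeeper` chart as understood here; they should be checked against the chart version actually deployed.

```yaml
# Helm values for the gatekeeper chart (sketch, not the deployed file)
replicas: 2                              # controller-manager replicas
auditInterval: 60                        # seconds between audit runs
validatingWebhookFailurePolicy: Ignore   # fail-open if the webhook is unreachable
logDenies: true
metricsBackends:
  - prometheus
```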

### Resources

| Component | CPU Request/Limit | Memory Request/Limit |
|-----------|-------------------|----------------------|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |

The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.

### Exempt Namespaces (Webhook)

`kube-system`, `gatekeeper-system`, `flux-system`
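The resource settings and webhook exemptions above might be expressed in the chart's Helm values as follows. This is a sketch that assumes the upstream chart's `controllerManager` and `audit` value layout.

```yaml
controllerManager:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 512Mi
  # Namespaces the admission webhook does not evaluate.
  exemptNamespaces:
    - kube-system
    - gatekeeper-system
    - flux-system
audit:
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 4Gi
```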

## Constraint Templates

Three Rego-based constraint templates define the policy vocabulary:

### K8sPSPPrivilegedContainer

Flags containers that set `securityContext.privileged: true`. All container types are checked (`containers`, `initContainers`, `ephemeralContainers`). Supports `exemptImages` entries with trailing-wildcard (prefix) matching.
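A minimal sketch of what such a template could look like is shown below. It is a simplified illustration modelled on the upstream Gatekeeper library, not the template deployed in this cluster; the Rego package name, message text, and helper rule names are assumptions.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer   # must be the lowercased constraint kind
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
      validation:
        openAPIV3Schema:
          type: object
          properties:
            exemptImages:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        # One violation per privileged, non-exempt container.
        violation[{"msg": msg}] {
          c := input_containers[_]
          not is_exempt(c)
          c.securityContext.privileged
          msg := sprintf("Privileged container is not allowed: %v", [c.name])
        }

        # Collect every container type from the Pod spec.
        input_containers[c] { c := input.review.object.spec.containers[_] }
        input_containers[c] { c := input.review.object.spec.initContainers[_] }
        input_containers[c] { c := input.review.object.spec.ephemeralContainers[_] }

        # Entry ending in "*": match on the image prefix.
        is_exempt(c) {
          pattern := input.parameters.exemptImages[_]
          endswith(pattern, "*")
          startswith(c.image, trim_suffix(pattern, "*"))
        }

        # Entry without a wildcard: exact image match.
        is_exempt(c) {
          c.image == input.parameters.exemptImages[_]
        }
```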

### K8sRequiredLabels

Requires specified labels on resources, with optional regex validation on values. Used to enforce the `app.kubernetes.io/name` convention.

### K8sContainerLimits

Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.

## Constraints

All three constraints use **`enforcementAction: warn`**: violations are returned to the client as admission warnings, logged, and surfaced in metrics, but nothing is blocked.

### deny-privileged-containers

| | |
|---|---|
| **Template** | `K8sPSPPrivilegedContainer` |
| **Targets** | Pods |
| **Action** | warn |

**Excluded namespaces:** `kube-system`, `kube-public`, `kube-node-lease`, `gatekeeper-system`, `cilium-secrets`, `longhorn-system`, `observability`, `trivy-system`, `security`, `gpu-operator`

**Exempt images:**
- `quay.io/cilium/*` — CNI requires privileged access
- `ghcr.io/longhorn/*` — Storage driver needs host access
- `docker.io/falcosecurity/*` — eBPF probe requires elevated privileges
- `registry.k8s.io/*` — Core Kubernetes components
- `nvcr.io/nvidia/*` — GPU operator/drivers
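Put together, the constraint described above would look roughly like this. The namespaces and images are the ones listed in this section; the `v1beta1` API version and the `exemptImages` parameter name are assumed from the template description.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "docker.io/falcosecurity/*"
      - "registry.k8s.io/*"
      - "nvcr.io/nvidia/*"
```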

### require-app-labels

| | |
|---|---|
| **Template** | `K8sRequiredLabels` |
| **Targets** | Deployments, StatefulSets, DaemonSets |
| **Action** | warn |

Requires the `app.kubernetes.io/name` label. System and infrastructure namespaces are excluded: `kube-system`, `kube-public`, `kube-node-lease`, `gatekeeper-system`, `flux-system`, `cilium-secrets`, `cnpg-system`.
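A sketch of the corresponding constraint follows. The `labels` / `key` parameter shape mirrors the upstream Gatekeeper library's `K8sRequiredLabels` template and is an assumption about the local template's schema.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - flux-system
      - cilium-secrets
      - cnpg-system
  parameters:
    labels:
      - key: app.kubernetes.io/name   # assumed parameter shape
```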

### require-container-limits

| | |
|---|---|
| **Template** | `K8sContainerLimits` |
| **Targets** | Pods |
| **Action** | warn |

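Requires memory limits (`requireMemory: true`) but not CPU limits (`requireCPU: false`). CPU limits are intentionally not required: CPU is a compressible resource, and a limit causes throttling even when the node has idle capacity. Memory is not compressible, so a limit ensures a leaking container is OOM-killed on its own rather than exhausting the node.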

**Exempt images:** `registry.k8s.io/*`, `quay.io/cilium/*`, `docker.io/library/*`
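As a sketch, the constraint could be written as below. `requireMemory` and `requireCPU` are the parameters named above; `exemptImages` is assumed to follow the same convention as the privileged-container template.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # every container must set a memory limit
    requireCPU: false      # CPU limits deliberately not required
    exemptImages:
      - "registry.k8s.io/*"
      - "quay.io/cilium/*"
      - "docker.io/library/*"
```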

## Enforcement Progression

| Phase | Action | Purpose |
|-------|--------|---------|
| Current | `warn` | Establish baseline — understand existing violations |
| Next | `dryrun` | Audit-only mode; violations are recorded by the audit controller and visible in compliance reports |
| Target | `deny` | Block non-compliant resources at admission |

The move to `deny` is gated on resolving the baseline violations surfaced in the warn phase.

## Observability

**ServiceMonitor:** Scrapes Gatekeeper pods (label `gatekeeper.sh/system: "yes"`), port `metrics`, 30s interval.
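A minimal `ServiceMonitor` matching that description might look like the following. The object name is hypothetical, and because a `ServiceMonitor` selects Services rather than pods, the sketch assumes a Service in `gatekeeper-system` carries the `gatekeeper.sh/system: "yes"` label and exposes a port named `metrics`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper                 # hypothetical name
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"
  endpoints:
    - port: metrics                # named port on the selected Service
      interval: 30s
```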

**Grafana dashboards:**

| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |

## Links

* Implements [ADR-0018](0018-security-policy-enforcement.md) (Gatekeeper component)
* [OPA Gatekeeper Documentation](https://open-policy-agent.github.io/gatekeeper/)
* [Gatekeeper Policy Library](https://open-policy-agent.github.io/gatekeeper-library/)