docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
Commit 1bc602b726 (parent fbd5e0bb70), 2026-02-09 18:20:13 -05:00
5 changed files with 474 additions and 2 deletions


# OPA Gatekeeper Policy Framework
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy
## Context and Problem Statement
Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits — creating operational and security risks.
How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?
## Decision Drivers
* Prevent privilege escalation from misconfigured pods
* Enforce consistent labelling for observability and ownership
* Require resource limits to prevent noisy-neighbor issues
* Progressive rollout — observe violations before blocking
* System namespaces and infrastructure components must be exempted
## Decision Outcome
Deploy **OPA Gatekeeper** with all constraints initially in **warn** mode, using a three-stage Flux dependency chain to ensure correct resource ordering.
## Architecture
```
┌───────────────────────────────────────────────────────────┐
│ Flux Dependency Chain │
│ │
│ Stage 1: gatekeeper (controller) │
│ ↓ depends-on + healthChecks on CRDs │
│ Stage 2: constraint-templates (Rego policies) │
│ ↓ depends-on │
│ Stage 3: constraints (policy instances) │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Admission Flow │
│ │
│ kubectl/Flux → API Server → Gatekeeper Webhook │
│ │ │
│ ┌───────┴───────┐ │
│ │ Evaluate │ │
│ │ Constraints │ │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────┼──────────────┐ │
│ ▼ ▼ ▼ │
│ warn dryrun deny │
│ (log only) (audit only) (reject) │
└───────────────────────────────────────────────────────────┘
```
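The three-stage chain above can be sketched as a Flux Kustomization (stage 2 shown). The resource names follow the diagram; the repository path and the specific health-check target are assumptions, not the actual manifests:

```yaml
# Stage 2: applies the Rego ConstraintTemplates only after the Gatekeeper
# controller (stage 1) is reconciled and its CRDs are established.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/security/gatekeeper/templates  # assumed repo layout
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper          # stage 1 must be healthy first
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
```

The stage 3 `constraints` Kustomization repeats the same pattern with `dependsOn: [{name: constraint-templates}]`, which is what prevents Flux from applying Constraint instances before their template CRDs exist.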
## Deployment Configuration
| Setting | Value |
|---------|-------|
| **Chart** | `gatekeeper` from `https://open-policy-agent.github.io/gatekeeper/charts` |
| **Namespace** | `gatekeeper-system` |
| **Replicas** | 2 |
| **Audit interval** | 60 seconds |
| **Webhook failure policy** | `Ignore` (fail-open) |
| **Log denies** | `true` |
| **Metrics backend** | Prometheus |
The webhook uses `Ignore` failure policy to avoid breaking workloads if Gatekeeper itself is unavailable — availability takes priority over enforcement in a homelab.
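A HelmRelease values sketch matching the table above; the value keys follow the upstream `gatekeeper` chart, while the HelmRepository name is an assumption:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  chart:
    spec:
      chart: gatekeeper
      sourceRef:
        kind: HelmRepository
        name: gatekeeper  # assumed repository name
  values:
    replicas: 2
    auditInterval: 60                       # seconds
    logDenies: true
    metricsBackends: ["prometheus"]
    validatingWebhookFailurePolicy: Ignore  # fail-open: availability over enforcement
```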
### Resources
| Component | CPU Request/Limit | Memory Request/Limit |
|-----------|-------------------|----------------------|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |
The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.
### Exempt Namespaces (Webhook)
`kube-system`, `gatekeeper-system`, `flux-system`
## Constraint Templates
Three Rego-based constraint templates define the policy vocabulary:
### K8sPSPPrivilegedContainer
Blocks containers with `securityContext.privileged: true`. Checks all container types (containers, initContainers, ephemeralContainers). Supports `exemptImages` with wildcard prefix matching.
### K8sRequiredLabels
Requires specified labels on resources, with optional regex validation on values. Used to enforce the `app.kubernetes.io/name` convention.
### K8sContainerLimits
Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.
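As an illustration of the template vocabulary, a minimal `K8sRequiredLabels` ConstraintTemplate might look like the sketch below. This is simplified from the Gatekeeper library version; the Rego shown only checks label presence, while the real template also validates `allowedRegex`:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels   # the kind Constraints instantiate
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key:
                    type: string
                  allowedRegex:
                    type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := input.review.object.metadata.labels
          required := input.parameters.labels[_]
          not provided[required.key]
          msg := sprintf("missing required label: %v", [required.key])
        }
```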
## Constraints
All three constraints use **`enforcementAction: warn`** — violations are logged and surfaced in metrics but nothing is blocked.
### deny-privileged-containers
| Field | Value |
|-------|-------|
| **Template** | `K8sPSPPrivilegedContainer` |
| **Targets** | Pods |
| **Action** | warn |
**Excluded namespaces:** kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator
**Exempt images:**
- `quay.io/cilium/*` — CNI requires privileged access
- `ghcr.io/longhorn/*` — Storage driver needs host access
- `docker.io/falcosecurity/*` — eBPF probe requires elevated privileges
- `registry.k8s.io/*` — Core Kubernetes components
- `nvcr.io/nvidia/*` — GPU operator/drivers
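Assembled as a Constraint, the settings above map onto a manifest along these lines (a sketch, not the committed file):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn   # log violations, block nothing
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:           # wildcard prefix matching
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "docker.io/falcosecurity/*"
      - "registry.k8s.io/*"
      - "nvcr.io/nvidia/*"
```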
### require-app-labels
| Field | Value |
|-------|-------|
| **Template** | `K8sRequiredLabels` |
| **Targets** | Deployments, StatefulSets, DaemonSets |
| **Action** | warn |
Requires `app.kubernetes.io/name` label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).
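A sketch of the corresponding Constraint, assuming the library-style `labels` parameter shape (a list of `key` objects):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - flux-system
      - cilium-secrets
      - cnpg-system
  parameters:
    labels:
      - key: app.kubernetes.io/name
```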
### require-container-limits
| Field | Value |
|-------|-------|
| **Template** | `K8sContainerLimits` |
| **Targets** | Pods |
| **Action** | warn |
Requires memory limits (`requireMemory: true`) but not CPU limits (`requireCPU: false`). CPU limits are intentionally not required because they can throttle workloads even when spare CPU is available, whereas memory limits protect nodes from memory exhaustion.
**Exempt images:** `registry.k8s.io/*`, `quay.io/cilium/*`, `docker.io/library/*`
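A sketch of the Constraint; the `requireMemory`/`requireCPU` parameter names come from the custom template described above rather than the upstream library:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # protect nodes from memory exhaustion
    requireCPU: false      # avoid CPU throttling from hard limits
    exemptImages:
      - "registry.k8s.io/*"
      - "quay.io/cilium/*"
      - "docker.io/library/*"
```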
## Enforcement Progression
| Phase | Action | Purpose |
|-------|--------|---------|
| Current | `warn` | Establish baseline — understand existing violations |
| Next | `dryrun` | Audit-only mode visible in compliance reports |
| Target | `deny` | Block non-compliant resources at admission |
The move to `deny` is gated on resolving the baseline violations surfaced in the warn phase.
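Promotion between phases is a one-field change on each Constraint, reconciled by Flux; for example, the warn → dryrun step:

```yaml
# Change only enforcementAction; audit results remain visible
# under each Constraint's .status.violations for the gating review.
spec:
  enforcementAction: dryrun   # was: warn; eventual target: deny
```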
## Observability
**ServiceMonitor:** Scrapes Gatekeeper pods (label `gatekeeper.sh/system: "yes"`), port `metrics`, 30s interval.
**Grafana dashboards:**
| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |
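The scrape configuration could be expressed as a ServiceMonitor along these lines, assuming the chart's metrics Service carries the `gatekeeper.sh/system` label (a sketch, not the committed manifest):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"
  namespaceSelector:
    matchNames: ["gatekeeper-system"]
  endpoints:
    - port: metrics
      interval: 30s
```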
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (Gatekeeper component)
* [OPA Gatekeeper Documentation](https://open-policy-agent.github.io/gatekeeper/)
* [Gatekeeper Policy Library](https://open-policy-agent.github.io/gatekeeper-library/)