docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
Commit 1bc602b726 (parent fbd5e0bb70), 2026-02-09 18:20:13 -05:00
5 changed files with 474 additions and 2 deletions


# OPA Gatekeeper Policy Framework
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy
## Context and Problem Statement
Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits — creating operational and security risks.
How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?
## Decision Drivers
* Prevent privilege escalation from misconfigured pods
* Enforce consistent labelling for observability and ownership
* Require resource limits to prevent noisy-neighbor issues
* Progressive rollout — observe violations before blocking
* System namespaces and infrastructure components must be exempted
## Decision Outcome
Deploy **OPA Gatekeeper** with all constraints initially in **warn** mode, using a three-stage Flux dependency chain to ensure correct resource ordering.
## Architecture
```
┌───────────────────────────────────────────────────────────┐
│ Flux Dependency Chain │
│ │
│ Stage 1: gatekeeper (controller) │
│ ↓ depends-on + healthChecks on CRDs │
│ Stage 2: constraint-templates (Rego policies) │
│ ↓ depends-on │
│ Stage 3: constraints (policy instances) │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Admission Flow │
│ │
│ kubectl/Flux → API Server → Gatekeeper Webhook │
│ │ │
│ ┌───────┴───────┐ │
│ │ Evaluate │ │
│ │ Constraints │ │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────┼──────────────┐ │
│ ▼ ▼ ▼ │
│ warn dryrun deny │
│ (log only) (audit only) (reject) │
└───────────────────────────────────────────────────────────┘
```
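The three-stage chain above can be sketched as a Flux Kustomization (stage 2 shown). The resource names follow the diagram; the repository path and the specific health-check target are assumptions, not the actual manifests:

```yaml
# Stage 2: applies the Rego ConstraintTemplates only after the Gatekeeper
# controller (stage 1) is reconciled and its CRDs are established.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/security/gatekeeper/templates  # assumed repo layout
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper          # stage 1 must be healthy first
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
```

The stage 3 `constraints` Kustomization repeats the same pattern with `dependsOn: [{name: constraint-templates}]`, which is what prevents Flux from applying Constraint instances before their template CRDs exist.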
## Deployment Configuration
| Setting | Value |
|---------|-------|
| **Chart** | `gatekeeper` from `https://open-policy-agent.github.io/gatekeeper/charts` |
| **Namespace** | `gatekeeper-system` |
| **Replicas** | 2 |
| **Audit interval** | 60 seconds |
| **Webhook failure policy** | `Ignore` (fail-open) |
| **Log denies** | `true` |
| **Metrics backend** | Prometheus |
The webhook uses `Ignore` failure policy to avoid breaking workloads if Gatekeeper itself is unavailable — availability takes priority over enforcement in a homelab.
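A HelmRelease values sketch matching the table above; the value keys follow the upstream `gatekeeper` chart, while the HelmRepository name is an assumption:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  chart:
    spec:
      chart: gatekeeper
      sourceRef:
        kind: HelmRepository
        name: gatekeeper  # assumed repository name
  values:
    replicas: 2
    auditInterval: 60                       # seconds
    logDenies: true
    metricsBackends: ["prometheus"]
    validatingWebhookFailurePolicy: Ignore  # fail-open: availability over enforcement
```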
### Resources
| Component | CPU Request/Limit | Memory Request/Limit |
|-----------|-------------------|----------------------|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |
The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.
### Exempt Namespaces (Webhook)
`kube-system`, `gatekeeper-system`, `flux-system`
## Constraint Templates
Three Rego-based constraint templates define the policy vocabulary:
### K8sPSPPrivilegedContainer
Blocks containers with `securityContext.privileged: true`. Checks all container types (containers, initContainers, ephemeralContainers). Supports `exemptImages` with wildcard prefix matching.
### K8sRequiredLabels
Requires specified labels on resources, with optional regex validation on values. Used to enforce the `app.kubernetes.io/name` convention.
### K8sContainerLimits
Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.
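As an illustration of the template vocabulary, a minimal `K8sRequiredLabels` ConstraintTemplate might look like the sketch below. This is simplified from the Gatekeeper library version; the Rego shown only checks label presence, while the real template also validates `allowedRegex`:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels   # the kind Constraints instantiate
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: object
                properties:
                  key:
                    type: string
                  allowedRegex:
                    type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := input.review.object.metadata.labels
          required := input.parameters.labels[_]
          not provided[required.key]
          msg := sprintf("missing required label: %v", [required.key])
        }
```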
## Constraints
All three constraints use **`enforcementAction: warn`** — violations are logged and surfaced in metrics but nothing is blocked.
### deny-privileged-containers
| Field | Value |
|-------|-------|
| **Template** | `K8sPSPPrivilegedContainer` |
| **Targets** | Pods |
| **Action** | warn |
**Excluded namespaces:** kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator
**Exempt images:**
- `quay.io/cilium/*` — CNI requires privileged access
- `ghcr.io/longhorn/*` — Storage driver needs host access
- `docker.io/falcosecurity/*` — eBPF probe requires elevated privileges
- `registry.k8s.io/*` — Core Kubernetes components
- `nvcr.io/nvidia/*` — GPU operator/drivers
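Assembled as a Constraint, the settings above map onto a manifest along these lines (a sketch, not the committed file):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn   # log violations, block nothing
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - trivy-system
      - security
      - gpu-operator
  parameters:
    exemptImages:           # wildcard prefix matching
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "docker.io/falcosecurity/*"
      - "registry.k8s.io/*"
      - "nvcr.io/nvidia/*"
```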
### require-app-labels
| Field | Value |
|-------|-------|
| **Template** | `K8sRequiredLabels` |
| **Targets** | Deployments, StatefulSets, DaemonSets |
| **Action** | warn |
Requires `app.kubernetes.io/name` label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).
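A sketch of the corresponding Constraint, assuming the library-style `labels` parameter shape (a list of `key` objects):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
      - flux-system
      - cilium-secrets
      - cnpg-system
  parameters:
    labels:
      - key: app.kubernetes.io/name
```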
### require-container-limits
| Field | Value |
|-------|-------|
| **Template** | `K8sContainerLimits` |
| **Targets** | Pods |
| **Action** | warn |
Requires memory limits (`requireMemory: true`) but not CPU limits (`requireCPU: false`). CPU limits are intentionally not required because they can throttle workloads even when spare CPU is available, whereas memory limits protect nodes from memory exhaustion.
**Exempt images:** `registry.k8s.io/*`, `quay.io/cilium/*`, `docker.io/library/*`
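A sketch of the Constraint; the `requireMemory`/`requireCPU` parameter names come from the custom template described above rather than the upstream library:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    requireMemory: true    # protect nodes from memory exhaustion
    requireCPU: false      # avoid CPU throttling from hard limits
    exemptImages:
      - "registry.k8s.io/*"
      - "quay.io/cilium/*"
      - "docker.io/library/*"
```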
## Enforcement Progression
| Phase | Action | Purpose |
|-------|--------|---------|
| Current | `warn` | Establish baseline — understand existing violations |
| Next | `dryrun` | Audit-only mode visible in compliance reports |
| Target | `deny` | Block non-compliant resources at admission |
The move to `deny` is gated on resolving the baseline violations surfaced in the warn phase.
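Promotion between phases is a one-field change on each Constraint, reconciled by Flux; for example, the warn → dryrun step:

```yaml
# Change only enforcementAction; audit results remain visible
# under each Constraint's .status.violations for the gating review.
spec:
  enforcementAction: dryrun   # was: warn; eventual target: deny
```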
## Observability
**ServiceMonitor:** Scrapes Gatekeeper pods (label `gatekeeper.sh/system: "yes"`), port `metrics`, 30s interval.
**Grafana dashboards:**
| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |
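The scrape configuration could be expressed as a ServiceMonitor along these lines, assuming the chart's metrics Service carries the `gatekeeper.sh/system` label (a sketch, not the committed manifest):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      gatekeeper.sh/system: "yes"
  namespaceSelector:
    matchNames: ["gatekeeper-system"]
  endpoints:
    - port: metrics
      interval: 30s
```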
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (Gatekeeper component)
* [OPA Gatekeeper Documentation](https://open-policy-agent.github.io/gatekeeper/)
* [Gatekeeper Policy Library](https://open-policy-agent.github.io/gatekeeper-library/)