docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
2026-02-09 18:20:13 -05:00
parent fbd5e0bb70
commit 1bc602b726
5 changed files with 474 additions and 2 deletions


@@ -228,9 +228,17 @@ This is acceptable because Talos itself is security-hardened by design.
1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. **Falco integrated** — see [ADR-0041](0041-falco-runtime-threat-detection.md) for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility
## Detailed Component ADRs
| Component | ADR | Purpose |
|-----------|-----|--------|
| Gatekeeper | [ADR-0040](0040-opa-gatekeeper-policy-framework.md) | Policy templates, constraints, enforcement progression |
| Falco | [ADR-0041](0041-falco-runtime-threat-detection.md) | Runtime threat detection, eBPF driver, Falcosidekick |
| Trivy Operator | [ADR-0042](0042-trivy-operator-vulnerability-scanning.md) | Vulnerability scanning, compliance reports, Talos adaptations |
## References
* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)


@@ -0,0 +1,166 @@
# OPA Gatekeeper Policy Framework
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Document the Gatekeeper policy framework, constraint templates, and progressive enforcement strategy
## Context and Problem Statement
Kubernetes has no built-in mechanism to enforce organizational policies beyond basic Pod Security Standards. Without admission control, workloads can be deployed with excessive privileges, missing labels, or no resource limits — creating operational and security risks.
How do we enforce cluster-wide policies while avoiding disruption to existing workloads during rollout?
## Decision Drivers
* Prevent privilege escalation from misconfigured pods
* Enforce consistent labelling for observability and ownership
* Require resource limits to prevent noisy-neighbor issues
* Progressive rollout — observe violations before blocking
* System namespaces and infrastructure components must be exempted
## Decision Outcome
Deploy **OPA Gatekeeper** with all constraints initially in **warn** mode, using a three-stage Flux dependency chain to ensure correct resource ordering.
## Architecture
```
┌───────────────────────────────────────────────────────────┐
│ Flux Dependency Chain │
│ │
│ Stage 1: gatekeeper (controller) │
│ ↓ depends-on + healthChecks on CRDs │
│ Stage 2: constraint-templates (Rego policies) │
│ ↓ depends-on │
│ Stage 3: constraints (policy instances) │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Admission Flow │
│ │
│ kubectl/Flux → API Server → Gatekeeper Webhook │
│ │ │
│ ┌───────┴───────┐ │
│ │ Evaluate │ │
│ │ Constraints │ │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────┼──────────────┐ │
│ ▼ ▼ ▼ │
│ warn dryrun deny │
│ (log only) (audit only) (reject) │
└───────────────────────────────────────────────────────────┘
```
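The three-stage chain maps onto chained Flux Kustomizations. A minimal sketch — the names, paths, and source reference here are assumptions, not the repository's actual manifests:

```yaml
# Stage 2 waits on Stage 1 (controller + CRDs), Stage 3 waits on Stage 2.
# Paths and names are illustrative placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraint-templates
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/security/gatekeeper/constraint-templates
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: gatekeeper            # Stage 1 must be Ready first
  healthChecks:
    - apiVersion: apiextensions.k8s.io/v1
      kind: CustomResourceDefinition
      name: constrainttemplates.templates.gatekeeper.sh
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: constraints
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/security/gatekeeper/constraints
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: constraint-templates  # Stage 2 before Stage 3
```

Without the `dependsOn` links, Flux could apply a `K8sRequiredLabels` constraint before its ConstraintTemplate CRD exists and fail the reconciliation.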
## Deployment Configuration
| | |
|---|---|
| **Chart** | `gatekeeper` from `https://open-policy-agent.github.io/gatekeeper/charts` |
| **Namespace** | `gatekeeper-system` |
| **Replicas** | 2 |
| **Audit interval** | 60 seconds |
| **Webhook failure policy** | `Ignore` (fail-open) |
| **Log denies** | `true` |
| **Metrics backend** | Prometheus |
The webhook uses `Ignore` failure policy to avoid breaking workloads if Gatekeeper itself is unavailable — availability takes priority over enforcement in a homelab.
### Resources
| Component | CPU Request/Limit | Memory Request/Limit |
|-----------|-------------------|----------------------|
| Controller | 100m / 1000m | 256Mi / 512Mi |
| Audit Controller | 100m / 1000m | 1Gi / 4Gi |
The audit controller requires significantly more memory because it caches cluster state for background evaluation of all existing resources.
### Exempt Namespaces (Webhook)
`kube-system`, `gatekeeper-system`, `flux-system`
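The configuration table above roughly corresponds to these upstream-chart values — a sketch, not the cluster's actual HelmRelease:

```yaml
# Illustrative Gatekeeper chart values; key names follow the upstream
# gatekeeper chart but should be verified against the deployed version.
replicas: 2
auditInterval: 60
logDenies: true
validatingWebhookFailurePolicy: Ignore   # fail-open: availability over enforcement
metricsBackends: ["prometheus"]
controllerManager:
  exemptNamespaces:
    - kube-system
    - gatekeeper-system
    - flux-system
```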
## Constraint Templates
Three Rego-based constraint templates define the policy vocabulary:
### K8sPSPPrivilegedContainer
Blocks containers with `securityContext.privileged: true`. Checks all container types (containers, initContainers, ephemeralContainers). Supports `exemptImages` with wildcard prefix matching.
### K8sRequiredLabels
Requires specified labels on resources, with optional regex validation on values. Used to enforce the `app.kubernetes.io/name` convention.
### K8sContainerLimits
Requires containers to define resource limits. Parameterised for CPU and memory independently, with image exemptions.
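For reference, a ConstraintTemplate pairs a CRD schema with Rego. This is a simplified sketch of `K8sRequiredLabels` (the version in the Gatekeeper policy library also supports regex validation of label values):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels   # becomes the Constraint kind
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Emit one violation per required label missing from the object.
        violation[{"msg": msg}] {
          required := input.parameters.labels[_]
          not input.review.object.metadata.labels[required]
          msg := sprintf("missing required label: %v", [required])
        }
```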
## Constraints
All three constraints use **`enforcementAction: warn`** — violations are logged and surfaced in metrics but nothing is blocked.
### deny-privileged-containers
| | |
|---|---|
| **Template** | `K8sPSPPrivilegedContainer` |
| **Targets** | Pods |
| **Action** | warn |
**Excluded namespaces:** kube-system, kube-public, kube-node-lease, gatekeeper-system, cilium-secrets, longhorn-system, observability, trivy-system, security, gpu-operator
**Exempt images:**
- `quay.io/cilium/*` — CNI requires privileged access
- `ghcr.io/longhorn/*` — Storage driver needs host access
- `docker.io/falcosecurity/*` — eBPF probe requires elevated privileges
- `registry.k8s.io/*` — Core Kubernetes components
- `nvcr.io/nvidia/*` — GPU operator/drivers
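As a Constraint resource, the above looks roughly like this (trimmed to a subset of the excluded namespaces and exempt images listed above):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn          # log violations, block nothing (yet)
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - longhorn-system
      - gpu-operator
  parameters:
    exemptImages:
      - quay.io/cilium/*           # wildcard prefix matching
      - ghcr.io/longhorn/*
```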
### require-app-labels
| | |
|---|---|
| **Template** | `K8sRequiredLabels` |
| **Targets** | Deployments, StatefulSets, DaemonSets |
| **Action** | warn |
Requires `app.kubernetes.io/name` label. Excluded from system and infrastructure namespaces (kube-system, kube-public, kube-node-lease, gatekeeper-system, flux-system, cilium-secrets, cnpg-system).
### require-container-limits
| | |
|---|---|
| **Template** | `K8sContainerLimits` |
| **Targets** | Pods |
| **Action** | warn |
Requires memory limits (`requireMemory: true`) but not CPU limits (`requireCPU: false`). CPU limits are intentionally not required because they can cause CPU throttling, while memory limits protect against OOM.
**Exempt images:** `registry.k8s.io/*`, `quay.io/cilium/*`, `docker.io/library/*`
## Enforcement Progression
| Phase | Action | Purpose |
|-------|--------|---------|
| Current | `warn` | Establish baseline — understand existing violations |
| Next | `dryrun` | Audit-only mode visible in compliance reports |
| Target | `deny` | Block non-compliant resources at admission |
The move to `deny` is gated on resolving the baseline violations surfaced in the warn phase.
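Each phase transition is a one-field change on the Constraint, which makes the progression easy to stage per-policy in Git:

```yaml
spec:
  enforcementAction: deny   # progression: warn → dryrun → deny
```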
## Observability
**ServiceMonitor:** Scrapes Gatekeeper pods (label `gatekeeper.sh/system: "yes"`), port `metrics`, 30s interval.
**Grafana dashboards:**
| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Gatekeeper Overview | #15763 | Policy status, constraint health |
| Gatekeeper Violations | #14828 | Violation trends and details |
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (Gatekeeper component)
* [OPA Gatekeeper Documentation](https://open-policy-agent.github.io/gatekeeper/)
* [Gatekeeper Policy Library](https://open-policy-agent.github.io/gatekeeper-library/)


@@ -0,0 +1,156 @@
# Falco Runtime Threat Detection
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy runtime security monitoring to detect anomalous behavior and threats inside running containers
## Context and Problem Statement
Admission policies (Gatekeeper) and vulnerability scanning (Trivy) operate at deploy-time and scan-time respectively, but neither detects runtime threats — a container executing unexpected commands, opening unusual network connections, or reading sensitive files after it has been admitted.
How do we detect runtime security threats in a Talos Linux environment where kernel module compilation is impossible?
## Decision Drivers
* Detect runtime threats that admission policies can't prevent
* Must work on Talos Linux (immutable root filesystem, no kernel headers)
* Alert on suspicious activity without blocking legitimate workloads
* Stream alerts into the existing notification pipeline (Alertmanager → ntfy → Discord)
* Minimal performance impact on AI/ML GPU workloads
## Considered Options
1. **Falco with modern eBPF driver** — CNCF runtime security
2. **Tetragon** — Cilium-based eBPF security observability
3. **Sysdig Secure** — Commercial runtime security
4. **No runtime detection** — Rely on admission policies only
## Decision Outcome
Chosen option: **Option 1 - Falco with modern eBPF**, because it's CNCF graduated, supports the modern eBPF driver required for Talos, and integrates with Alertmanager via Falcosidekick.
### Positive Consequences
* Detects container escapes, unexpected shells, sensitive file reads at runtime
* modern_ebpf driver works on Talos without kernel module compilation
* Falcosidekick routes alerts to Alertmanager, integrating with existing pipeline
* JSON output enables structured log processing
* Runs on every node including control plane via tolerations
### Negative Consequences
* eBPF instrumentation adds minor CPU/memory overhead per node
* Tuning rules to reduce false positives requires ongoing attention
* Falcosidekick adds a Redis dependency for event deduplication
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Every Cluster Node │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Falco (DaemonSet) │ │
│ │ Driver: modern_ebpf (least-privileged) │ │
│ │ Runtime: containerd socket │ │
│ │ │ │
│ │ Kernel syscalls → eBPF probes → Rule evaluation │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ JSON alert output │ │
│ └─────────────────────────────────┬───────────────────┘ │
│ │ gRPC │
└────────────────────────────────────┼────────────────────────┘
┌────────────────────────┐
│ Falcosidekick │
│ │
│ → Alertmanager │
│ → Prometheus metrics │
│ → Web UI │
└────────────┬───────────┘
┌────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌────────────┐ ┌──────────────┐
│ Alertmanager │ │ Prometheus │ │ Web UI │
│ → ntfy │ │ (metrics) │ │ (inspect) │
│ → Discord │ └────────────┘ └──────────────┘
└──────────────┘
```
## Deployment Configuration
| | |
|---|---|
| **Chart** | `falco` from `https://falcosecurity.github.io/charts` |
| **Namespace** | `security` |
| **Driver** | `modern_ebpf` with `leastPrivileged: true` |
| **Container runtime** | Containerd only (`/run/containerd/containerd.sock`) |
| **Output format** | JSON (`json_output: true`) |
| **Minimum priority** | `warning` |
| **Log destination** | stderr (syslog disabled) |
| **Buffered outputs** | `false` (immediate delivery) |
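The table above maps onto Falco chart values roughly as follows — a sketch against the upstream chart's key names, not the actual HelmRelease:

```yaml
# Illustrative Falco chart values; verify key names against the chart version.
driver:
  kind: modern_ebpf
  modernEbpf:
    leastPrivileged: true        # minimal capabilities instead of privileged
collectors:
  docker:
    enabled: false
  crio:
    enabled: false
  containerd:
    enabled: true
    socket: /run/containerd/containerd.sock   # Talos uses containerd
falco:
  json_output: true
  priority: warning              # drop events below this severity
  buffered_outputs: false        # deliver alerts immediately
  syslog_output:
    enabled: false               # Talos has no syslog daemon
```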
### Resources
| CPU Request/Limit | Memory Request/Limit |
|-------------------|----------------------|
| 100m / 1000m | 512Mi / 1024Mi |
### Node Coverage
Falco tolerates **all** taints (`NoSchedule` + `NoExecute` with `Exists` operator), ensuring it runs on every node including:
- Control plane nodes
- GPU worker nodes with dedicated taints
- ARM64 edge nodes
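Tolerating all taints takes two wildcard tolerations using the `Exists` operator (no `key`, so any taint matches):

```yaml
tolerations:
  - operator: Exists
    effect: NoSchedule    # matches control-plane and GPU taints
  - operator: Exists
    effect: NoExecute     # survives eviction taints too
```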
## Talos Linux Adaptations
| Challenge | Solution |
|-----------|----------|
| No kernel headers/module compilation | `modern_ebpf` driver (CO-RE: BPF probe compiled into the Falco binary, loaded at runtime — no on-host compilation) |
| Immutable root filesystem | `leastPrivileged: true` — minimal host mounts |
| No syslog daemon | stderr-only logging, no syslog output |
| Containerd (not Docker/CRI-O) | Explicit containerd socket mount at `/run/containerd/containerd.sock` |
## Falcosidekick (Alert Routing)
Falcosidekick receives Falco events via gRPC and fans them out to multiple targets:
| Target | Configuration | Minimum Priority |
|--------|---------------|------------------|
| Alertmanager | `http://alertmanager-operated.observability.svc.cluster.local:9093` | `warning` |
| Prometheus | Metrics exporter enabled | — |
| Web UI | Enabled (ClusterIP service) | — |
**Redis persistence:** 1Gi PVC on `nfs-slow` StorageClass (NFS chosen for ARM node compatibility).
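In the Falco chart, Falcosidekick is configured as a subchart. A sketch of the values implied by the table — key names follow the falcosidekick config layout but are illustrative:

```yaml
falcosidekick:
  enabled: true
  webui:
    enabled: true                # ClusterIP UI backed by Redis
  config:
    alertmanager:
      hostport: http://alertmanager-operated.observability.svc.cluster.local:9093
      minimumpriority: warning   # forward warning and above only
```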
## Detection Categories
Falco uses the default ruleset plus local overrides. Key detection categories include:
| Category | Example Rules |
|----------|---------------|
| Container escape | ptrace attach, mount namespace changes |
| Unexpected shells | Shell spawned in non-shell container |
| Sensitive file access | Reading `/etc/shadow`, `/etc/passwd` |
| Network anomalies | Unexpected outbound connections |
| Privilege escalation | setuid/setgid calls, capability changes |
| Cryptomining | Known mining pool connections, CPU abuse patterns |
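Local overrides are delivered through the chart's `customRules` value. A hypothetical override tightening the stock shell-detection rule — the rule body here is illustrative, not the cluster's actual override:

```yaml
customRules:
  local-overrides.yaml: |-
    # Redefining a rule by name overrides the default ruleset's version.
    - rule: Terminal shell in container
      desc: Interactive shell spawned inside a container
      condition: >
        spawned_process and container
        and shell_procs and proc.tty != 0
      output: >
        Shell in container (user=%user.name container=%container.name
        cmd=%proc.cmdline)
      priority: WARNING
```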
## Observability
**ServiceMonitor:** Enabled with label `release: prometheus`, scraping Falcosidekick metrics.
Alert flow: Falco → Falcosidekick → Alertmanager → ntfy → Discord (same pipeline as all other cluster alerts, documented in [ADR-0039](0039-alerting-notification-pipeline.md)).
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (runtime detection component)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md) (alerting pipeline)
* [Falco Documentation](https://falco.org/docs/)
* [Falcosidekick](https://github.com/falcosecurity/falcosidekick)
* [modern_ebpf driver](https://falco.org/docs/event-sources/kernel/modern-ebpf/)


@@ -0,0 +1,139 @@
# Trivy Operator Vulnerability Scanning
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Continuously scan cluster workloads for vulnerabilities, misconfigurations, RBAC issues, and exposed secrets
## Context and Problem Statement
Container images accumulate vulnerabilities over time as new CVEs are disclosed. Without continuous scanning, the cluster's security posture degrades silently between deployments. Additionally, Kubernetes resource misconfigurations and overly permissive RBAC can create attack surfaces.
How do we maintain continuous visibility into the security posture of all cluster workloads, including images that haven't been rebuilt recently?
## Decision Drivers
* Continuous scanning — not just at build-time
* Cover multiple security dimensions (CVEs, misconfig, RBAC, secrets)
* Results stored as Kubernetes CRDs for GitOps-friendly querying
* Prometheus metrics for alerting and Grafana dashboards
* Must work on Talos Linux and heterogeneous (amd64 + arm64) clusters
## Decision Outcome
Deploy **Trivy Operator** in standalone mode with all applicable scanners enabled. Infra assessment is explicitly disabled for Talos compatibility, and scan jobs are scheduled preferentially on ARM64 nodes to keep scanning load off the GPU nodes.
## Deployment Configuration
| | |
|---|---|
| **Chart** | `trivy-operator` from `https://aquasecurity.github.io/helm-charts` |
| **Namespace** | `security` |
| **Mode** | Standalone (embedded database, no external Trivy server) |
| **Severity filter** | All levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL |
| **Ignore unfixed** | `false` (reports all vulnerabilities, even without patches) |
| **Scan timeout** | 10 minutes |
| **Concurrent scan jobs** | 10 |
| **Slow mode** | `true` (reduces resource usage at cost of scan speed) |
| **Compliance cron** | `0 */6 * * *` (every 6 hours) |
### Scan Job Resources
| CPU Request/Limit | Memory Request/Limit |
|-------------------|----------------------|
| 100m / 500m | 100M / 500M |
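The two tables above translate into trivy-operator chart values roughly like this — a sketch using upstream key names, to be checked against the deployed chart version:

```yaml
# Illustrative trivy-operator chart values.
operator:
  scanJobTimeout: 10m
  scanJobsConcurrentLimit: 10
trivy:
  severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  ignoreUnfixed: false           # report CVEs even without a fix
  slow: true                     # trade scan speed for lower resource use
  resources:
    requests:
      cpu: 100m
      memory: 100M
    limits:
      cpu: 500m
      memory: 500M
compliance:
  cron: "0 */6 * * *"            # compliance report every 6 hours
```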
## Scanners
| Scanner | Status | Purpose |
|---------|--------|---------|
| Vulnerability | **Enabled** | CVE scanning of container images |
| Config Audit | **Enabled** | Kubernetes resource misconfiguration checks |
| RBAC Assessment | **Enabled** | Overly permissive RBAC analysis |
| Exposed Secrets | **Enabled** | Detect secrets leaked in image layers/env vars |
| Cluster Compliance | **Enabled** | CIS benchmark compliance reports |
| Infra Assessment | **Disabled** | Requires `/etc/systemd` paths — incompatible with Talos |
### Report CRDs
Trivy stores all scan results as Kubernetes custom resources:
| CRD | Content |
|-----|---------|
| `VulnerabilityReport` | CVEs per container image with severity, fix version |
| `ConfigAuditReport` | Kubernetes misconfiguration findings |
| `RbacAssessmentReport` | RBAC privilege escalation risks |
| `ExposedSecretReport` | Secrets found in environment variables or image layers |
| `ClusterComplianceReport` | CIS benchmark compliance status |
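Because results are CRDs, they can be queried with plain `kubectl get vulnerabilityreports -A`. The shape of a report, trimmed — the workload name, counts, and CVE entry below are placeholders, not real scan output:

```yaml
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: replicaset-myapp-abc123-myapp   # generated per workload container (placeholder)
  namespace: default
report:
  summary:
    criticalCount: 1                    # placeholder counts
    highCount: 4
  vulnerabilities:
    - vulnerabilityID: CVE-XXXX-XXXXX   # placeholder CVE
      severity: CRITICAL
      installedVersion: 1.2.3
      fixedVersion: 1.2.4
```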
## Talos Linux Adaptations
| Challenge | Solution |
|-----------|----------|
| No `/etc/systemd` paths | Infra assessment scanner disabled |
| No standard `/var/lib/kubelet` | `nodeCollector` volumes and volumeMounts set to empty `[]` |
| Immutable root filesystem | Standalone mode — database cached in operator pod |
Talos is inherently security-hardened (immutable OS, no SSH, API-only management), making the infra assessment scanner redundant for the OS layer.
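The node-collector workaround from the table is a small values override — assuming the chart's `nodeCollector` keys, which should be verified against the deployed version:

```yaml
# Prevent the node collector from mounting host paths that don't exist on Talos.
nodeCollector:
  volumes: []
  volumeMounts: []
```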
## Scan Job Scheduling
Scan jobs use affinity to prefer **ARM64 nodes** (weight 100) and tolerate the `arm64` taint:
```yaml
tolerations:
- key: node.kubernetes.io/arch
value: arm64
effect: NoSchedule
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.io/arch
operator: In
values: [arm64]
```
This offloads scan work from GPU nodes (amd64) to ARM64 edge workers, avoiding interference with AI/ML inference workloads.
## Observability
### Prometheus Metrics
All four metric categories enabled:
| Helm setting | Purpose |
|--------|---------|
| `metricsFindingsEnabled` | Vulnerability counts by severity |
| `metricsConfigAuditInfo` | Misconfiguration findings |
| `metricsRbacAssessmentInfo` | RBAC assessment results |
| `metricsClusterComplianceInfo` | Compliance benchmark status |
**ServiceMonitor:** Enabled with label `release: prometheus`.
### Grafana Dashboards
| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Trivy Operator Vulnerabilities | #17813 | CVE overview by severity, namespace, image |
| Trivy Image Scan | #16337 | Detailed per-image scan results |
### Alerting
Vulnerability metrics feed into Prometheus alerting rules defined in kube-prometheus-stack (see [ADR-0039](0039-alerting-notification-pipeline.md)):
```
Trivy scans → VulnerabilityReport CRDs → Prometheus metrics
→ AlertRules (critical CVEs) → Alertmanager → ntfy → Discord
```
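A hypothetical PrometheusRule for the critical-CVE leg of that flow — the rule name, threshold, and label set are illustrative, though `trivy_image_vulnerabilities` with a `severity` label is the metric trivy-operator exposes:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-critical-cves        # illustrative name
  namespace: observability
spec:
  groups:
    - name: trivy
      rules:
        - alert: CriticalVulnerabilityDetected
          expr: >
            sum by (namespace, image_repository)
            (trivy_image_vulnerabilities{severity="Critical"}) > 0
          for: 15m                 # ignore transient report churn
          labels:
            severity: warning
          annotations:
            summary: >-
              Critical CVEs in {{ $labels.image_repository }}
              ({{ $labels.namespace }})
```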
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (scanning component)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md) (alerting pipeline)
* Related to [ADR-0035](0035-arm64-worker-strategy.md) (ARM64 scan job scheduling)
* [Trivy Operator Documentation](https://aquasecurity.github.io/trivy-operator/)
* [Trivy Vulnerability Database](https://github.com/aquasecurity/trivy-db)