docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
2026-02-09 18:20:13 -05:00
parent fbd5e0bb70
commit 1bc602b726
5 changed files with 474 additions and 2 deletions

@@ -0,0 +1,156 @@
# Falco Runtime Threat Detection
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy runtime security monitoring to detect anomalous behavior and threats inside running containers
## Context and Problem Statement
Admission policies (Gatekeeper) and vulnerability scanning (Trivy) operate at deploy-time and scan-time respectively, but neither detects runtime threats — a container executing unexpected commands, opening unusual network connections, or reading sensitive files after it has been admitted.
How do we detect runtime security threats in a Talos Linux environment where kernel module compilation is impossible?
## Decision Drivers
* Detect runtime threats that admission policies can't prevent
* Must work on Talos Linux (immutable root filesystem, no kernel headers)
* Alert on suspicious activity without blocking legitimate workloads
* Stream alerts into the existing notification pipeline (Alertmanager → ntfy → Discord)
* Minimal performance impact on AI/ML GPU workloads
## Considered Options
1. **Falco with modern eBPF driver** — CNCF runtime security
2. **Tetragon** — Cilium-based eBPF security observability
3. **Sysdig Secure** — Commercial runtime security
4. **No runtime detection** — Rely on admission policies only
## Decision Outcome
Chosen option: **Option 1 - Falco with modern eBPF**, because it's CNCF graduated, supports the modern eBPF driver required for Talos, and integrates with Alertmanager via Falcosidekick.
### Positive Consequences
* Detects container escapes, unexpected shells, and sensitive file reads at runtime
* The `modern_ebpf` driver works on Talos without kernel module compilation
* Falcosidekick routes alerts to Alertmanager, integrating with the existing pipeline
* JSON output enables structured log processing
* Runs on every node including control plane via tolerations
### Negative Consequences
* eBPF instrumentation adds minor CPU/memory overhead per node
* Tuning rules to reduce false positives requires ongoing attention
* The Falcosidekick web UI adds a Redis dependency for event storage
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     Every Cluster Node                      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Falco (DaemonSet)                                  │    │
│  │  Driver: modern_ebpf (least-privileged)             │    │
│  │  Runtime: containerd socket                         │    │
│  │                                                     │    │
│  │  Kernel syscalls → eBPF probes → Rule evaluation    │    │
│  │                                         │           │    │
│  │                                         ▼           │    │
│  │                                 JSON alert output   │    │
│  └─────────────────────────────────┬───────────────────┘    │
│                                    │ HTTP                   │
└────────────────────────────────────┼────────────────────────┘
                        ┌────────────────────────┐
                        │  Falcosidekick         │
                        │                        │
                        │  → Alertmanager        │
                        │  → Prometheus metrics  │
                        │  → Web UI              │
                        └────────────┬───────────┘
                    ┌────────────────┼─────────────────┐
                    ▼                ▼                 ▼
             ┌──────────────┐ ┌────────────┐   ┌──────────────┐
             │ Alertmanager │ │ Prometheus │   │ Web UI       │
             │ → ntfy       │ │ (metrics)  │   │ (inspect)    │
             │ → Discord    │ └────────────┘   └──────────────┘
             └──────────────┘
```
## Deployment Configuration
| Setting | Value |
|---------|-------|
| **Chart** | `falco` from `https://falcosecurity.github.io/charts` |
| **Namespace** | `security` |
| **Driver** | `modern_ebpf` with `leastPrivileged: true` |
| **Container runtime** | Containerd only (`/run/containerd/containerd.sock`) |
| **Output format** | JSON (`json_output: true`) |
| **Minimum priority** | `warning` |
| **Log destination** | stderr (syslog disabled) |
| **Buffered outputs** | `false` (immediate delivery) |
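
To illustrate how the `warning` threshold interacts with JSON output, here is a minimal Python sketch; it is not part of the deployment. It assumes Falco's documented priority ordering (emergency down to debug) and a `priority` field on each JSON alert. The sample events and the helper names `meets_minimum` and `filter_alerts` are hypothetical.

```python
import json

# Falco priority levels, most severe first.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]
RANK = {name: i for i, name in enumerate(PRIORITIES)}


def meets_minimum(priority: str, minimum: str = "warning") -> bool:
    """Return True if `priority` is at least as severe as `minimum`."""
    return RANK[priority.lower()] <= RANK[minimum.lower()]


def filter_alerts(lines, minimum="warning"):
    """Parse JSON alert lines and keep those at or above the threshold."""
    alerts = (json.loads(line) for line in lines if line.strip())
    return [a for a in alerts if meets_minimum(a["priority"], minimum)]


# Hypothetical sample events, shaped like Falco's JSON output.
sample = [
    '{"rule": "Terminal shell in container", "priority": "Notice"}',
    '{"rule": "Read sensitive file untrusted", "priority": "Warning"}',
    '{"rule": "Example critical rule", "priority": "Critical"}',
]
kept = filter_alerts(sample)
```

In this sketch only the `Warning` and `Critical` events survive. In the real deployment Falco applies the threshold itself, so events below `warning` never leave the node.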
### Resources
| CPU Request/Limit | Memory Request/Limit |
|-------------------|----------------------|
| 100m / 1000m | 512Mi / 1024Mi |
### Node Coverage
Falco tolerates **all** taints (`NoSchedule` + `NoExecute` with `Exists` operator), ensuring it runs on every node including:
- Control plane nodes
- GPU worker nodes with dedicated taints
- ARM64 edge nodes
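
The toleration behavior above can be sketched in Python as a simplified model of Kubernetes taint matching. It assumes the two tolerations carry no `key`, which is what lets `Exists` match any taint key. The GPU taint shown is a hypothetical example, and `tolerates` and `schedulable` are illustrative names, not Kubernetes APIs.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Approximate Kubernetes matching of one toleration against one taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key combined with Exists matches every taint key.
        return key == "" or key == taint.get("key")
    # Equal operator: key and value must both match.
    return (key == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


# The two tolerations described in this ADR (assumed to have no key).
FALCO_TOLERATIONS = [
    {"operator": "Exists", "effect": "NoSchedule"},
    {"operator": "Exists", "effect": "NoExecute"},
]


def schedulable(taints, tolerations=FALCO_TOLERATIONS) -> bool:
    """True if every hard taint is matched by at least one toleration."""
    # PreferNoSchedule is a soft preference and never blocks scheduling.
    hard = [t for t in taints if t.get("effect") != "PreferNoSchedule"]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


control_plane = [{"key": "node-role.kubernetes.io/control-plane",
                  "effect": "NoSchedule"}]
# Hypothetical dedicated GPU taint.
gpu_worker = [{"key": "example.com/gpu", "value": "true",
               "effect": "NoSchedule"}]
```

With these inputs, `schedulable(control_plane)` and `schedulable(gpu_worker)` both return `True`, while the same taints with an empty toleration list return `False`.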
## Talos Linux Adaptations
| Challenge | Solution |
|-----------|----------|
| No kernel headers/module compilation | `modern_ebpf` driver (CO-RE probe embedded in the Falco binary at build time, loaded at runtime) |
| Immutable root filesystem | `leastPrivileged: true` — minimal host mounts |
| No syslog daemon | stderr-only logging, no syslog output |
| Containerd (not Docker/CRI-O) | Explicit containerd socket mount at `/run/containerd/containerd.sock` |
## Falcosidekick (Alert Routing)
Falcosidekick receives Falco events over HTTP (Falco's `http_output`) and fans them out to multiple targets:
| Target | Configuration | Minimum Priority |
|--------|---------------|------------------|
| Alertmanager | `http://alertmanager-operated.observability.svc.cluster.local:9093` | `warning` |
| Prometheus | Metrics exporter enabled | — |
| Web UI | Enabled (ClusterIP service) | — |
**Redis persistence:** 1Gi PVC on `nfs-slow` StorageClass (NFS chosen for ARM node compatibility).
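
As a rough illustration of the Alertmanager leg of this fan-out, the Python sketch below maps one Falco JSON event to the payload shape accepted by Alertmanager's `/api/v2/alerts` endpoint (a JSON array of alerts with `labels` and `annotations`). The label and annotation names are assumptions chosen for this example and do not reproduce Falcosidekick's actual mapping; the sample event is hypothetical.

```python
import json


def to_alertmanager(event: dict) -> list:
    """Build an Alertmanager v2 alert list from one Falco event (sketch)."""
    alert = {
        "labels": {
            "alertname": event["rule"],             # assumed label mapping
            "priority": event["priority"].lower(),
            "source": "falco",
        },
        "annotations": {"description": event.get("output", "")},
    }
    if "time" in event:
        alert["startsAt"] = event["time"]           # RFC 3339 timestamp
    return [alert]


# Hypothetical event shaped like Falco's JSON output.
event = {
    "rule": "Read sensitive file untrusted",
    "priority": "Warning",
    "output": "Sensitive file opened for reading (file=/etc/shadow)",
    "time": "2026-02-09T23:20:13.000000000Z",
}
payload = to_alertmanager(event)
body = json.dumps(payload)  # request body for a POST to /api/v2/alerts
```

The sketch stops at building the request body; the actual delivery, retries, and priority filtering are handled by Falcosidekick.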
## Detection Categories
Falco uses the default ruleset plus local overrides. Key detection categories include:
| Category | Example Rules |
|----------|---------------|
| Container escape | ptrace attach, mount namespace changes |
| Unexpected shells | Shell spawned in non-shell container |
| Sensitive file access | Reading `/etc/shadow`, `/etc/passwd` |
| Network anomalies | Unexpected outbound connections |
| Privilege escalation | setuid/setgid calls, capability changes |
| Cryptomining | Known mining pool connections, CPU abuse patterns |
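
To make the "unexpected shells" category concrete, here is a simplified Python predicate over Falco event fields. Real rules are written in Falco's own YAML rule language and carry more conditions; this sketch only assumes the `proc.name` and `container.id` fields and Falco's convention that host processes report `container.id` as `host`. The shell list and function name are hypothetical.

```python
# Illustrative set of shell binaries; the default ruleset uses its own list.
SHELLS = {"bash", "sh", "zsh", "ash", "dash", "ksh"}


def is_shell_in_container(fields: dict) -> bool:
    """Return True when a known shell process runs inside a container."""
    # Falco reports container.id == "host" for processes outside containers.
    in_container = fields.get("container.id", "host") != "host"
    return in_container and fields.get("proc.name") in SHELLS


suspicious = {"proc.name": "bash", "container.id": "3f2a9c1b7d10"}
host_shell = {"proc.name": "bash", "container.id": "host"}
app_process = {"proc.name": "python3", "container.id": "3f2a9c1b7d10"}
```

Only `suspicious` satisfies the predicate; a shell on the host or a non-shell process in a container does not.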
## Observability
**ServiceMonitor:** Enabled with label `release: prometheus`, scraping Falcosidekick metrics.
Alert flow: Falco → Falcosidekick → Alertmanager → ntfy → Discord (same pipeline as all other cluster alerts, documented in [ADR-0039](0039-alerting-notification-pipeline.md)).
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (runtime detection component)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md) (alerting pipeline)
* [Falco Documentation](https://falco.org/docs/)
* [Falcosidekick](https://github.com/falcosecurity/falcosidekick)
* [modern_ebpf driver](https://falco.org/docs/event-sources/kernel/modern-ebpf/)