docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
- ADR-0040: OPA Gatekeeper policy framework (constraint templates, progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos, Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled, ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
156
decisions/0041-falco-runtime-threat-detection.md
Normal file
@@ -0,0 +1,156 @@
# Falco Runtime Threat Detection

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Deploy runtime security monitoring to detect anomalous behavior and threats inside running containers

## Context and Problem Statement

Admission policies (Gatekeeper) and vulnerability scanning (Trivy) operate at deploy time and scan time, respectively. Neither detects runtime threats such as a container executing unexpected commands, opening unusual network connections, or reading sensitive files after it has been admitted.

How do we detect runtime security threats in a Talos Linux environment, where kernel modules cannot be compiled on the node?

## Decision Drivers

* Detect runtime threats that admission policies can't prevent
* Must work on Talos Linux (immutable root filesystem, no kernel headers)
* Alert on suspicious activity without blocking legitimate workloads
* Stream alerts into the existing notification pipeline (Alertmanager → ntfy → Discord)
* Minimal performance impact on AI/ML GPU workloads

## Considered Options

1. **Falco with modern eBPF driver**: CNCF runtime security
2. **Tetragon**: Cilium-based eBPF security observability
3. **Sysdig Secure**: Commercial runtime security
4. **No runtime detection**: Rely on admission policies only

## Decision Outcome

Chosen option: **Option 1 - Falco with modern eBPF**, because it's CNCF graduated, supports the modern eBPF driver required for Talos, and integrates with Alertmanager via Falcosidekick.

### Positive Consequences

* Detects container escapes, unexpected shells, and sensitive file reads at runtime
* The `modern_ebpf` driver works on Talos without kernel module compilation
* Falcosidekick routes alerts to Alertmanager, integrating with the existing pipeline
* JSON output enables structured log processing
* Runs on every node, including the control plane, via tolerations

### Negative Consequences

* eBPF instrumentation adds minor CPU/memory overhead per node
* Tuning rules to reduce false positives requires ongoing attention
* The Falcosidekick Web UI adds a Redis dependency, which it uses as its event storage backend

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Every Cluster Node                      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Falco (DaemonSet)                                  │    │
│  │  Driver: modern_ebpf (least-privileged)             │    │
│  │  Runtime: containerd socket                         │    │
│  │                                                     │    │
│  │  Kernel syscalls → eBPF probes → Rule evaluation    │    │
│  │                                 │                   │    │
│  │                                 ▼                   │    │
│  │                         JSON alert output           │    │
│  └─────────────────────────────────┬───────────────────┘    │
│                                    │ HTTP                   │
└────────────────────────────────────┼────────────────────────┘
                                     │
                                     ▼
                        ┌────────────────────────┐
                        │     Falcosidekick      │
                        │                        │
                        │  → Alertmanager        │
                        │  → Prometheus metrics  │
                        │  → Web UI              │
                        └────────────┬───────────┘
                                     │
                    ┌────────────────┼─────────────────┐
                    ▼                ▼                 ▼
             ┌──────────────┐  ┌────────────┐  ┌──────────────┐
             │ Alertmanager │  │ Prometheus │  │    Web UI    │
             │  → ntfy      │  │ (metrics)  │  │  (inspect)   │
             │  → Discord   │  └────────────┘  └──────────────┘
             └──────────────┘
```

## Deployment Configuration

| Setting | Value |
|---|---|
| **Chart** | `falco` from `https://falcosecurity.github.io/charts` |
| **Namespace** | `security` |
| **Driver** | `modern_ebpf` with `leastPrivileged: true` |
| **Container runtime** | containerd only (`/run/containerd/containerd.sock`) |
| **Output format** | JSON (`json_output: true`) |
| **Minimum priority** | `warning` |
| **Log destination** | stderr (syslog disabled) |
| **Buffered outputs** | `false` (immediate delivery) |

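The minimum-priority setting above acts as a severity threshold on each JSON alert. The following is a minimal sketch of that filtering logic, not Falco's own implementation. The two sample alerts are hypothetical, and their rule names and priorities are used for illustration only.

```python
import json

# Falco priority levels, ordered from most to least severe.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]


def passes_minimum_priority(alert_line: str, minimum: str = "warning") -> bool:
    """Return True if a JSON alert is at least as severe as `minimum`."""
    alert = json.loads(alert_line)
    level = alert["priority"].lower()
    # A lower index means a more severe priority.
    return PRIORITIES.index(level) <= PRIORITIES.index(minimum.lower())


# Hypothetical alert lines shaped like Falco's JSON output.
shell_alert = json.dumps({
    "rule": "Terminal shell in container",
    "priority": "Notice",
    "output": "A shell was spawned in a container (illustrative text)",
})
shadow_alert = json.dumps({
    "rule": "Read sensitive file untrusted",
    "priority": "Warning",
    "output": "Sensitive file opened for reading (illustrative text)",
})

print(passes_minimum_priority(shell_alert))   # False
print(passes_minimum_priority(shadow_alert))  # True
```

One consequence worth checking during rule tuning: any rule whose priority is below `warning` produces no alert under this setting unless its priority is raised in a local override.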
### Resources

| Resource | Request | Limit |
|----------|---------|-------|
| CPU | 100m | 1000m |
| Memory | 512Mi | 1024Mi |

### Node Coverage

Falco tolerates **all** taints (`NoSchedule` + `NoExecute` with `Exists` operator), ensuring it runs on every node including:

- Control plane nodes
- GPU worker nodes with dedicated taints
- ARM64 edge nodes

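To see why two key-less `Exists` tolerations cover every node type listed above, here is a small sketch of the Kubernetes toleration matching rules. It is a simplified model rather than scheduler code. The GPU taint key is hypothetical, and the control plane taint is the standard `node-role.kubernetes.io/control-plane` key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    operator: str = "Equal"
    effect: str = ""  # an empty effect matches every effect
    key: str = ""     # an empty key with Exists matches every key
    value: str = ""


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes taint/toleration match."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def schedulable(tolerations, taints) -> bool:
    """A pod fits if every blocking taint is tolerated by some toleration."""
    # PreferNoSchedule is only a soft preference, so it never blocks.
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# The two tolerations described above: no key, Exists operator.
FALCO_TOLERATIONS = [
    Toleration(operator="Exists", effect="NoSchedule"),
    Toleration(operator="Exists", effect="NoExecute"),
]

control_plane = [Taint("node-role.kubernetes.io/control-plane", "NoSchedule")]
gpu_worker = [Taint("example.com/gpu", "NoSchedule", "true")]  # hypothetical key

print(schedulable(FALCO_TOLERATIONS, control_plane))  # True
print(schedulable(FALCO_TOLERATIONS, gpu_worker))     # True
print(schedulable([], control_plane))                 # False
```

Because neither toleration names a key, a taint added to a node later is tolerated as well, so Falco keeps running there without a chart change.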
## Talos Linux Adaptations

| Challenge | Solution |
|-----------|----------|
| No kernel headers/module compilation | `modern_ebpf` driver (CO-RE probe embedded in the Falco binary at build time and loaded at runtime, so nothing is compiled on the node) |
| Immutable root filesystem | `leastPrivileged: true`, minimal host mounts |
| No syslog daemon | stderr-only logging, no syslog output |
| Containerd (not Docker/CRI-O) | Explicit containerd socket mount at `/run/containerd/containerd.sock` |

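The `modern_ebpf` driver avoids on-node compilation, but it still depends on kernel features: BTF type information and BPF ring buffer support, which arrived upstream in Linux 5.8. The sketch below illustrates that precondition check under stated assumptions. The function names are hypothetical, the release strings are examples, and distributions can backport features, so a version comparison is only a heuristic.

```python
import re
from pathlib import Path

# Standard location where a BTF-enabled kernel exposes its type information.
BTF_PATH = Path("/sys/kernel/btf/vmlinux")


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'major.minor' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def modern_ebpf_likely_supported(release: str, btf_available: bool) -> bool:
    """Heuristic: BTF is exposed and the kernel is 5.8 or newer."""
    return btf_available and kernel_at_least(release, 5, 8)


# Example release strings (illustrative values, not taken from this cluster).
print(modern_ebpf_likely_supported("6.6.58-talos", btf_available=True))   # True
print(modern_ebpf_likely_supported("5.4.0-generic", btf_available=True))  # False
print(modern_ebpf_likely_supported("6.6.58-talos", btf_available=False))  # False
```

On a live host, the inputs would come from `platform.release()` and `BTF_PATH.exists()`.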
## Falcosidekick (Alert Routing)

Falcosidekick receives Falco events over HTTP and fans them out to multiple targets:

| Target | Configuration | Minimum Priority |
|--------|---------------|------------------|
| Alertmanager | `http://alertmanager-operated.observability.svc.cluster.local:9093` | `warning` |
| Prometheus | Metrics exporter enabled | n/a |
| Web UI | Enabled (ClusterIP service) | n/a |

**Redis persistence:** 1Gi PVC on the `nfs-slow` StorageClass (NFS chosen for ARM node compatibility). Redis is the storage backend for the Web UI.

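For orientation, the sketch below shows roughly what forwarding a Falco event to Alertmanager involves: the Alertmanager v2 API accepts a JSON array of alerts, each with `labels`, `annotations`, and `startsAt`, posted to `/api/v2/alerts`. This is an illustration only. The label names, the `alertname` value, and the sample event are assumptions and do not reproduce Falcosidekick's actual mapping.

```python
import json


def to_alertmanager_alert(event: dict) -> dict:
    """Map a Falco JSON event onto an Alertmanager v2 alert object."""
    fields = event.get("output_fields", {})
    return {
        "labels": {
            "alertname": "FalcoRuntimeAlert",  # hypothetical alert name
            "source": "falco",
            "rule": event["rule"],
            "priority": event["priority"].lower(),
            "pod": str(fields.get("k8s.pod.name", "unknown")),
        },
        "annotations": {"description": event["output"]},
        "startsAt": event["time"],  # Falco timestamps are RFC 3339
    }


# Hypothetical event shaped like Falco's JSON output.
event = {
    "time": "2026-02-09T12:00:00.000000000Z",
    "rule": "Read sensitive file untrusted",
    "priority": "Warning",
    "output": "Sensitive file opened for reading (illustrative text)",
    "output_fields": {"k8s.pod.name": "example-pod"},
}

# Body that would be POSTed to <alertmanager>/api/v2/alerts.
payload = json.dumps([to_alertmanager_alert(event)], indent=2)
print(payload)
```

Carrying the priority as a label lets Alertmanager routes and inhibition rules treat Falco alerts like any other alert in the pipeline.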
## Detection Categories

Falco uses the default ruleset plus local overrides. Key detection categories include:

| Category | Example Rules |
|----------|---------------|
| Container escape | ptrace attach, mount namespace changes |
| Unexpected shells | Shell spawned in non-shell container |
| Sensitive file access | Reading `/etc/shadow`, `/etc/passwd` |
| Network anomalies | Unexpected outbound connections |
| Privilege escalation | setuid/setgid calls, capability changes |
| Cryptomining | Known mining pool connections, CPU abuse patterns |

## Observability

**ServiceMonitor:** Enabled with label `release: prometheus`, scraping Falcosidekick metrics.

Alert flow: Falco → Falcosidekick → Alertmanager → ntfy → Discord (same pipeline as all other cluster alerts, documented in [ADR-0039](0039-alerting-notification-pipeline.md)).

## Links

* Implements [ADR-0018](0018-security-policy-enforcement.md) (runtime detection component)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md) (alerting pipeline)
* [Falco Documentation](https://falco.org/docs/)
* [Falcosidekick](https://github.com/falcosecurity/falcosidekick)
* [modern_ebpf driver](https://falco.org/docs/event-sources/kernel/modern-ebpf/)