Falco Runtime Threat Detection

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Deploy runtime security monitoring to detect anomalous behavior and threats inside running containers

Context and Problem Statement

Admission policies (Gatekeeper) and vulnerability scanning (Trivy) operate at deploy-time and scan-time respectively, but neither detects runtime threats — a container executing unexpected commands, opening unusual network connections, or reading sensitive files after it has been admitted.

How do we detect runtime security threats in a Talos Linux environment where kernel module compilation is impossible?

Decision Drivers

  • Detect runtime threats that admission policies can't prevent
  • Must work on Talos Linux (immutable root filesystem, no kernel headers)
  • Alert on suspicious activity without blocking legitimate workloads
  • Stream alerts into the existing notification pipeline (Alertmanager → ntfy → Discord)
  • Minimal performance impact on AI/ML GPU workloads

Considered Options

  1. Falco with modern eBPF driver — CNCF runtime security
  2. Tetragon — Cilium-based eBPF security observability
  3. Sysdig Secure — Commercial runtime security
  4. No runtime detection — Rely on admission policies only

Decision Outcome

Chosen option: Option 1 - Falco with modern eBPF, because it's CNCF graduated, supports the modern eBPF driver required for Talos, and integrates with Alertmanager via Falcosidekick.

Positive Consequences

  • Detects container escapes, unexpected shells, sensitive file reads at runtime
  • modern_ebpf driver works on Talos without kernel module compilation
  • Falcosidekick routes alerts to Alertmanager, integrating with existing pipeline
  • JSON output enables structured log processing
  • Runs on every node including control plane via tolerations

Negative Consequences

  • eBPF instrumentation adds minor CPU/memory overhead per node
  • Tuning rules to reduce false positives requires ongoing attention
  • The Falcosidekick Web UI adds a Redis dependency for event storage

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Every Cluster Node                         │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Falco (DaemonSet)                                   │    │
│  │  Driver: modern_ebpf (least-privileged)              │    │
│  │  Runtime: containerd socket                          │    │
│  │                                                     │    │
│  │  Kernel syscalls → eBPF probes → Rule evaluation    │    │
│  │                                       │             │    │
│  │                                       ▼             │    │
│  │                              JSON alert output     │    │
│  └─────────────────────────────────┬───────────────────┘    │
│                                    │ HTTP                    │
└────────────────────────────────────┼────────────────────────┘
                                     │
                                     ▼
                        ┌────────────────────────┐
                        │     Falcosidekick       │
                        │                        │
                        │  → Alertmanager        │
                        │  → Prometheus metrics  │
                        │  → Web UI              │
                        └────────────┬───────────┘
                                     │
                    ┌────────────────┼─────────────────┐
                    ▼                ▼                  ▼
            ┌──────────────┐ ┌────────────┐   ┌──────────────┐
            │ Alertmanager │ │ Prometheus │   │  Web UI      │
            │ → ntfy       │ │ (metrics)  │   │ (inspect)    │
            │ → Discord    │ └────────────┘   └──────────────┘
            └──────────────┘

Deployment Configuration

| Setting | Value |
|---|---|
| Chart | falco from https://falcosecurity.github.io/charts |
| Namespace | security |
| Driver | modern_ebpf with leastPrivileged: true |
| Container runtime | Containerd only (/run/containerd/containerd.sock) |
| Output format | JSON (json_output: true) |
| Minimum priority | warning |
| Log destination | stderr (syslog disabled) |
| Buffered outputs | false (immediate delivery) |
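
As a rough illustration, the settings above could map onto the chart's values file roughly as follows. This is a sketch, not the repository's actual values: the key names are assumed from the upstream falco chart's values.yaml and Falco's falco.yaml, the collectors layout in particular differs between chart versions, and the mapping of "stderr (syslog disabled)" to the log_* and syslog_output keys is an assumption.

```yaml
# Hypothetical values.yaml fragment for the falco chart (verify keys against the chart version in use)
driver:
  kind: modern_ebpf
  modernEbpf:
    leastPrivileged: true      # run with a reduced capability set instead of privileged

collectors:
  docker:
    enabled: false
  crio:
    enabled: false
  containerd:
    enabled: true
    socket: /run/containerd/containerd.sock

falco:
  json_output: true            # structured alerts for downstream processing
  priority: warning            # drop events below this priority
  log_stderr: true             # Falco's own logs go to stderr
  log_syslog: false
  syslog_output:
    enabled: false             # assumed: alert output to syslog is also off
  buffered_outputs: false      # deliver alerts immediately
```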

Resources

| CPU Request/Limit | Memory Request/Limit |
|---|---|
| 100m / 1000m | 512Mi / 1024Mi |
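
Expressed as a Kubernetes resources block, the figures above would look something like this. It is a sketch that assumes the chart exposes a top-level resources key for the Falco container.

```yaml
# Hypothetical values fragment; only the numbers come from the table above
resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1024Mi
```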

Node Coverage

Falco tolerates all NoSchedule and NoExecute taints (tolerations using the Exists operator), so the DaemonSet is scheduled on every node, including:

  • Control plane nodes
  • GPU worker nodes with dedicated taints
  • ARM64 edge nodes
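
A minimal sketch of such tolerations, assuming the chart passes a top-level tolerations list through to the DaemonSet pod spec. A toleration with operator Exists and no key matches every taint that carries the given effect.

```yaml
# Hypothetical values fragment: tolerate any taint with these effects
tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
```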

Talos Linux Adaptations

| Challenge | Solution |
|---|---|
| No kernel headers/module compilation | modern_ebpf driver (compiles at build-time, loads at runtime) |
| Immutable root filesystem | leastPrivileged: true (minimal host mounts) |
| No syslog daemon | stderr-only logging, no syslog output |
| Containerd (not Docker/CRI-O) | Explicit containerd socket mount at /run/containerd/containerd.sock |

Falcosidekick (Alert Routing)

Falcosidekick receives Falco events over HTTP (Falco's http_output) and fans them out to multiple targets:

| Target | Configuration | Minimum Priority |
|---|---|---|
| Alertmanager | http://alertmanager-operated.observability.svc.cluster.local:9093 | warning |
| Prometheus | Metrics exporter enabled | |
| Web UI | Enabled (ClusterIP service) | |

Redis persistence (event store for the Web UI): 1Gi PVC on nfs-slow StorageClass (NFS chosen for ARM node compatibility).
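
The routing and storage settings above might be expressed along these lines. This is a sketch under the assumption that Falcosidekick is deployed as the falco chart's bundled subchart; the key names (config.alertmanager.hostport, config.alertmanager.minimumpriority, webui.redis.storageSize, webui.redis.storageClass) are taken from the upstream falcosidekick chart as best understood and should be checked against the pinned version.

```yaml
# Hypothetical values fragment for the falcosidekick subchart
falcosidekick:
  enabled: true
  config:
    alertmanager:
      hostport: http://alertmanager-operated.observability.svc.cluster.local:9093
      minimumpriority: warning   # forward only warning and above
  webui:
    enabled: true
    redis:
      storageSize: 1Gi
      storageClass: nfs-slow
```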

Detection Categories

Falco uses the default ruleset plus local overrides. Key detection categories include:

| Category | Example Rules |
|---|---|
| Container escape | ptrace attach, mount namespace changes |
| Unexpected shells | Shell spawned in non-shell container |
| Sensitive file access | Reading /etc/shadow, /etc/passwd |
| Network anomalies | Unexpected outbound connections |
| Privilege escalation | setuid/setgid calls, capability changes |
| Cryptomining | Known mining pool connections, CPU abuse patterns |
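
For illustration, a local override could append an exception to a default rule rather than redefining it. The sketch below assumes the chart's customRules map and the override syntax available in recent Falco releases. The image name is hypothetical and the ADR does not say which rules are actually overridden.

```yaml
# Hypothetical values fragment: one local rules file layered on the default ruleset
customRules:
  local-overrides.yaml: |-
    # Append an exception to the default rule's condition (image name is a placeholder)
    - rule: Read sensitive file untrusted
      condition: and not container.image.repository = "registry.example/backup-agent"
      override:
        condition: append
```

Appending keeps the upstream condition intact, so updates to the default ruleset still apply and only the exception lives locally.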

Observability

ServiceMonitor: Enabled with label release: prometheus, scraping Falcosidekick metrics.
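
A sketch of the corresponding values, assuming the falcosidekick subchart's serviceMonitor.enabled and serviceMonitor.additionalLabels keys. The release: prometheus label is what lets a kube-prometheus-stack style Prometheus select the ServiceMonitor.

```yaml
# Hypothetical values fragment; key names assumed from the falcosidekick chart
falcosidekick:
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
```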

Alert flow: Falco → Falcosidekick → Alertmanager → ntfy → Discord (same pipeline as all other cluster alerts, documented in ADR-0039).