docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts
2026-02-09 18:20:13 -05:00
parent fbd5e0bb70
commit 1bc602b726
5 changed files with 474 additions and 2 deletions


@@ -0,0 +1,139 @@
# Trivy Operator Vulnerability Scanning
* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Continuously scan cluster workloads for vulnerabilities, misconfigurations, RBAC issues, and exposed secrets
## Context and Problem Statement
Container images accumulate vulnerabilities over time as new CVEs are disclosed. Without continuous scanning, the cluster's security posture degrades silently between deployments. Additionally, Kubernetes resource misconfigurations and overly permissive RBAC can create attack surfaces.
How do we maintain continuous visibility into the security posture of all cluster workloads, including images that haven't been rebuilt recently?
## Decision Drivers
* Continuous scanning — not just at build-time
* Cover multiple security dimensions (CVEs, misconfig, RBAC, secrets)
* Results stored as Kubernetes CRDs for GitOps-friendly querying
* Prometheus metrics for alerting and Grafana dashboards
* Must work on Talos Linux and heterogeneous (amd64 + arm64) clusters
## Decision Outcome
Deploy **Trivy Operator** in standalone mode with all applicable scanners enabled, explicitly disabling infra assessment for Talos compatibility, and scheduling scan jobs preferentially on ARM64 nodes to offload work from GPU nodes.
## Deployment Configuration
| Setting | Value |
|---|---|
| **Chart** | `trivy-operator` from `https://aquasecurity.github.io/helm-charts` |
| **Namespace** | `security` |
| **Mode** | Standalone (embedded database, no external Trivy server) |
| **Severity filter** | All levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL |
| **Ignore unfixed** | `false` (reports all vulnerabilities, even without patches) |
| **Scan timeout** | 10 minutes |
| **Concurrent scan jobs** | 10 |
| **Slow mode** | `true` (reduces resource usage at cost of scan speed) |
| **Compliance cron** | `0 */6 * * *` (every 6 hours) |
### Scan Job Resources
| CPU Request/Limit | Memory Request/Limit |
|-------------------|----------------------|
| 100m / 500m | 100M / 500M |
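
The settings in the two tables above could be expressed as Helm values along these lines. This is a sketch, not the actual values file: the key names are assumed to match the `trivy-operator` chart's `values.yaml`, and mapping "Scan timeout" to `operator.scanJobTimeout` is a guess, so verify both against the chart version in use.

```yaml
# Hypothetical values.yaml excerpt for the trivy-operator chart.
operator:
  scanJobTimeout: 10m            # assumed mapping for "Scan timeout"
  scanJobsConcurrentLimit: 10    # "Concurrent scan jobs"

trivy:
  mode: Standalone               # embedded database, no external Trivy server
  severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  ignoreUnfixed: false           # report findings even when no fix exists
  slow: true                     # trade scan speed for lower resource usage
  resources:                     # applied to each scan job container
    requests:
      cpu: 100m
      memory: 100M
    limits:
      cpu: 500m
      memory: 500M

compliance:
  cron: "0 */6 * * *"            # every 6 hours
```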
## Scanners
| Scanner | Status | Purpose |
|---------|--------|---------|
| Vulnerability | **Enabled** | CVE scanning of container images |
| Config Audit | **Enabled** | Kubernetes resource misconfiguration checks |
| RBAC Assessment | **Enabled** | Overly permissive RBAC analysis |
| Exposed Secrets | **Enabled** | Detect secrets leaked in image layers/env vars |
| Cluster Compliance | **Enabled** | CIS benchmark compliance reports |
| Infra Assessment | **Disabled** | Requires `/etc/systemd` paths — incompatible with Talos |
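
As a rough illustration, the scanner matrix above might translate into the following chart toggles. The flag names are assumed to match the `operator.*` keys of the `trivy-operator` chart and should be checked against the deployed chart version.

```yaml
# Hypothetical values.yaml excerpt: one toggle per scanner in the table.
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true
  infraAssessmentScannerEnabled: false   # disabled for Talos compatibility
```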
### Report CRDs
Trivy Operator stores all scan results as Kubernetes custom resources:
| CRD | Content |
|-----|---------|
| `VulnerabilityReport` | CVEs per container image with severity, fix version |
| `ConfigAuditReport` | Kubernetes misconfiguration findings |
| `RbacAssessmentReport` | RBAC privilege escalation risks |
| `ExposedSecretReport` | Secrets found in environment variables or image layers |
| `ClusterComplianceReport` | CIS benchmark compliance status |
## Talos Linux Adaptations
| Challenge | Solution |
|-----------|----------|
| No `/etc/systemd` paths | Infra assessment scanner disabled |
| No standard `/var/lib/kubelet` | `nodeCollector` volumes and volumeMounts set to empty `[]` |
| Immutable root filesystem | Standalone mode — database cached in operator pod |
Talos is inherently security-hardened (immutable OS, no SSH, API-only management), making the infra assessment scanner redundant for the OS layer.
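
The `nodeCollector` adaptation from the table could look like this in the chart values. This is a minimal sketch that assumes the chart exposes `nodeCollector.volumes` and `nodeCollector.volumeMounts` as plain lists.

```yaml
# Hypothetical values.yaml excerpt: drop the default host-path mounts,
# which point at paths that Talos does not provide.
nodeCollector:
  volumes: []
  volumeMounts: []
```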
## Scan Job Scheduling
Scan jobs use node affinity to prefer **ARM64 nodes** (weight 100) and tolerate the `node.kubernetes.io/arch=arm64:NoSchedule` taint:
```yaml
tolerations:
  - key: node.kubernetes.io/arch
    value: arm64
    effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: [arm64]
This offloads scan work from GPU nodes (amd64) to ARM64 edge workers, avoiding interference with AI/ML inference workloads.
## Observability
### Prometheus Metrics
All four metric categories are enabled through the following chart flags:
| Flag | Purpose |
|--------|---------|
| `metricsFindingsEnabled` | Vulnerability counts by severity |
| `metricsConfigAuditInfo` | Misconfiguration findings |
| `metricsRbacAssessmentInfo` | RBAC assessment results |
| `metricsClusterComplianceInfo` | Compliance benchmark status |
**ServiceMonitor:** Enabled with label `release: prometheus`.
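
A possible values fragment for the metrics and ServiceMonitor settings described above is sketched below. It assumes the four flags live under `operator` and that the chart accepts extra ServiceMonitor labels under `serviceMonitor.labels`; neither is confirmed by this ADR.

```yaml
# Hypothetical values.yaml excerpt for metrics exposure.
operator:
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
  metricsRbacAssessmentInfo: true
  metricsClusterComplianceInfo: true

serviceMonitor:
  enabled: true
  labels:
    release: prometheus   # lets the kube-prometheus-stack Prometheus select it
```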
### Grafana Dashboards
| Dashboard | Grafana ID | Purpose |
|-----------|------------|---------|
| Trivy Operator Vulnerabilities | #17813 | CVE overview by severity, namespace, image |
| Trivy Image Scan | #16337 | Detailed per-image scan results |
### Alerting
Vulnerability metrics feed into Prometheus alerting rules defined in kube-prometheus-stack (see [ADR-0039](0039-alerting-notification-pipeline.md)):
```
Trivy scans → VulnerabilityReport CRDs → Prometheus metrics
→ AlertRules (critical CVEs) → Alertmanager → ntfy → Discord
```
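
The "AlertRules (critical CVEs)" step in the pipeline above could be implemented with a `PrometheusRule` like the one below. This is an illustrative sketch only: the rule name, alert name, `for` duration, and grouping labels are hypothetical, and it assumes the operator exports `trivy_image_vulnerabilities` with a `severity="Critical"` label value. The actual rules are defined in kube-prometheus-stack and may differ.

```yaml
# Hypothetical PrometheusRule; all names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-critical-vulnerabilities
  namespace: security
  labels:
    release: prometheus          # assumed to match the Prometheus rule selector
spec:
  groups:
    - name: trivy.rules
      rules:
        - alert: TrivyCriticalVulnerabilities
          # Fire when any image in a namespace reports critical CVEs.
          expr: |
            sum by (namespace, image_repository, image_tag) (
              trivy_image_vulnerabilities{severity="Critical"}
            ) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: >-
              Critical CVEs in {{ $labels.image_repository }}:{{ $labels.image_tag }}
              (namespace {{ $labels.namespace }})
```

Grouping by image rather than by workload keeps one alert per affected image even when several pods share it.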
## Links
* Implements [ADR-0018](0018-security-policy-enforcement.md) (scanning component)
* Related to [ADR-0039](0039-alerting-notification-pipeline.md) (alerting pipeline)
* Related to [ADR-0035](0035-arm64-worker-strategy.md) (ARM64 scan job scheduling)
* [Trivy Operator Documentation](https://aquasecurity.github.io/trivy-operator/)
* [Trivy Vulnerability Database](https://github.com/aquasecurity/trivy-db)