Trivy Operator Vulnerability Scanning
- Status: accepted
- Date: 2026-02-09
- Deciders: Billy
- Technical Story: Continuously scan cluster workloads for vulnerabilities, misconfigurations, RBAC issues, and exposed secrets
Context and Problem Statement
Container images accumulate vulnerabilities over time as new CVEs are disclosed. Without continuous scanning, the cluster's security posture degrades silently between deployments. Additionally, Kubernetes resource misconfigurations and overly permissive RBAC can create attack surfaces.
How do we maintain continuous visibility into the security posture of all cluster workloads, including images that haven't been rebuilt recently?
Decision Drivers
- Continuous scanning, not just at build time
- Cover multiple security dimensions (CVEs, misconfig, RBAC, secrets)
- Results stored as Kubernetes CRDs for GitOps-friendly querying
- Prometheus metrics for alerting and Grafana dashboards
- Must work on Talos Linux and heterogeneous (amd64 + arm64) clusters
Decision Outcome
Deploy Trivy Operator in standalone mode with all applicable scanners enabled, explicitly disabling infra assessment for Talos compatibility, and scheduling scan jobs preferentially on ARM64 nodes to offload work from GPU nodes.
Deployment Configuration
| Setting | Value |
|---|---|
| Chart | trivy-operator from https://aquasecurity.github.io/helm-charts |
| Namespace | security |
| Mode | Standalone (embedded database, no external Trivy server) |
| Severity filter | All levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL |
| Ignore unfixed | false (reports all vulnerabilities, even without patches) |
| Scan timeout | 10 minutes |
| Concurrent scan jobs | 10 |
| Slow mode | true (reduces resource usage at cost of scan speed) |
| Compliance cron | 0 */6 * * * (every 6 hours) |
Scan Job Resources
| CPU Request/Limit | Memory Request/Limit |
|---|---|
| 100m / 500m | 100M / 500M |
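As a rough illustration, the deployment settings and scan job resources listed above might map onto Helm values along the following lines. This is a sketch, not the repository's actual values file: the key names (`trivy.mode`, `trivy.severity`, `trivy.ignoreUnfixed`, `trivy.slow`, `trivy.resources`, `operator.scanJobTimeout`, `operator.scanJobsConcurrentLimit`, `compliance.cron`) are assumed from the upstream chart's `values.yaml` and should be checked against the chart version in use.

```yaml
# Sketch only: key names are assumed from the upstream trivy-operator chart.
trivy:
  mode: Standalone              # embedded database, no external Trivy server
  severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  ignoreUnfixed: false          # also report CVEs that have no patch yet
  slow: true                    # lower resource usage, slower scans
  resources:                    # assumed to apply to scan job containers
    requests:
      cpu: 100m
      memory: 100M
    limits:
      cpu: 500m
      memory: 500M
operator:
  scanJobTimeout: 10m           # assumed to be the "scan timeout" above
  scanJobsConcurrentLimit: 10
compliance:
  cron: "0 */6 * * *"           # every 6 hours
```

Quoting the cron expression keeps YAML parsers from misreading the leading `0` and the `*` characters.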
Scanners
| Scanner | Status | Purpose |
|---|---|---|
| Vulnerability | Enabled | CVE scanning of container images |
| Config Audit | Enabled | Kubernetes resource misconfiguration checks |
| RBAC Assessment | Enabled | Overly permissive RBAC analysis |
| Exposed Secrets | Enabled | Detect secrets leaked in image layers/env vars |
| Cluster Compliance | Enabled | CIS benchmark compliance reports |
| Infra Assessment | Disabled | Requires /etc/systemd paths — incompatible with Talos |
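The scanner matrix above could be expressed as a set of boolean toggles. The flag names below are assumed from the upstream chart's `operator.*` values and are not taken from this repository, so treat them as a sketch to verify against the chart in use.

```yaml
# Sketch only: flag names assumed from the upstream trivy-operator chart.
operator:
  vulnerabilityScannerEnabled: true      # CVE scanning of container images
  configAuditScannerEnabled: true        # Kubernetes misconfiguration checks
  rbacAssessmentScannerEnabled: true     # overly permissive RBAC analysis
  exposedSecretScannerEnabled: true      # secrets leaked in images
  clusterComplianceEnabled: true         # CIS benchmark reports
  infraAssessmentScannerEnabled: false   # disabled for Talos compatibility
```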
Report CRDs
Trivy stores all scan results as Kubernetes custom resources:
| CRD | Content |
|---|---|
| `VulnerabilityReport` | CVEs per container image with severity, fix version |
| `ConfigAuditReport` | Kubernetes misconfiguration findings |
| `RbacAssessmentReport` | RBAC privilege escalation risks |
| `ExposedSecretReport` | Secrets found in environment variables or image layers |
| `ClusterComplianceReport` | CIS benchmark compliance status |
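To make the report format more concrete, a `VulnerabilityReport` might look roughly like the abbreviated resource below. The field layout is assumed from the operator's `aquasecurity.github.io/v1alpha1` API. Every value in angle brackets is a placeholder, not real scan output, and the counts are invented for illustration.

```yaml
# Illustrative shape only: placeholders, not real scan results.
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: <workload-and-container-derived-name>
  namespace: <workload-namespace>
report:
  artifact:
    repository: <image-repository>
    tag: <image-tag>
  summary:                 # per-severity totals for this image (invented)
    criticalCount: 0
    highCount: 1
    mediumCount: 0
    lowCount: 0
    unknownCount: 0
  vulnerabilities:
    - vulnerabilityID: <CVE-ID>
      resource: <affected-package>
      installedVersion: <installed-version>
      fixedVersion: <fixed-version>   # empty when no fix is available
      severity: HIGH
```

Because reports are ordinary namespaced resources, they can be listed and filtered with the same tooling used for any other Kubernetes object.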
Talos Linux Adaptations
| Challenge | Solution |
|---|---|
| No `/etc/systemd` paths | Infra assessment scanner disabled |
| No standard `/var/lib/kubelet` | `nodeCollector` volumes and volumeMounts set to empty `[]` |
| Immutable root filesystem | Standalone mode — database cached in operator pod |
Talos is inherently security-hardened (immutable OS, no SSH, API-only management), making the infra assessment scanner redundant for the OS layer.
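The Talos adaptations above amount to a small set of overrides. A minimal sketch, assuming the upstream chart exposes `nodeCollector.volumes` and `nodeCollector.volumeMounts` and the `operator.infraAssessmentScannerEnabled` flag under these names:

```yaml
# Sketch only: key names assumed from the upstream trivy-operator chart.
operator:
  infraAssessmentScannerEnabled: false   # needs /etc/systemd paths absent on Talos
nodeCollector:
  volumes: []        # drop host path volumes that assume a standard node layout
  volumeMounts: []   # and the corresponding mounts
```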
Scan Job Scheduling
Scan jobs use affinity to prefer ARM64 nodes (weight 100) and tolerate the arm64 taint:

    # Allow scan jobs onto nodes that carry the arm64 taint
    tolerations:
      - key: node.kubernetes.io/arch
        value: arm64
        effect: NoSchedule
    # Prefer, but do not require, ARM64 nodes
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values: [arm64]

This offloads scan work from GPU nodes (amd64) to ARM64 edge workers, avoiding interference with AI/ML inference workloads.
Observability
Prometheus Metrics
All four metric categories enabled:
| Metric | Purpose |
|---|---|
| `metricsFindingsEnabled` | Vulnerability counts by severity |
| `metricsConfigAuditInfo` | Misconfiguration findings |
| `metricsRbacAssessmentInfo` | RBAC assessment results |
| `metricsClusterComplianceInfo` | Compliance benchmark status |
ServiceMonitor: Enabled with label release: prometheus.
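In Helm values, the metric flags and ServiceMonitor setting above might look like the sketch below. Nesting the flags under `operator` and the label under `serviceMonitor.labels` is an assumption based on the upstream chart layout, not something confirmed by this repository.

```yaml
# Sketch only: nesting assumed from the upstream trivy-operator chart.
operator:
  metricsFindingsEnabled: true          # vulnerability counts by severity
  metricsConfigAuditInfo: true          # misconfiguration findings
  metricsRbacAssessmentInfo: true       # RBAC assessment results
  metricsClusterComplianceInfo: true    # compliance benchmark status
serviceMonitor:
  enabled: true
  labels:
    release: prometheus   # lets the Prometheus instance select this monitor
```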
Grafana Dashboards
| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Trivy Operator Vulnerabilities | #17813 | CVE overview by severity, namespace, image |
| Trivy Image Scan | #16337 | Detailed per-image scan results |
Alerting
Vulnerability metrics feed into Prometheus alerting rules defined in kube-prometheus-stack (see ADR-0039):
Trivy scans → VulnerabilityReport CRDs → Prometheus metrics
→ AlertRules (critical CVEs) → Alertmanager → ntfy → Discord
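A critical-CVE rule in this pipeline might look something like the following `PrometheusRule`. This is a hypothetical sketch: the rule name, alert name, `for` duration and `release: prometheus` selector label are illustrative, and the `trivy_image_vulnerabilities` metric with its `severity`, `image_repository` and `image_tag` labels is assumed from the operator's exported metrics. The actual rules live in the kube-prometheus-stack configuration described in ADR-0039.

```yaml
# Hypothetical sketch: names, threshold and duration are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-critical-cves
  namespace: security
  labels:
    release: prometheus   # assumed to match the Prometheus rule selector
spec:
  groups:
    - name: trivy.rules
      rules:
        - alert: TrivyCriticalVulnerabilities
          # Fire when any image reports at least one critical CVE
          expr: >-
            sum by (namespace, image_repository, image_tag)
            (trivy_image_vulnerabilities{severity="Critical"}) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Critical CVEs in {{ $labels.image_repository }}:{{ $labels.image_tag }}"
```

Aggregating by image rather than by workload keeps one alert per affected image even when several pods run it.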
Links
- Implements ADR-0018 (scanning component)
- Related to ADR-0039 (alerting pipeline)
- Related to ADR-0035 (ARM64 scan job scheduling)
- Trivy Operator Documentation
- Trivy Vulnerability Database