Trivy Operator Vulnerability Scanning

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Continuously scan cluster workloads for vulnerabilities, misconfigurations, RBAC issues, and exposed secrets

Context and Problem Statement

Container images accumulate vulnerabilities over time as new CVEs are disclosed. Without continuous scanning, the cluster's security posture degrades silently between deployments. Additionally, Kubernetes resource misconfigurations and overly permissive RBAC can create attack surfaces.

How do we maintain continuous visibility into the security posture of all cluster workloads, including images that haven't been rebuilt recently?

Decision Drivers

  • Continuous scanning — not just at build-time
  • Cover multiple security dimensions (CVEs, misconfig, RBAC, secrets)
  • Results stored as Kubernetes CRDs for GitOps-friendly querying
  • Prometheus metrics for alerting and Grafana dashboards
  • Must work on Talos Linux and heterogeneous (amd64 + arm64) clusters

Decision Outcome

Deploy Trivy Operator in standalone mode with all applicable scanners enabled, explicitly disabling infra assessment for Talos compatibility, and scheduling scan jobs preferentially on ARM64 nodes to offload work from GPU nodes.

Deployment Configuration

| Setting | Value |
|---|---|
| Chart | trivy-operator from https://aquasecurity.github.io/helm-charts |
| Namespace | security |
| Mode | Standalone (embedded database, no external Trivy server) |
| Severity filter | All levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL |
| Ignore unfixed | false (reports all vulnerabilities, even without patches) |
| Scan timeout | 10 minutes |
| Concurrent scan jobs | 10 |
| Slow mode | true (reduces resource usage at cost of scan speed) |
| Compliance cron | `0 */6 * * *` (every 6 hours) |

Scan Job Resources

| CPU Request/Limit | Memory Request/Limit |
|---|---|
| 100m / 500m | 100M / 500M |
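
The settings and scan job resources above map onto Helm values roughly as follows. This is a minimal sketch, not the deployed values file: the key names (`trivy.mode`, `operator.scanJobTimeout`, `compliance.cron`, and so on) are assumptions based on the upstream chart's `values.yaml` and may differ between chart versions, so verify them against the chart release in use.

```yaml
# Sketch of Helm values for the trivy-operator chart.
# Key names are assumptions; check the values.yaml of the chart version in use.
trivy:
  mode: Standalone                              # embedded DB, no external Trivy server
  severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL    # report every severity level
  ignoreUnfixed: false                          # include CVEs that have no fix yet
  slow: true                                    # lower resource usage, slower scans
  resources:                                    # applied to scan job containers
    requests:
      cpu: 100m
      memory: 100M
    limits:
      cpu: 500m
      memory: 500M

operator:
  scanJobTimeout: 10m                           # per-scan timeout
  scanJobsConcurrentLimit: 10                   # maximum parallel scan jobs

compliance:
  cron: "0 */6 * * *"                           # compliance reports every 6 hours
```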

Scanners

| Scanner | Status | Purpose |
|---|---|---|
| Vulnerability | Enabled | CVE scanning of container images |
| Config Audit | Enabled | Kubernetes resource misconfiguration checks |
| RBAC Assessment | Enabled | Overly permissive RBAC analysis |
| Exposed Secrets | Enabled | Detect secrets leaked in image layers/env vars |
| Cluster Compliance | Enabled | CIS benchmark compliance reports |
| Infra Assessment | Disabled | Requires /etc/systemd paths (incompatible with Talos) |
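
As a rough illustration, the scanner selection above could be expressed as chart values like the following. The flag names and their placement under `operator` are assumptions taken from the upstream chart's conventions rather than from this cluster's values file, so treat them as a sketch to be checked against the chart version in use.

```yaml
# Scanner toggles (assumed key names under the chart's `operator` section).
operator:
  vulnerabilityScannerEnabled: true      # CVE scanning of container images
  configAuditScannerEnabled: true        # resource misconfiguration checks
  rbacAssessmentScannerEnabled: true     # overly permissive RBAC analysis
  exposedSecretScannerEnabled: true      # secrets in image layers / env vars
  clusterComplianceEnabled: true         # CIS benchmark compliance reports
  infraAssessmentScannerEnabled: false   # disabled for Talos compatibility
```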

Report CRDs

Trivy stores all scan results as Kubernetes custom resources:

| CRD | Content |
|---|---|
| VulnerabilityReport | CVEs per container image with severity, fix version |
| ConfigAuditReport | Kubernetes misconfiguration findings |
| RbacAssessmentReport | RBAC privilege escalation risks |
| ExposedSecretReport | Secrets found in environment variables or image layers |
| ClusterComplianceReport | CIS benchmark compliance status |
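
To give a sense of what these resources look like, here is an abridged, hypothetical VulnerabilityReport. Every name and value is a placeholder invented for illustration, not output from this cluster, and the field layout is an assumption based on the operator's `v1alpha1` API that should be confirmed against the installed CRD.

```yaml
# Abridged, hypothetical VulnerabilityReport (all values are placeholders).
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: replicaset-example-app-abc123-example   # hypothetical report name
  namespace: example                            # hypothetical namespace
report:
  artifact:
    repository: library/example                 # image the report refers to
    tag: "1.0.0"
  summary:                                      # counts per severity
    criticalCount: 1
    highCount: 0
    mediumCount: 0
    lowCount: 0
    unknownCount: 0
  vulnerabilities:
    - vulnerabilityID: CVE-0000-00000           # placeholder ID
      resource: example-lib                     # affected package
      installedVersion: 1.2.3
      fixedVersion: 1.2.4                       # empty when no fix exists
      severity: CRITICAL
```

Because the results are ordinary custom resources, they can be listed and filtered with the same tooling used for any other Kubernetes object.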

Talos Linux Adaptations

| Challenge | Solution |
|---|---|
| No /etc/systemd paths | Infra assessment scanner disabled |
| No standard /var/lib/kubelet | nodeCollector volumes and volumeMounts set to empty [] |
| Immutable root filesystem | Standalone mode (database cached in operator pod) |

Talos is inherently security-hardened (immutable OS, no SSH, API-only management), making the infra assessment scanner redundant for the OS layer.
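
The nodeCollector adaptation amounts to overriding the chart's default host-path mounts with empty lists. A minimal sketch, assuming the chart exposes these as `nodeCollector.volumes` and `nodeCollector.volumeMounts`:

```yaml
# Clear the default host-path mounts that do not exist on Talos.
# Key names are assumptions; confirm them in the chart's values.yaml.
nodeCollector:
  volumes: []        # no hostPath volumes for the node collector
  volumeMounts: []   # and therefore nothing to mount
```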

Scan Job Scheduling

Scan jobs use affinity to prefer ARM64 nodes (weight 100) and tolerate the arm64 taint:

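# Allow scan jobs onto nodes carrying the arm64 taint.
tolerations:
  - key: node.kubernetes.io/arch
    value: arm64
    effect: NoSchedule

# Soft preference: schedule on arm64 when a node is available,
# otherwise fall back to any other schedulable node.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: [arm64]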
This offloads scan work from GPU nodes (amd64) to ARM64 edge workers, avoiding interference with AI/ML inference workloads.

Observability

Prometheus Metrics

All four metric categories enabled:

| Metric | Purpose |
|---|---|
| metricsFindingsEnabled | Vulnerability counts by severity |
| metricsConfigAuditInfo | Misconfiguration findings |
| metricsRbacAssessmentInfo | RBAC assessment results |
| metricsClusterComplianceInfo | Compliance benchmark status |

ServiceMonitor: Enabled with label release: prometheus.
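
A sketch of the corresponding chart values, assuming the metric flags live under `operator` and the ServiceMonitor is configured through a top-level `serviceMonitor` section (both placements are assumptions to verify against the chart version in use):

```yaml
# Metrics and ServiceMonitor settings (assumed key placement).
operator:
  metricsFindingsEnabled: true          # vulnerability counts by severity
  metricsConfigAuditInfo: true          # misconfiguration findings
  metricsRbacAssessmentInfo: true       # RBAC assessment results
  metricsClusterComplianceInfo: true    # compliance benchmark status

serviceMonitor:
  enabled: true
  labels:
    release: prometheus                 # label the Prometheus instance selects on
```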

Grafana Dashboards

| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Trivy Operator Vulnerabilities | #17813 | CVE overview by severity, namespace, image |
| Trivy Image Scan | #16337 | Detailed per-image scan results |

Alerting

Vulnerability metrics feed into Prometheus alerting rules defined in kube-prometheus-stack (see ADR-0039):

Trivy scans → VulnerabilityReport CRDs → Prometheus metrics
  → AlertRules (critical CVEs) → Alertmanager → ntfy → Discord
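
For illustration only, an alert rule for critical CVEs might look like the sketch below. It is not the rule defined in ADR-0039: the resource name, alert name, `for` duration, and threshold are hypothetical, and the metric name `trivy_image_vulnerabilities` with its `severity`, `namespace`, and `image_repository` labels is an assumption about what the operator exports.

```yaml
# Hypothetical PrometheusRule; names, threshold, and metric labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-critical-cves               # hypothetical name
  namespace: security
  labels:
    release: prometheus                   # matches the ServiceMonitor label above
spec:
  groups:
    - name: trivy.rules
      rules:
        - alert: TrivyCriticalVulnerabilities   # hypothetical alert name
          # Fire when any image in a namespace reports at least one critical CVE.
          expr: sum by (namespace, image_repository) (trivy_image_vulnerabilities{severity="Critical"}) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Critical CVEs in {{ $labels.image_repository }} ({{ $labels.namespace }})"
```

Aggregating by namespace and image repository keeps one alert per affected image rather than one per workload that runs it.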