docs(adr): add security ADRs for Gatekeeper, Falco, and Trivy

- ADR-0040: OPA Gatekeeper policy framework (constraint templates,
  progressive enforcement, warn-first strategy)
- ADR-0041: Falco runtime threat detection (modern eBPF on Talos,
  Falcosidekick → Alertmanager integration)
- ADR-0042: Trivy Operator vulnerability scanning (5 scanners enabled,
  ARM64 scan job scheduling, Talos adaptations)
- Update ADR-0018: mark Falco as implemented, link to detailed ADRs
- Update README: add 0040-0042 to ADR table, update badge counts

2026-02-09 18:20:13 -05:00

Trivy Operator Vulnerability Scanning

Status: accepted
Date: 2026-02-09
Deciders: Billy
Technical Story: Continuously scan cluster workloads for vulnerabilities, misconfigurations, RBAC issues, and exposed secrets

Context and Problem Statement

Container images accumulate vulnerabilities over time as new CVEs are disclosed. Without continuous scanning, the cluster's security posture degrades silently between deployments. Additionally, Kubernetes resource misconfigurations and overly permissive RBAC can create attack surfaces.

How do we maintain continuous visibility into the security posture of all cluster workloads, including images that haven't been rebuilt recently?

Decision Drivers

- Continuous scanning, not only at build time
- Coverage of multiple security dimensions (CVEs, misconfigurations, RBAC, secrets)
- Results stored as Kubernetes CRDs for GitOps-friendly querying
- Prometheus metrics for alerting and Grafana dashboards
- Must work on Talos Linux and heterogeneous (amd64 + arm64) clusters

Decision Outcome

Deploy Trivy Operator in standalone mode with all applicable scanners enabled, explicitly disabling infra assessment for Talos compatibility, and scheduling scan jobs preferentially on ARM64 nodes to offload work from GPU nodes.

Deployment Configuration


Setting	Value
Chart	`trivy-operator` from `https://aquasecurity.github.io/helm-charts`
Namespace	`security`
Mode	Standalone (embedded database, no external Trivy server)
Severity filter	All levels: UNKNOWN, LOW, MEDIUM, HIGH, CRITICAL
Ignore unfixed	`false` (reports all vulnerabilities, even without patches)
Scan timeout	10 minutes
Concurrent scan jobs	10
Slow mode	`true` (reduces resource usage at cost of scan speed)
Compliance cron	`0 */6 * * *` (every 6 hours)

Scan Job Resources

CPU Request/Limit	Memory Request/Limit
100m / 500m	100M / 500M
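
As a rough illustration, the settings in the two tables above might map onto Helm values along the following lines. The key names are assumed to follow the upstream `trivy-operator` chart layout and can differ between chart versions, so they should be checked against the `values.yaml` of the chart version in use. The cron string assumes the standard five-field form of "every 6 hours".

```yaml
# Sketch only: key names are assumed from the upstream chart; verify per version.
operator:
  scanJobTimeout: 10m            # scan timeout
  scanJobsConcurrentLimit: 10    # concurrent scan jobs

trivy:
  mode: Standalone               # no external Trivy server
  severity: UNKNOWN,LOW,MEDIUM,HIGH,CRITICAL
  ignoreUnfixed: false           # also report CVEs that have no fix yet
  slow: true                     # lower resource usage, slower scans
  resources:                     # applied to the scan job containers
    requests:
      cpu: 100m
      memory: 100M
    limits:
      cpu: 500m
      memory: 500M

compliance:
  cron: "0 */6 * * *"            # every 6 hours
```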

Scanners

Scanner	Status	Purpose
Vulnerability	Enabled	CVE scanning of container images
Config Audit	Enabled	Kubernetes resource misconfiguration checks
RBAC Assessment	Enabled	Overly permissive RBAC analysis
Exposed Secrets	Enabled	Detect secrets leaked in image layers/env vars
Cluster Compliance	Enabled	CIS benchmark compliance reports
Infra Assessment	Disabled	Requires `/etc/systemd` paths — incompatible with Talos
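
The scanner states in the table above could be expressed as chart values roughly as follows. This is a sketch: the toggle names are assumed from the upstream chart's `operator` section and may not match every chart version.

```yaml
# Sketch only: toggle names assumed from the upstream chart.
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true
  infraAssessmentScannerEnabled: false   # needs /etc/systemd paths, absent on Talos
```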

Report CRDs

Trivy stores all scan results as Kubernetes custom resources:

CRD	Content
`VulnerabilityReport`	CVEs per container image with severity, fix version
`ConfigAuditReport`	Kubernetes misconfiguration findings
`RbacAssessmentReport`	RBAC privilege escalation risks
`ExposedSecretReport`	Secrets found in environment variables or image layers
`ClusterComplianceReport`	CIS benchmark compliance status
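
For orientation, a trimmed `VulnerabilityReport` might look roughly like the sketch below. The field names reflect the `v1alpha1` schema as assumed here, real reports carry more fields, and every name and value shown is a placeholder rather than output from this cluster.

```yaml
# Illustrative shape only; all names and values are placeholders.
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: replicaset-example-app-example   # hypothetical report name
  namespace: example                     # hypothetical namespace
report:
  artifact:
    repository: example/app
    tag: "1.0.0"
  summary:
    criticalCount: 1
    highCount: 0
    mediumCount: 0
    lowCount: 0
    unknownCount: 0
  vulnerabilities:
    - vulnerabilityID: CVE-0000-00000    # placeholder ID
      resource: example-lib
      installedVersion: 1.2.3
      fixedVersion: 1.2.4
      severity: CRITICAL
```

Because the reports are ordinary custom resources, they can be listed and filtered with the same tooling used for any other Kubernetes object.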

Talos Linux Adaptations

Challenge	Solution
No `/etc/systemd` paths	Infra assessment scanner disabled
No standard `/var/lib/kubelet`	`nodeCollector` volumes and volumeMounts set to empty `[]`
Immutable root filesystem	Standalone mode — database cached in operator pod

Talos is inherently security-hardened (immutable OS, no SSH, API-only management), making the infra assessment scanner redundant for the OS layer.
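
A minimal sketch of the Talos-specific overrides described above, assuming the chart exposes `nodeCollector.volumes` and `nodeCollector.volumeMounts` as list values (key names should be confirmed against the chart version in use):

```yaml
# Sketch only: key names assumed from the upstream chart.
operator:
  infraAssessmentScannerEnabled: false   # no /etc/systemd on Talos

nodeCollector:
  volumes: []        # drop the default hostPath volumes
  volumeMounts: []   # and their corresponding mounts
```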

Scan Job Scheduling

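Scan jobs use a preferred node affinity (weight 100) for ARM64 nodes and tolerate the `node.kubernetes.io/arch=arm64:NoSchedule` taint. Because the affinity is a preference rather than a requirement, jobs can still be scheduled on amd64 nodes when no ARM64 capacity is available: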

tolerations:
  - key: node.kubernetes.io/arch
    value: arm64
    effect: NoSchedule

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: [arm64]

This offloads scan work from GPU nodes (amd64) to ARM64 edge workers, avoiding interference with AI/ML inference workloads.

Observability

Prometheus Metrics

All four metric categories are enabled through the corresponding chart settings:

Setting	Purpose
`metricsFindingsEnabled`	Vulnerability counts by severity
`metricsConfigAuditInfo`	Misconfiguration findings
`metricsRbacAssessmentInfo`	RBAC assessment results
`metricsClusterComplianceInfo`	Compliance benchmark status

ServiceMonitor: Enabled with label `release: prometheus`.
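
In Helm values, the metrics and ServiceMonitor configuration might look like the sketch below. It assumes the metric toggles live under the chart's `operator` section and that `serviceMonitor.labels` is the key used to attach the `release: prometheus` label; both should be verified against the chart.

```yaml
# Sketch only: key placement assumed from the upstream chart.
operator:
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
  metricsRbacAssessmentInfo: true
  metricsClusterComplianceInfo: true

serviceMonitor:
  enabled: true
  labels:
    release: prometheus   # lets the Prometheus instance select this ServiceMonitor
```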

Grafana Dashboards

Dashboard	Grafana ID	Purpose
Trivy Operator Vulnerabilities	#17813	CVE overview by severity, namespace, image
Trivy Image Scan	#16337	Detailed per-image scan results

Alerting

Vulnerability metrics feed into Prometheus alerting rules defined in kube-prometheus-stack (see ADR-0039):

Trivy scans → VulnerabilityReport CRDs → Prometheus metrics
  → AlertRules (critical CVEs) → Alertmanager → ntfy → Discord
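
The "AlertRules (critical CVEs)" step could be implemented with a `PrometheusRule` along these lines. This is a hypothetical sketch, not the rule defined in ADR-0039: the rule name, group name, `for` duration, and threshold are invented for illustration, and the metric name `trivy_image_vulnerabilities` with its `severity`, `image_repository`, and `image_tag` labels is assumed from the operator's exported metrics.

```yaml
# Hypothetical rule; names, threshold, and duration are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-critical-cves
  namespace: security
  labels:
    release: prometheus          # assumed to match the Prometheus rule selector
spec:
  groups:
    - name: trivy.rules
      rules:
        - alert: TrivyCriticalVulnerabilities
          # Fires when any scanned image reports at least one critical CVE.
          expr: |
            sum by (namespace, image_repository, image_tag) (
              trivy_image_vulnerabilities{severity="Critical"}
            ) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Critical CVEs in {{ $labels.image_repository }}:{{ $labels.image_tag }}"
```

Aggregating by image rather than by workload keeps one alert per affected image, which tends to reduce noise when the same image runs in several pods.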
