# Security Policy Enforcement

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster

## Context and Problem Statement

A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?

## Decision Drivers

* Defense in depth - multiple layers of security controls
* Visibility - understand security posture across all workloads
* Progressive enforcement - warn before blocking to avoid disruption
* Automation - minimize manual security auditing
* Talos compatibility - policies must work with immutable OS constraints

## Considered Options

1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
2. **Kyverno for policy + Trivy for scanning**
3. **Pod Security Standards (PSS) only**
4. **No enforcement, manual auditing**

## Decision Outcome

Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**

Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.

### Positive Consequences

* Policies are defined as code and version-controlled
* Violations are visible in Grafana dashboards
* Trivy provides continuous vulnerability scanning without CI/CD integration
* Gatekeeper's warn mode allows gradual policy rollout
* Both tools provide Prometheus metrics for alerting

### Negative Consequences

* Rego learning curve for custom policies
* Must maintain exclusion lists for system namespaces
* Trivy node-collector is disabled on Talos (lacks systemd paths)

## Pros and Cons of the Options

### Option 1: Gatekeeper + Trivy (Chosen)

**Architecture:**
```
                   ┌─────────────────┐
                   │   Gatekeeper    │
                   │   (Admission)   │
                   └────────┬────────┘
                            │ Validates
                            ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   kubectl   │───►│   API Server    │───►│  Workloads  │
│   Flux      │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                   ┌─────────────────┐           │
                   │ Trivy Operator  │◄──────────┘
                   └────────┬────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │  Vulnerability  │
                   │ Reports (CRDs)  │
                   └─────────────────┘
```

* Good, because Gatekeeper builds on OPA, a CNCF graduated project, and is widely adopted
* Good, because Rego allows complex policy logic
* Good, because Trivy scans images, configs, RBAC, and secrets
* Good, because both provide Prometheus metrics and Grafana dashboards
* Bad, because Rego has a learning curve
* Bad, because Trivy node-collector is incompatible with Talos

### Option 2: Kyverno + Trivy

* Good, because Kyverno policies are YAML-based (easier to write)
* Good, because Kyverno can mutate resources (auto-fix)
* Bad, because Kyverno is less mature than Gatekeeper
* Bad, because mutation can cause unexpected behavior

### Option 3: Pod Security Standards Only

* Good, because built into Kubernetes (no additional components)
* Good, because simple namespace-level enforcement
* Bad, because limited to pod security only
* Bad, because no vulnerability scanning
* Bad, because no custom policy support

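For comparison, the namespace-level enforcement that Option 3 relies on is configured with Pod Security Admission labels. The fragment below is a minimal sketch; the namespace name `example-apps` and the chosen levels are illustrative assumptions, not part of this cluster's configuration.

```yaml
# Sketch only: namespace name and levels are hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: example-apps
  labels:
    # Reject pods that violate the "baseline" profile
    pod-security.kubernetes.io/enforce: baseline
    # Warn and audit against the stricter "restricted" profile without blocking
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

These labels only cover the pod-level checks defined by the Pod Security Standards, which is why this option offers no custom policies and no vulnerability scanning.
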
### Option 4: No Enforcement

* Good, because no operational overhead
* Bad, because no protection against misconfigurations
* Bad, because no visibility into security posture
* Bad, because bad practice even for homelabs

## Implementation Details

### Gatekeeper Policies

**Constraint Templates (Rego-based):**

| Template | Purpose |
|----------|---------|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require app.kubernetes.io labels |
| `K8sContainerLimits` | Require resource limits |

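To illustrate how a template and its constraint fit together, the following sketch follows the well-known required-labels example from the Gatekeeper documentation. It is not necessarily the template deployed in this cluster; the constraint name, the matched kinds, and the specific label are assumptions made for illustration.

```yaml
# Sketch only: modeled on the Gatekeeper docs example, not this cluster's manifest.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Report a violation when any required label is missing from the object
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-name-label   # hypothetical name
spec:
  enforcementAction: warn
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]    # assumption: which kinds to match
  parameters:
    labels: ["app.kubernetes.io/name"]
```

The template defines the reusable Rego logic and the parameter schema; each constraint then supplies concrete parameters, a match scope, and an enforcement action.
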
**Constraints (Policy Instances):**

```yaml
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```

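**Enforcement Progression:**
1. `warn` - Admit the resource, return a warning to the client, and record the violation (initial rollout)
2. `dryrun` - Audit mode: violations are recorded in audit results only, with no client-facing message
3. `deny` - Reject non-compliant resources at admission
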
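Moving a policy to the final stage should only require changing `enforcementAction` on the existing constraint. A minimal sketch of that end state, assuming the same constraint as above with the exclusion list abbreviated for brevity:

```yaml
# Sketch only: the same constraint after promotion to blocking mode.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: deny  # was: warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      # remaining exclusions unchanged from the warn-mode constraint
```

Because the match scope and parameters stay identical, the violations already reported in the earlier stages are exactly the resources that would be rejected after the switch.
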
### Trivy Operator Configuration

```yaml
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true

  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false

  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
```

**Scan Reports (CRDs):**
- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets exposed in container images

### Grafana Dashboards

| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |

### Namespace Exclusions

System namespaces excluded from strict policies:

| Namespace | Reason |
|-----------|--------|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |

### Talos-Specific Considerations

Trivy's `node-collector` is disabled because Talos:
- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)

This is acceptable because Talos itself is security-hardened by design.

## Alerting Strategy

**Prometheus Alerts:**
```yaml
- alert: HighSeverityVulnerability
  expr: trivy_vulnerability_id{severity="CRITICAL"} > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"

- alert: GatekeeperViolation
  expr: increase(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
```

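If the cluster runs the Prometheus Operator, rules like these are typically delivered as a `PrometheusRule` resource. The sketch below wraps one of the alerts above; the resource name, the group name, the namespace, and the `release` label are all assumptions, and the label in particular must match whatever `ruleSelector` the Prometheus instance actually uses.

```yaml
# Sketch only: assumes the Prometheus Operator CRDs are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts        # hypothetical name
  namespace: observability            # assumption: where Prometheus runs
  labels:
    release: kube-prometheus-stack    # assumption: must match the ruleSelector
spec:
  groups:
    - name: security-policy           # hypothetical group name
      rules:
        - alert: GatekeeperViolation
          expr: increase(gatekeeper_violations[1h]) > 0
          for: 5m
          labels:
            severity: info
          annotations:
            summary: "Policy violation detected"
```
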
## Future Enhancements

1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. ✅ **Falco integrated**: see [ADR-0041](0041-falco-runtime-threat-detection.md) for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility

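As a starting point for the network policy item, a default-deny ingress policy per namespace is a common first step. The fragment below is a minimal sketch using the standard Kubernetes `NetworkPolicy` API, which Cilium enforces; the namespace name is hypothetical and no such policy is part of the current decision.

```yaml
# Sketch only: namespace is hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: example-apps
spec:
  podSelector: {}      # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all inbound traffic is denied
```

Explicit allow policies would then be layered on top for each workload that needs inbound traffic.
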
## Detailed Component ADRs

| Component | ADR | Purpose |
|-----------|-----|---------|
| Gatekeeper | [ADR-0040](0040-opa-gatekeeper-policy-framework.md) | Policy templates, constraints, enforcement progression |
| Falco | [ADR-0041](0041-falco-runtime-threat-detection.md) | Runtime threat detection, eBPF driver, Falcosidekick |
| Trivy Operator | [ADR-0042](0042-trivy-operator-vulnerability-scanning.md) | Vulnerability scanning, compliance reports, Talos adaptations |

## References

* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)