homelab-design/decisions/0018-security-policy-enforcement.md
Billy D. a128c265e4 docs: Add ADRs for secrets management and security policy
- 0017: Secrets Management Strategy (SOPS + Vault + External Secrets)
- 0018: Security Policy Enforcement (Gatekeeper + Trivy)
2026-02-04 08:45:47 -05:00


# Security Policy Enforcement
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster
## Context and Problem Statement
A Kubernetes cluster without security policies is exposed to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab, security best practices guard against accidental misconfigurations and offer hands-on practice with production-grade security.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?
## Decision Drivers
* Defense in depth - multiple layers of security controls
* Visibility - understand security posture across all workloads
* Progressive enforcement - warn before blocking to avoid disruption
* Automation - minimize manual security auditing
* Talos compatibility - policies must work with immutable OS constraints
## Considered Options
1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
2. **Kyverno for policy + Trivy for scanning**
3. **Pod Security Standards (PSS) only**
4. **No enforcement, manual auditing**
## Decision Outcome
Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**
Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.
### Positive Consequences
* Policies are defined as code and version-controlled
* Violations are visible in Grafana dashboards
* Trivy provides continuous vulnerability scanning without CI/CD integration
* Gatekeeper's warn mode allows gradual policy rollout
* Both tools provide Prometheus metrics for alerting
### Negative Consequences
* Rego learning curve for custom policies
* Must maintain exclusion lists for system namespaces
* Trivy node-collector disabled on Talos (lacks systemd paths)
## Pros and Cons of the Options
### Option 1: Gatekeeper + Trivy (Chosen)
**Architecture:**
```
                   ┌─────────────────┐
                   │   Gatekeeper    │
                   │   (Admission)   │
                   └────────┬────────┘
                            │ Validates
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   kubectl   │───►│   API Server    │───►│  Workloads  │
│    Flux     │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                   ┌─────────────────┐           │
                   │ Trivy Operator  │◄──────────┘
                   └────────┬────────┘
                   ┌─────────────────┐
                   │  Vulnerability  │
                   │ Reports (CRDs)  │
                   └─────────────────┘
```
* Good, because Gatekeeper is widely adopted and builds on Open Policy Agent (OPA), a CNCF graduated project
* Good, because Rego allows complex policy logic
* Good, because Trivy scans images, configs, RBAC, and secrets
* Good, because both provide Prometheus metrics and Grafana dashboards
* Bad, because Rego has a learning curve
* Bad, because Trivy's node-collector is incompatible with Talos
### Option 2: Kyverno + Trivy
* Good, because Kyverno policies are YAML-based (easier to write)
* Good, because Kyverno can mutate resources (auto-fix)
* Bad, because Kyverno is less mature than Gatekeeper
* Bad, because mutation can cause unexpected behavior
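
For comparison, a privileged-container rule in Kyverno's YAML style might look roughly like the sketch below. It is modeled on the upstream Kyverno sample policies and is not part of this repository. The policy name and the `Audit` setting are illustrative assumptions, and newer Kyverno releases may prefer a per-rule failure action field over `validationFailureAction`.

```yaml
# Hypothetical Kyverno equivalent of the privileged-container check.
# Shown only to illustrate the YAML-based authoring style of Option 2.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers   # illustrative name
spec:
  validationFailureAction: Audit         # report only, do not block
  background: true
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged mode is disallowed."
        pattern:
          spec:
            containers:
              # If securityContext.privileged is set, it must be false.
              - =(securityContext):
                  =(privileged): "false"
```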
### Option 3: Pod Security Standards Only
* Good, because built into Kubernetes (no additional components)
* Good, because simple namespace-level enforcement
* Bad, because limited to pod security only
* Bad, because no vulnerability scanning
* Bad, because no custom policy support
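
As a rough illustration of the namespace-level enforcement this option offers, Pod Security Admission is driven by labels on the namespace. The namespace name and the chosen levels below are assumptions for the example, not settings from this cluster.

```yaml
# Sketch: Pod Security Standards applied through namespace labels.
apiVersion: v1
kind: Namespace
metadata:
  name: example-app                                  # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline     # reject pods that violate "baseline"
    pod-security.kubernetes.io/warn: restricted      # warn the client on "restricted" violations
    pod-security.kubernetes.io/audit: restricted     # record "restricted" violations in the audit log
```

This covers pod-level settings only, which is why the option was ruled out for custom policies and vulnerability scanning.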
### Option 4: No Enforcement
* Good, because no operational overhead
* Bad, because no protection against misconfigurations
* Bad, because no visibility into security posture
* Bad, because bad practice even for homelabs
## Implementation Details
### Gatekeeper Policies
**Constraint Templates (Rego-based):**
| Template | Purpose |
|----------|---------|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require app.kubernetes.io labels |
| `K8sContainerLimits` | Require resource limits |
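
To show what a Rego-based template involves, here is a minimal sketch of a required-labels template, closely following the example in the upstream Gatekeeper documentation. The template actually deployed in this cluster may differ; in particular, the `labels` parameter schema (a plain list of strings) is an assumption.

```yaml
# Sketch of a ConstraintTemplate; not necessarily the deployed version.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:              # assumed parameter: list of required label keys
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # A violation is produced when any required label is missing.
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```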
**Constraints (Policy Instances):**
```yaml
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```
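
For a sense of what the constraint above catches, the following hypothetical Pod would match it: it runs in a namespace that is not excluded, uses an image that is not exempt, and sets `privileged: true`. The Pod name, namespace, and image are illustrative only. With `enforcementAction: warn` the Pod should still be admitted, with a warning returned to the client.

```yaml
# Hypothetical test workload that violates deny-privileged-containers.
apiVersion: v1
kind: Pod
metadata:
  name: privileged-demo        # illustrative name
  namespace: default           # not in excludedNamespaces
spec:
  containers:
    - name: shell
      image: busybox:1.36      # not covered by exemptImages
      command: ["sleep", "3600"]
      securityContext:
        privileged: true       # the field the constraint checks
```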
**Enforcement Progression:**
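1. `warn` - Admit the resource but return a warning to the client; nothing is blocked (initial rollout)
2. `dryrun` - Audit only; violations are recorded in audit results and reports without any client-facing message
3. `deny` - Reject non-compliant resources at admission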
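
The enforcement stage is set per constraint, so each policy can move through the stages independently. As a sketch, a required-labels constraint held in `dryrun` could look like this. The constraint name, the label key, and the `labels` parameter shape are assumptions, since the parameter schema depends on the `K8sRequiredLabels` template in use.

```yaml
# Sketch: one constraint pinned to a single enforcement stage.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-name-label            # illustrative name
spec:
  enforcementAction: dryrun               # change to warn or deny to move stages
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels:                               # assumed schema: list of label keys
      - "app.kubernetes.io/name"
```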
### Trivy Operator Configuration
```yaml
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true
  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false
  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
```
**Scan Reports (CRDs):**
- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets exposed in container images
### Grafana Dashboards
| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |
### Namespace Exclusions
System namespaces excluded from strict policies:
| Namespace | Reason |
|-----------|--------|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |
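
Besides listing these namespaces under `excludedNamespaces` in every constraint, Gatekeeper also supports a cluster-wide exemption through its `Config` resource. The sketch below is an alternative worth considering, not the configuration currently in use; the choice of `processes: ["*"]` is an assumption.

```yaml
# Sketch: exempt namespaces once, for all Gatekeeper processes.
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - longhorn-system
        - gpu-operator
        - cilium-secrets
        - observability
      processes: ["*"]        # applies to audit, webhook, and sync alike
```

Per-constraint exclusions remain more granular, since a namespace can be exempt from one policy and still subject to another.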
### Talos-Specific Considerations
Trivy's `node-collector` is disabled because Talos:
- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)
This is acceptable because Talos itself is security-hardened by design.
## Alerting Strategy
**Prometheus Alerts:**
```yaml
- alert: HighSeverityVulnerability
  expr: trivy_vulnerability_id{severity="CRITICAL"} > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"
- alert: GatekeeperViolation
  expr: increase(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
```
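
If the cluster runs the Prometheus Operator (an assumption; this ADR does not say how Prometheus is deployed), rules like these are typically delivered as a `PrometheusRule` resource. The resource name and group name below are hypothetical, and any labels required by the Prometheus rule selector are deployment-specific and omitted.

```yaml
# Sketch: wrapping an alert rule in a PrometheusRule (Prometheus Operator CRD).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts      # hypothetical name
  namespace: observability
spec:
  groups:
    - name: security-policy         # hypothetical group name
      rules:
        - alert: GatekeeperViolation
          expr: increase(gatekeeper_violations[1h]) > 0
          for: 5m
          labels:
            severity: info
          annotations:
            summary: "Policy violation detected"
```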
## Future Enhancements
1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. **Integrate Falco** for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility
## References
* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)