
# Security Policy Enforcement

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster

## Context and Problem Statement

A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?

## Decision Drivers

* Defense in depth - multiple layers of security controls
* Visibility - understand security posture across all workloads
* Progressive enforcement - warn before blocking to avoid disruption
* Automation - minimize manual security auditing
* Talos compatibility - policies must work with immutable OS constraints

## Considered Options

1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
2. **Kyverno for policy + Trivy for scanning**
3. **Pod Security Standards (PSS) only**
4. **No enforcement, manual auditing**

## Decision Outcome

Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**

Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.

### Positive Consequences

* Policies are defined as code and version-controlled
* Violations are visible in Grafana dashboards
* Trivy provides continuous vulnerability scanning without CI/CD integration
* Gatekeeper's warn mode allows gradual policy rollout
* Both tools provide Prometheus metrics for alerting

### Negative Consequences

* Rego learning curve for custom policies
* Must maintain exclusion lists for system namespaces
* Trivy node-collector disabled on Talos (lacks systemd paths)

## Pros and Cons of the Options

### Option 1: Gatekeeper + Trivy (Chosen)

**Architecture:**
```
                    ┌─────────────────┐
                    │  Gatekeeper     │
                    │  (Admission)    │
                    └────────┬────────┘
                             │ Validates
                             ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│ kubectl     │───►│  API Server     │───►│ Workloads   │
│ Flux        │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                    ┌─────────────────┐          │
                    │ Trivy Operator  │◄─────────┘
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ Vulnerability   │
                    │ Reports (CRDs)  │
                    └─────────────────┘
```

* Good, because Gatekeeper is part of Open Policy Agent (a CNCF graduated project) and is widely adopted
* Good, because Rego allows complex policy logic
* Good, because Trivy scans images, configs, RBAC, and secrets
* Good, because both provide Prometheus metrics and Grafana dashboards
* Bad, because Rego has a learning curve
* Bad, because Trivy's node-collector is incompatible with Talos

### Option 2: Kyverno + Trivy

* Good, because Kyverno policies are YAML-based (easier to write)
* Good, because Kyverno can mutate resources (auto-fix)
* Bad, because Kyverno is less mature than Gatekeeper
* Bad, because mutation can cause unexpected behavior

### Option 3: Pod Security Standards Only

* Good, because built into Kubernetes (no additional components)
* Good, because simple namespace-level enforcement
* Bad, because limited to pod security only
* Bad, because no vulnerability scanning
* Bad, because no custom policy support
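
For comparison, namespace-level enforcement under this option is just a set of labels handled by the built-in Pod Security admission controller. A minimal sketch, assuming a hypothetical `example-apps` namespace and an illustrative choice of levels:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-apps  # hypothetical namespace, not part of this cluster's design
  labels:
    # Reject pods that violate the baseline profile
    pod-security.kubernetes.io/enforce: baseline
    # Allow, but warn the client and annotate audit events, for restricted-profile violations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The three modes can be mixed per namespace, but the profiles themselves are fixed, which is why this option cannot express rules such as required labels or resource limits.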

### Option 4: No Enforcement

* Good, because no operational overhead
* Bad, because no protection against misconfigurations
* Bad, because no visibility into security posture
* Bad, because it is poor practice even for a homelab

## Implementation Details

### Gatekeeper Policies

**Constraint Templates (Rego-based):**

| Template | Purpose |
|----------|---------|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require app.kubernetes.io labels |
| `K8sContainerLimits` | Require resource limits |
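
To illustrate what a Rego-based template looks like, the sketch below follows the required-labels example from the upstream Gatekeeper documentation. It is an assumption that the `K8sRequiredLabels` template deployed in this cluster has the same shape; the parameter schema (`labels` as a list of strings) comes from that upstream example, not from this ADR.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels  # must be the lowercase form of the kind below
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Schema for the constraint's `parameters` field
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Emit a violation when any required label is missing from the object
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

A constraint of kind `K8sRequiredLabels` would then pass the label names (for example `app.kubernetes.io/name`) through `parameters.labels`.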

**Constraints (Policy Instances):**

```yaml
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```

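**Enforcement Progression (least to most strict):**
1. `dryrun` - Violations are recorded by the audit controller only; nothing is shown to the client
2. `warn` - The request is admitted, but the client receives a warning (the setting used for the initial rollout)
3. `deny` - Non-compliant resources are rejected at admission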
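
Each constraint carries its own `enforcementAction`, so a newly introduced policy can sit at an earlier stage than existing ones. A minimal sketch for the resource limits policy, assuming the `K8sContainerLimits` template follows the gatekeeper-library schema (`cpu` and `memory` as maximum allowed limits); the constraint name and limit values are hypothetical:

```yaml
# Hypothetical constraint: audit-only rollout of the resource limits policy
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits  # hypothetical name
spec:
  enforcementAction: dryrun  # record violations in audit results without affecting admission
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
  parameters:
    cpu: "2"       # assumed maximum CPU limit per container
    memory: "4Gi"  # assumed maximum memory limit per container
```

Promoting the policy is then a one-line, version-controlled change to `enforcementAction`.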

### Trivy Operator Configuration

```yaml
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true

  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false

  # Metrics for Prometheus (these keys sit under `operator` in the Helm chart values)
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
```

**Scan Reports (CRDs):**
- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets (keys, tokens, credentials) found in container images

### Grafana Dashboards

| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |

### Namespace Exclusions

System namespaces excluded from strict policies:

| Namespace | Reason |
|-----------|--------|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |
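
Repeating this list in every constraint is one way to maintain the exclusions. Gatekeeper also offers a cluster-wide `Config` resource that exempts namespaces from all constraints at once. The sketch below shows that alternative using the namespaces from the table; whether this cluster uses it is not stated here, so treat it as an option rather than the deployed configuration.

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config  # Gatekeeper only reads the Config object with this name
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - longhorn-system
        - gpu-operator
        - cilium-secrets
        - observability
      processes: ["*"]  # exempt from audit, admission webhook, and sync
```

The trade-off is granularity: a global exemption removes these namespaces from every policy, whereas per-constraint `excludedNamespaces` can still hold them to rules such as required labels.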

### Talos-Specific Considerations

Trivy's `node-collector` is disabled because Talos:
- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)

This is acceptable because Talos itself is security-hardened by design.

## Alerting Strategy

**Prometheus Alerts:**
```yaml
- alert: HighSeverityVulnerability
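  # Per-image findings metric (enabled by metricsFindingsEnabled); severity label values are capitalized
  expr: trivy_image_vulnerabilities{severity="Critical"} > 0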
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"

- alert: GatekeeperViolation
  # gatekeeper_violations is a gauge, so use delta() rather than increase()
  expr: delta(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
```
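
If Prometheus is managed by the Prometheus Operator (an assumption; this ADR does not say how rules are loaded), alerts like these are packaged in a `PrometheusRule` resource. The sketch below shows the wrapper with one additional rule that becomes relevant once constraints move to `deny`. The resource name, group name, and namespace placement are hypothetical, and the Prometheus instance's rule selector may require extra labels.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts  # hypothetical name
  namespace: observability      # assumed to be where Prometheus runs
spec:
  groups:
    - name: security-policy     # hypothetical group name
      rules:
        - alert: GatekeeperDenyViolation
          # Audit still reports existing resources that violate a deny-mode constraint
          expr: gatekeeper_violations{enforcement_action="deny"} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Existing resources violate an enforced Gatekeeper policy"
```

Filtering on the `enforcement_action` label keeps `warn` and `dryrun` findings informational while escalating only violations of enforced policies.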

## Future Enhancements

1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. **Integrate Falco** for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility

## References

* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)