docs: Add ADRs for secrets management and security policy
- 0017: Secrets Management Strategy (SOPS + Vault + External Secrets)
- 0018: Security Policy Enforcement (Gatekeeper + Trivy)
239
decisions/0018-security-policy-enforcement.md
Normal file
@@ -0,0 +1,239 @@
# Security Policy Enforcement

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Implement security guardrails and vulnerability scanning for the homelab cluster

## Context and Problem Statement

A Kubernetes cluster without security policies is vulnerable to misconfigurations, privilege escalation, and unpatched vulnerabilities. Even in a homelab environment, security best practices protect against accidental misconfigurations and provide learning opportunities for production-grade security.

How do we enforce security policies and maintain visibility into vulnerabilities without creating excessive operational friction?

## Decision Drivers

* Defense in depth - multiple layers of security controls
* Visibility - understand security posture across all workloads
* Progressive enforcement - warn before blocking to avoid disruption
* Automation - minimize manual security auditing
* Talos compatibility - policies must work with immutable OS constraints

## Considered Options

1. **Gatekeeper (OPA) for policy + Trivy Operator for scanning**
2. **Kyverno for policy + Trivy for scanning**
3. **Pod Security Standards (PSS) only**
4. **No enforcement, manual auditing**

## Decision Outcome

Chosen option: **Option 1 - Gatekeeper (OPA) for policy enforcement + Trivy Operator for vulnerability scanning**

Gatekeeper provides flexible policy-as-code using Rego, while Trivy Operator continuously scans for vulnerabilities, misconfigurations, and exposed secrets. Both integrate with Prometheus for alerting.

### Positive Consequences

* Policies are defined as code and version-controlled
* Violations are visible in Grafana dashboards
* Trivy provides continuous vulnerability scanning without CI/CD integration
* Gatekeeper's warn mode allows gradual policy rollout
* Both tools provide Prometheus metrics for alerting

### Negative Consequences

* Rego learning curve for custom policies
* Must maintain exclusion lists for system namespaces
* Trivy node-collector must be disabled on Talos (it expects systemd and kubelet paths that Talos lacks)

## Pros and Cons of the Options

### Option 1: Gatekeeper + Trivy (Chosen)

**Architecture:**
```
                   ┌─────────────────┐
                   │   Gatekeeper    │
                   │   (Admission)   │
                   └────────┬────────┘
                            │ Validates
                            ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   kubectl   │───►│   API Server    │───►│  Workloads  │
│    Flux     │    └─────────────────┘    └──────┬──────┘
└─────────────┘                                  │
                                                 │ Scans
                   ┌─────────────────┐           │
                   │ Trivy Operator  │◄──────────┘
                   └────────┬────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │  Vulnerability  │
                   │ Reports (CRDs)  │
                   └─────────────────┘
```

* Good, because Gatekeeper is part of Open Policy Agent, a CNCF graduated project, and is widely adopted
* Good, because Rego allows complex policy logic
* Good, because Trivy scans images, configs, RBAC, and secrets
* Good, because both provide Prometheus metrics and Grafana dashboards
* Bad, because Rego has a learning curve
* Bad, because Trivy node-collector is incompatible with Talos

### Option 2: Kyverno + Trivy

* Good, because Kyverno policies are YAML-based (easier to write)
* Good, because Kyverno can mutate resources (auto-fix)
* Bad, because Kyverno is less mature than Gatekeeper
* Bad, because mutation can cause unexpected behavior

### Option 3: Pod Security Standards Only

* Good, because built into Kubernetes (no additional components)
* Good, because simple namespace-level enforcement
* Bad, because limited to pod security only
* Bad, because no vulnerability scanning
* Bad, because no custom policy support
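
For comparison, namespace-level enforcement under this option comes down to a few labels on the namespace. The following is a minimal sketch, assuming a hypothetical namespace called `example-apps`; the chosen levels are illustrative and not taken from this cluster.

```yaml
# Hypothetical namespace; only the label keys are standard Kubernetes.
apiVersion: v1
kind: Namespace
metadata:
  name: example-apps
  labels:
    # Reject pods that violate the "baseline" profile
    pod-security.kubernetes.io/enforce: baseline
    # Return a warning for pods that would violate "restricted"
    pod-security.kubernetes.io/warn: restricted
```

This shows both the appeal (no extra components) and the limit noted above: the labels select one of three fixed profiles and cannot express rules such as required labels or resource limits.
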

### Option 4: No Enforcement

* Good, because no operational overhead
* Bad, because no protection against misconfigurations
* Bad, because no visibility into security posture
* Bad, because it is poor practice even for a homelab

## Implementation Details

### Gatekeeper Policies

**Constraint Templates (Rego-based):**

| Template | Purpose |
|----------|---------|
| `K8sPSPPrivilegedContainer` | Block privileged containers |
| `K8sRequiredLabels` | Require app.kubernetes.io labels |
| `K8sContainerLimits` | Require resource limits |

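
To make the template/constraint split concrete, here is a simplified sketch of what a `K8sRequiredLabels` template can look like. It is modeled on the required-labels example in the Gatekeeper documentation and is not necessarily the template deployed in this cluster; in particular, the `labels` parameter as a plain list of strings is an assumption (the gatekeeper-library version uses a richer shape).

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels   # Kind used by constraints of this template
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:               # Assumed parameter shape: list of label keys
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Report a violation when any required label key is missing.
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

The template only defines the rule and its parameters; nothing is enforced until a constraint of kind `K8sRequiredLabels` supplies the label list and a match scope.
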

**Constraints (Policy Instances):**

```yaml
# Deny privileged containers (warn mode)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  enforcementAction: warn  # Start with warn, move to deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - cilium-secrets
      - longhorn-system
      - observability
      - gpu-operator
  parameters:
    exemptImages:
      - "quay.io/cilium/*"
      - "ghcr.io/longhorn/*"
      - "nvcr.io/nvidia/*"
```

**Enforcement Progression:**

1. `warn` - Admit the resource, but return a warning to the client and record the violation (initial rollout)
2. `dryrun` - Admit the resource silently; violations appear only in audit results
3. `deny` - Reject non-compliant resources at admission
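
As an illustration of running a policy in an observe-only mode before tightening it, the sketch below sets a `K8sContainerLimits` constraint to `dryrun`. It assumes the template follows the gatekeeper-library parameter shape (`cpu` and `memory` as maximum allowed limits); the constraint name and the values are hypothetical.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: require-container-limits   # Hypothetical name
spec:
  enforcementAction: dryrun        # Record violations only; never block or warn
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    cpu: "2"        # Illustrative ceiling, assumed parameter name
    memory: "4Gi"   # Illustrative ceiling, assumed parameter name
```

With this in place, the audit controller lists offending resources in the constraint's `status.violations`, which gives a baseline to clear before switching `enforcementAction` to `deny`.
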

### Trivy Operator Configuration

```yaml
operator:
  vulnerabilityScannerEnabled: true
  configAuditScannerEnabled: true
  rbacAssessmentScannerEnabled: true
  exposedSecretScannerEnabled: true
  clusterComplianceEnabled: true

  # Disabled for Talos (no systemd, no /var/lib/kubelet)
  infraAssessmentScannerEnabled: false

  # Metrics for Prometheus
  metricsFindingsEnabled: true
  metricsConfigAuditInfo: true
```

**Scan Reports (CRDs):**

- `VulnerabilityReport` - CVEs in container images
- `ConfigAuditReport` - Kubernetes misconfigurations
- `RbacAssessmentReport` - RBAC privilege issues
- `ExposedSecretReport` - Secrets embedded in container images

### Grafana Dashboards

| Dashboard | Source |
|-----------|--------|
| Gatekeeper Overview | Grafana ID 15763 |
| Gatekeeper Violations | Grafana ID 14828 |
| Trivy Vulnerabilities | Grafana ID 17813 |
| Trivy Image Scan | Custom |

### Namespace Exclusions

System namespaces excluded from strict policies:

| Namespace | Reason |
|-----------|--------|
| `kube-system` | Core Kubernetes components |
| `gatekeeper-system` | Gatekeeper itself |
| `longhorn-system` | Storage requires privileges |
| `gpu-operator` | GPU drivers require privileges |
| `cilium-secrets` | CNI requires host networking |
| `observability` | Some collectors need host access |

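
Repeating this list in every constraint is the maintenance cost noted under Negative Consequences. One possible alternative, sketched here and not confirmed as what this cluster does, is Gatekeeper's cluster-wide `Config` resource, which exempts namespaces from Gatekeeper processing altogether.

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config                    # Gatekeeper only reads a Config with this name
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces:
        - kube-system
        - gatekeeper-system
        - longhorn-system
        - gpu-operator
        - cilium-secrets
        - observability
      processes: ["*"]            # Applies to audit, webhook, and sync
```

The trade-off is granularity: a global exemption hides these namespaces from every policy, whereas per-constraint `excludedNamespaces` lets a namespace skip the privileged-container rule while still being checked for labels or limits.
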

### Talos-Specific Considerations

Trivy's `node-collector` is disabled because Talos:
- Has no `/etc/systemd` (uses custom init)
- Has no standard `/var/lib/kubelet` path
- Is immutable (read-only root filesystem)

This is acceptable because Talos itself is security-hardened by design.

## Alerting Strategy

**Prometheus Alerts:**
```yaml
- alert: HighSeverityVulnerability
  # trivy_image_vulnerabilities is exported when metricsFindingsEnabled is true
  expr: trivy_image_vulnerabilities{severity="Critical"} > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Critical vulnerability detected"

- alert: GatekeeperViolation
  # gatekeeper_violations is a gauge, so use delta() rather than increase()
  expr: delta(gatekeeper_violations[1h]) > 0
  for: 5m
  labels:
    severity: info
  annotations:
    summary: "Policy violation detected"
```
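
If the cluster runs the Prometheus Operator (an assumption; the ADR does not say how rules are loaded), alerts like these are typically shipped as a `PrometheusRule` resource. The sketch below uses a hypothetical resource name and group name, and places the rule in the `observability` namespace listed earlier. It also narrows the Gatekeeper alert to constraints already in `deny` mode through the `enforcement_action` label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-policy-alerts    # Hypothetical name
  namespace: observability
spec:
  groups:
    - name: security-policy       # Hypothetical group name
      rules:
        - alert: GatekeeperDenyViolation
          # Audit still reports pre-existing resources that violate
          # constraints whose enforcementAction is deny.
          expr: sum(gatekeeper_violations{enforcement_action="deny"}) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Resources violate an enforced Gatekeeper constraint"
```

Depending on how Prometheus is configured, the resource may also need labels matching its `ruleSelector` before it is picked up.
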

## Future Enhancements

1. **Move to `deny` enforcement** once baseline violations are resolved
2. **Add network policies** via Cilium for workload isolation
3. **Integrate Falco** for runtime threat detection
4. **Add SBOM generation** with Trivy for supply chain visibility

## References

* [OPA Gatekeeper](https://open-policy-agent.github.io/gatekeeper/)
* [Trivy Operator](https://aquasecurity.github.io/trivy-operator/)
* [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
* [Talos Security](https://www.talos.dev/v1.6/introduction/what-is-talos/#security)