# Velero Backup and Disaster Recovery Strategy
* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities
## Context and Problem Statement
A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.
How do we implement backup and disaster recovery for the homelab cluster?
## Decision Drivers
* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility
## Considered Options
1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**
## Decision Outcome
Chosen option: **Option 1 - Velero with Node Agent (Kopia)**
Velero backs up and restores Kubernetes resources and performs file-level PVC backups through its Node Agent, which uses the Kopia uploader (Restic was the earlier default). Backups are written to the external NAS through its S3-compatible endpoint.
### Positive Consequences
* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up at file level (file system backup, not volume snapshots)
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with retention
* Selective namespace/label restore
* Active CNCF project with strong community
### Negative Consequences
* Node Agent runs as a DaemonSet, one pod per node (14 pods on the current cluster)
* File-level backup slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          Velero Server                          │
│                    (velero namespace)                           │
└────────────────────────────┬────────────────────────────────────┘
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
        ┌───────────┐  ┌───────────┐  ┌───────────┐
        │   Node    │  │   Node    │  │   Node    │
        │   Agent   │  │   Agent   │  │   Agent   │
        │ (per node)│  │ (per node)│  │ (per node)│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
               ┌───────────────────────────┐
               │   BackupStorageLocation   │
               │ (S3 on NAS - candlekeep)  │
               │      /backups/velero      │
               └───────────────────────────┘
```
## Configuration
### Schedule
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
    includedResources:
      - "*"
    excludeNodeAgent: false
    defaultVolumesToFsBackup: true
    ttl: 720h  # 30 days retention
```
### Backup Storage Location
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```
## Backup Scope
### Included
| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |
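
The decision drivers ask for application-consistent database backups, but a file-level copy of a live volume is at best crash-consistent. One possible way to narrow that gap is Velero's backup hook annotations, which run a command inside a container before and after the pod's volumes are backed up. The following is a minimal sketch, not the cluster's actual configuration: the pod name, namespace, images, claim name, and mount path are hypothetical, and `fsfreeze` needs a privileged container.

```yaml
# Hypothetical pod: every name, image, and path here is illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  namespace: example
  annotations:
    # Run the hooks in the helper container, not the application container.
    pre.hook.backup.velero.io/container: fsfreeze
    # Freeze the filesystem so files are quiescent while they are read.
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/data"]'
    post.hook.backup.velero.io/container: fsfreeze
    # Thaw the filesystem once the backup of this pod has finished.
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/data"]'
spec:
  containers:
    - name: app
      image: example/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
    - name: fsfreeze
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true  # required by fsfreeze
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-data
```

For workloads managed by a controller, the same annotations would go on the pod template rather than on a bare Pod.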
### Excluded
| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |
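
Because the schedule sets `defaultVolumesToFsBackup: true`, pod volumes are backed up unless a pod opts out. A sketch of how ephemeral, node-local volumes could be excluded uses Velero's `backup.velero.io/backup-volumes-excludes` annotation. The pod, image, and volume names below are hypothetical.

```yaml
# Hypothetical pod: names and image are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: example-worker
  namespace: example
  annotations:
    # Comma-separated volume names that file system backup should skip.
    backup.velero.io/backup-volumes-excludes: scratch,model-cache
spec:
  containers:
    - name: worker
      image: example/worker:latest
      volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
        - name: model-cache
          mountPath: /cache
  volumes:
    - name: scratch
      emptyDir: {}
    - name: model-cache
      emptyDir: {}
```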
## Recovery Procedures
### Full Cluster Recovery
1. Bootstrap new Talos cluster
2. Install Velero with same BSL configuration
3. `velero restore create --from-backup nightly-cluster-backup-<YYYYMMDDhhmmss>` (scheduled backups are named `<schedule>-<timestamp>`)
4. Re-bootstrap Flux for GitOps reconciliation
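
During step 2 it may be worth installing the BackupStorageLocation in read-only mode, so the new cluster cannot create or prune backup objects while the restore runs. This is a sketch that assumes the same BSL settings as in the Configuration section; only `accessMode` is added.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  # Prevent backups from being written or deleted during the restore.
  # Switch back to ReadWrite once recovery is verified.
  accessMode: ReadOnly
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```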
### Selective Namespace Recovery
```bash
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
```
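
The same restore can also be expressed declaratively as a Velero `Restore` resource, sketched here as a rough equivalent of the command above. The resource name is hypothetical. A `Restore` is a one-shot object, so committing it to a Flux-managed path would need some care to avoid replaying it.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ai-ml-restore-20260205  # hypothetical name
  namespace: velero
spec:
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  restorePVs: true
  # Leave resources that already exist in the cluster untouched.
  existingResourcePolicy: none
```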
### Database Recovery (CNPG)
For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR). CNPG handles its own WAL archiving to S3; Velero provides a secondary backup layer.
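
As an illustration of what a native CNPG point-in-time recovery could look like, the sketch below bootstraps a new cluster from an object-store archive. It assumes the source cluster archives with CNPG's `barmanObjectStore` configuration. The cluster names, bucket path, endpoint, secret name, storage size, and target time are all hypothetical placeholders, not values taken from this homelab.

```yaml
# Hypothetical recovery cluster: all names, paths, and times are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-restore
  namespace: example
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      # Must match an entry in externalClusters below.
      source: example-db
      recoveryTarget:
        # Replay WAL up to this point in time.
        targetTime: "2026-02-05 01:30:00.00000+00"
  externalClusters:
    - name: example-db
      barmanObjectStore:
        destinationPath: s3://example-cnpg-backups/
        endpointURL: http://s3.example.internal:9000
        s3Credentials:
          accessKeyId:
            name: example-cnpg-s3
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: example-cnpg-s3
            key: ACCESS_SECRET_KEY
```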
## Monitoring
| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |
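
The thresholds above could be encoded as alert rules along the following lines. This is a sketch that assumes the Prometheus Operator's `PrometheusRule` CRD is installed and that Velero's metrics carry a `schedule` label. The rule and alert names are hypothetical, and `velero_backup_duration_seconds` is assumed to be the histogram behind the "Backup duration" row.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts  # hypothetical name
  namespace: velero
spec:
  groups:
    - name: velero-backups
      rules:
        # No successful nightly backup within 25 hours.
        - alert: VeleroBackupMissing
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) == 0
          labels:
            severity: critical
        # Any failed backup in the last hour.
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
        # Average backup duration over the window exceeds 4 hours (14400 s).
        - alert: VeleroBackupSlow
          expr: |
            increase(velero_backup_duration_seconds_sum{schedule="nightly-cluster-backup"}[25h])
              / increase(velero_backup_duration_seconds_count{schedule="nightly-cluster-backup"}[25h])
              > 14400
          labels:
            severity: warning
```

The first rule does not fire if the metric is absent altogether, so a separate `absent()` check may be needed to catch a Velero server that is not running.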
## Links
* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy