updating to match everything in my homelab.
decisions/0032-velero-backup-strategy.md
# Velero Backup and Disaster Recovery Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities

## Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

## Decision Drivers

* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility

## Considered Options

1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**

## Decision Outcome

Chosen option: **Option 1 - Velero with Node Agent (Kopia)**

Velero provides comprehensive Kubernetes backup and restore, with file-level PVC backups handled by the Node Agent (previously the Restic DaemonSet, now using Kopia as the uploader). Backups are stored on the external NAS through its S3-compatible endpoint.

### Positive Consequences

* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up at file level by the Node Agent
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with 30-day retention
* Selective namespace/label restore
* Active CNCF project with strong community

### Negative Consequences

* Node Agent runs as DaemonSet (14 pods on current cluster)
* File-level backup slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Velero Server                          │
│                       (velero namespace)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
        ┌───────────┐  ┌───────────┐  ┌───────────┐
        │   Node    │  │   Node    │  │   Node    │
        │   Agent   │  │   Agent   │  │   Agent   │
        │ (per node)│  │ (per node)│  │ (per node)│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                             ▼
               ┌───────────────────────────┐
               │  BackupStorageLocation    │
               │  (S3 on NAS - candlekeep) │
               │  /backups/velero          │
               └───────────────────────────┘
```

## Configuration

### Schedule

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
      - flux-system  # rebuilt from Git, see Backup Scope
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true  # back up all pod volumes via the Node Agent
    ttl: 720h  # 30 days retention
```

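The schedule copies volume contents at file level, which on its own is crash-consistent at best. One way to approach the application-consistent driver is Velero's backup hook annotations, which run a command in a container before and after a pod is backed up. The sketch below follows the fsfreeze pattern from the Velero documentation. The pod name, namespace, images, PVC name, and mount path are placeholders, not values from this cluster, and whether freezing the filesystem is appropriate depends on the workload.

```yaml
# Hypothetical pod: names, images, and paths are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  namespace: example
  annotations:
    # Freeze the filesystem before the volume is read
    pre.hook.backup.velero.io/container: fsfreeze
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/data"]'
    # Thaw it again once the backup of this pod has finished
    post.hook.backup.velero.io/container: fsfreeze
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/data"]'
spec:
  containers:
    - name: app
      image: example/app:latest
      volumeMounts:
        - name: data
          mountPath: /data
    - name: fsfreeze
      image: ubuntu:24.04
      command: ["/bin/bash", "-c", "sleep infinity"]
      securityContext:
        privileged: true  # fsfreeze needs elevated privileges
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-data
```
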
### Backup Storage Location

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

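The BackupStorageLocation does not show where the S3 credentials come from. A common arrangement with the AWS plugin is a Secret holding an AWS-style credentials file, referenced through the BSL's `spec.credential` field. The following is a sketch under that assumption: the Secret name `velero-s3-credentials` and the placeholder keys are hypothetical, and in a GitOps repository the Secret would normally be encrypted or synced from a secret store rather than committed in plain text.

```yaml
# Hypothetical Secret: name and key values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credentials
  namespace: velero
stringData:
  cloud: |
    [default]
    aws_access_key_id = <access-key>
    aws_secret_access_key = <secret-key>
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  credential:
    name: velero-s3-credentials  # Secret above
    key: cloud                   # key holding the credentials file
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```
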
## Backup Scope

### Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |

### Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |

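Because the schedule sets `defaultVolumesToFsBackup: true`, every eligible pod volume is backed up unless it is opted out. Velero's `backup.velero.io/backup-volumes-excludes` pod annotation is one way to keep scratch or cache volumes out of the backup. A sketch, with a hypothetical workload and hypothetical volume names:

```yaml
# Hypothetical workload: only the annotation is the point of this example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
  namespace: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-worker
  template:
    metadata:
      labels:
        app: example-worker
      annotations:
        # Comma-separated volume names to skip during file system backup
        backup.velero.io/backup-volumes-excludes: cache,tmp
    spec:
      containers:
        - name: worker
          image: example/worker:latest
          volumeMounts:
            - name: data
              mountPath: /data
            - name: cache
              mountPath: /cache
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-worker-data
        - name: cache
          emptyDir: {}
        - name: tmp
          emptyDir: {}
```
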
## Recovery Procedures

### Full Cluster Recovery

1. Bootstrap a new Talos cluster
2. Install Velero with the same BSL configuration
3. `velero restore create --from-backup nightly-cluster-backup-YYYYMMDDhhmmss`
4. Re-bootstrap Flux for GitOps reconciliation

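Step 3 can also be expressed declaratively as a `Restore` resource, which makes the options explicit and reviewable. This is a sketch rather than a tested runbook: the resource name is hypothetical, the backup name reuses the example timestamp from the selective restore in this ADR, and excluding `flux-system` assumes Flux is re-bootstrapped from Git in step 4.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: full-cluster-restore  # hypothetical name
  namespace: velero
spec:
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - "*"
  excludedNamespaces:
    - flux-system  # assumption: rebuilt by Flux bootstrap, not restored
  restorePVs: true
  existingResourcePolicy: none  # leave resources that already exist untouched
```
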
### Selective Namespace Recovery

```bash
# Restore only the ai-ml namespace, including its volumes
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
```

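To check a backup without touching the live namespace, the same restore can be pointed at a scratch namespace using the `namespaceMapping` field of the `Restore` resource (the CLI equivalent is `--namespace-mappings`). A sketch, where `ai-ml-restore-test` is a hypothetical target namespace:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ai-ml-restore-test  # hypothetical name
  namespace: velero
spec:
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  namespaceMapping:
    ai-ml: ai-ml-restore-test  # restore into a scratch namespace
  restorePVs: true
```
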
### Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR):
```bash
# CNPG handles its own WAL archiving to S3
# Velero provides a secondary backup layer
```

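As a rough illustration of what native PITR looks like, a CNPG `Cluster` can bootstrap from the object store and replay WAL up to a target time. Everything below is an assumption for illustration: the cluster names, namespace, bucket path, Secret name, storage size, and timestamp are hypothetical, the endpoint simply reuses the NAS URL from the BackupStorageLocation as a guess, and the in-tree `barmanObjectStore` fields shown here may be superseded by the Barman Cloud plugin depending on the CNPG version in use.

```yaml
# Hypothetical recovery cluster: all names, paths, and times are placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-restored
  namespace: example
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: example-db  # refers to the externalClusters entry below
      recoveryTarget:
        targetTime: "2026-02-05 01:30:00+00"
  externalClusters:
    - name: example-db
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/example-db
        endpointURL: http://candlekeep.lab.daviestechlabs.io:9000
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
```
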
## Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |

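The thresholds in the table could be encoded as Prometheus alerting rules. The sketch below assumes the Prometheus Operator's `PrometheusRule` CRD is available and that Velero's metrics are scraped with their `schedule` label. Rule names, severities, the one-hour failure window, and the averaging approach for duration are illustrative choices, not settings from this cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts  # hypothetical name
  namespace: velero
spec:
  groups:
    - name: velero.backups
      rules:
        # No successful backup within 25 hours
        - alert: VeleroBackupMissing
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) < 1
          labels:
            severity: critical
          annotations:
            summary: No successful nightly backup in the last 25 hours
        # Any failed backup
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: A nightly backup has failed
        # Average duration over the window above 4 hours (14400 seconds)
        - alert: VeleroBackupSlow
          expr: |
            increase(velero_backup_duration_seconds_sum{schedule="nightly-cluster-backup"}[25h])
              / increase(velero_backup_duration_seconds_count{schedule="nightly-cluster-backup"}[25h])
              > 14400
          labels:
            severity: warning
          annotations:
            summary: Average backup duration over the last 25 hours exceeded 4 hours
```
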
## Links

* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy