# Velero Backup and Disaster Recovery Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities

## Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

## Decision Drivers

* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility

## Considered Options

1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**

## Decision Outcome

Chosen option: **Option 1 - Velero with Node Agent (Kopia)**

Velero provides comprehensive Kubernetes backup and restore, with file-level PVC backups handled by the Node Agent (formerly the Restic integration, now using Kopia as the uploader). Backups are stored off-cluster on the external NAS through its S3-compatible endpoint.

### Positive Consequences

* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up via file-level backups
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with retention
* Selective namespace/label restore
* Active CNCF project with strong community

### Negative Consequences

* Node Agent runs as a DaemonSet, one pod per node (14 pods on the current cluster)
* File-level backup is slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          Velero Server                          │
│                       (velero namespace)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
        ┌───────────┐  ┌───────────┐  ┌───────────┐
        │   Node    │  │   Node    │  │   Node    │
        │   Agent   │  │   Agent   │  │   Agent   │
        │ (per node)│  │ (per node)│  │ (per node)│
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                             ▼
               ┌───────────────────────────┐
               │   BackupStorageLocation   │
               │ (S3 on NAS - candlekeep)  │
               │      /backups/velero      │
               └───────────────────────────┘
```

## Configuration

### Schedule

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
    includedResources:
      - "*"
    excludeNodeAgent: false
    defaultVolumesToFsBackup: true
    ttl: 720h  # 30 days retention
```

### Backup Storage Location

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

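The BackupStorageLocation above does not show how Velero authenticates to the S3 endpoint. One possible wiring, sketched here under assumptions, is a Secret in the AWS shared-credentials format referenced from the BSL's `credential` field. The Secret name `velero-s3-credentials`, the key `cloud`, and the placeholder access keys are hypothetical; in a GitOps repository the Secret would normally be committed encrypted (for example with SOPS) rather than in plain text.

```yaml
# Hypothetical Secret holding S3 credentials in AWS shared-credentials format.
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credentials   # assumed name
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=REPLACE_ME
    aws_secret_access_key=REPLACE_ME
---
# Same BackupStorageLocation as above, with a credential reference added.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  credential:
    name: velero-s3-credentials   # must match the Secret name
    key: cloud                    # key inside the Secret
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

Referencing the Secret per BSL keeps the credentials next to the storage location they belong to, which may help if a second location is added later.
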
## Backup Scope

### Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| Custom resources | CNPG Clusters, RayServices, HelmReleases | Velero native |

### Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |

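Because the schedule sets `defaultVolumesToFsBackup: true`, pod volumes are backed up by default and ephemeral data has to be opted out. As a hedged sketch, the pod annotation `backup.velero.io/backup-volumes-excludes` lists volume names for the Node Agent to skip. The Deployment, namespace, image, volume, and PVC names below are hypothetical placeholders, not workloads from this cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical workload
  namespace: example           # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Comma-separated volume names to skip during file-system backup.
        backup.velero.io/backup-volumes-excludes: scratch
    spec:
      containers:
        - name: app
          image: busybox:1.36
          command: ["sleep", "3600"]
          volumeMounts:
            - name: scratch
              mountPath: /scratch
            - name: data
              mountPath: /data
      volumes:
        - name: scratch
          emptyDir: {}                # ephemeral, skipped via the annotation
        - name: data
          persistentVolumeClaim:
            claimName: example-data   # assumed PVC, still backed up
```

Whole resources can be opted out in a similar way with the `velero.io/exclude-from-backup: "true"` label.
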
## Recovery Procedures

### Full Cluster Recovery

1. Bootstrap a new Talos cluster
2. Install Velero with the same BackupStorageLocation (BSL) configuration
3. `velero restore create --from-backup nightly-cluster-backup-YYYYMMDDhhmmss`
4. Re-bootstrap Flux for GitOps reconciliation

### Selective Namespace Recovery

```bash
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
```

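For GitOps-style workflows the same selective restore can also be expressed declaratively. The manifest below is a minimal sketch of a Velero `Restore` resource for the `ai-ml` namespace; the resource name `ai-ml-restore-20260205` is a hypothetical choice, and the backup name assumes the nightly backup from the command above exists.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ai-ml-restore-20260205   # hypothetical name
  namespace: velero
spec:
  # Backup created by the nightly-cluster-backup schedule.
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  # Also restore persistent volumes captured in the backup.
  restorePVs: true
```

A Restore is a one-shot operation, so applying it by hand with `kubectl apply -f` is usually a better fit than leaving it under continuous Flux reconciliation.
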
### Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR):
```bash
# CNPG handles its own WAL archiving to S3
# Velero provides a secondary backup layer
```

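To illustrate what native PITR can look like, here is a hedged sketch of a new CNPG `Cluster` that bootstraps from an existing cluster's object-store archive and replays WAL up to a target time. Everything specific is an assumption for illustration: the cluster names `app-db` and `app-db-restored`, the `databases` namespace, the `cnpg-backups` bucket, the credentials Secret, the storage size, the target timestamp, and the idea that CNPG archives to the same NAS endpoint. It also assumes the in-tree `barmanObjectStore` configuration; newer CNPG releases may favour the Barman Cloud plugin instead.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restored          # hypothetical new cluster
  namespace: databases           # hypothetical namespace
spec:
  instances: 3
  storage:
    size: 20Gi                   # assumed size
  bootstrap:
    recovery:
      source: app-db             # refers to the externalClusters entry below
      recoveryTarget:
        # Replay WAL up to this point in time (assumed timestamp).
        targetTime: "2026-02-05T01:30:00Z"
  externalClusters:
    - name: app-db               # original cluster whose archive is read
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/   # assumed bucket
        endpointURL: http://candlekeep.lab.daviestechlabs.io:9000
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials         # assumed Secret
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
```

Recovering into a new cluster name leaves the original archive untouched, so a failed attempt can be repeated with a different target time.
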
## Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |

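The thresholds in the table could be turned into alerts roughly as follows. This is a sketch that assumes the Prometheus Operator is installed (hence the `PrometheusRule` kind) and that Velero's metrics are already scraped. The rule name, alert names, severities, the 1h failure window, and the use of `velero_backup_duration_seconds` averaged over 25h for the duration check are assumptions, not settings taken from this cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts      # hypothetical name
  namespace: velero
spec:
  groups:
    - name: velero.backup
      rules:
        # No successful nightly backup within 25 hours.
        - alert: VeleroBackupMissing
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) == 0
          labels:
            severity: critical
          annotations:
            summary: No successful nightly Velero backup in the last 25 hours.
        # Any failed backup (assumed 1h look-back window).
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: A nightly Velero backup has failed.
        # Average backup duration above 4 hours (14400 seconds).
        - alert: VeleroBackupSlow
          expr: |
            increase(velero_backup_duration_seconds_sum{schedule="nightly-cluster-backup"}[25h])
              /
            increase(velero_backup_duration_seconds_count{schedule="nightly-cluster-backup"}[25h])
              > 14400
          labels:
            severity: warning
          annotations:
            summary: Nightly Velero backup is taking longer than 4 hours.
```

The 25h window on the first rule gives a nightly job one hour of slack before the alert fires.
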
## Links

* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy