# Velero Backup and Disaster Recovery Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities

## Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

## Decision Drivers

* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility

## Considered Options

1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**

## Decision Outcome

Chosen option: **Option 1 - Velero with Node Agent (Kopia)**

Velero provides comprehensive Kubernetes backup and restore, with file-level PVC backups performed by the Node Agent using the Kopia uploader (earlier Velero releases used Restic). Backups are stored off-cluster in an S3-compatible bucket on the external NAS.

### Positive Consequences

* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up via file-level (file system) backup rather than volume snapshots
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with retention
* Selective namespace/label restore
* Active CNCF project with strong community

### Negative Consequences

* Node Agent runs as DaemonSet (14 pods on current cluster)
* File-level backup slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Velero Server                             │
│                        (velero namespace)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
       ┌───────────┐  ┌───────────┐  ┌───────────┐
       │   Node    │  │   Node    │  │   Node    │
       │   Agent   │  │   Agent   │  │   Agent   │
       │ (per node)│  │ (per node)│  │ (per node)│
       └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
             │              │              │
             └──────────────┼──────────────┘
                            │
                            ▼
              ┌───────────────────────────┐
              │  BackupStorageLocation    │
              │  (S3 on NAS - candlekeep) │
              │  /backups/velero          │
              └───────────────────────────┘
```

## Configuration

### Schedule

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
      - flux-system  # rebuilt from Git (GitOps)
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true  # back up all pod volumes via the Node Agent (opt-out model)
    ttl: 720h  # 30 days retention
```
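
The decision drivers call for application-consistent backups, which the schedule alone does not provide. One way to approach this is with Velero's backup hook annotations, which run a command inside a container before the pod's volumes are copied. The sketch below is illustrative only: the pod, namespace, PVC name, image, and the choice of `redis-cli SAVE` as the quiesce command are all assumptions, not part of this cluster's configuration.

```yaml
# Hypothetical pod; names, image, and hook command are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-cache              # hypothetical
  namespace: example-app           # hypothetical
  annotations:
    # Container in which Velero executes the hook
    pre.hook.backup.velero.io/container: redis
    # JSON array; flushes the dataset to disk before the volume is backed up
    pre.hook.backup.velero.io/command: '["redis-cli", "SAVE"]'
    pre.hook.backup.velero.io/timeout: 60s
    # Mark the backup as failed if the hook fails
    pre.hook.backup.velero.io/on-error: Fail
spec:
  containers:
    - name: redis
      image: redis:7
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-cache-data   # hypothetical PVC
```

In practice the annotations would normally sit on the pod template of a Deployment or StatefulSet so that they survive pod restarts.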

### Backup Storage Location

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

## Backup Scope

### Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |

### Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |

## Recovery Procedures

### Full Cluster Recovery

1. Bootstrap new Talos cluster
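2. Install Velero with the same BSL configuration so the existing backups are discovered
3. `velero restore create --from-backup nightly-cluster-backup-YYYYMMDDhhmmss` (scheduled backups are named `<schedule>-<timestamp>`)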
4. Re-bootstrap Flux for GitOps reconciliation

### Selective Namespace Recovery

```bash
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
```
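
For a declarative workflow, the same selective restore can be expressed as a `Restore` resource. This is a sketch under the assumption that the backup named above exists; the `metadata.name` is a hypothetical choice.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-ai-ml-20260205     # hypothetical name
  namespace: velero
spec:
  # Backup produced by the nightly-cluster-backup schedule
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  # Also restore persistent volume data
  restorePVs: true
```

A `Restore` is a one-shot operation, so it is usually applied by hand with `kubectl apply` rather than left under continuous GitOps reconciliation.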

### Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR). CNPG handles its own WAL archiving to S3, and Velero provides a secondary backup layer.
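
As a rough sketch of what a CNPG point-in-time recovery can look like, the manifest below bootstraps a new cluster from an object-store archive. Every name, the bucket path, the endpoint URL, the Secret keys, the storage size, and the target time are placeholders, and it assumes the original cluster archived with the in-tree `barmanObjectStore` configuration; newer CNPG releases favor the Barman Cloud plugin, so check the installed version.

```yaml
# Illustrative only: all names, paths, and the timestamp are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-restored        # hypothetical new cluster
  namespace: example-app           # hypothetical
spec:
  instances: 1
  storage:
    size: 10Gi                     # placeholder size
  bootstrap:
    recovery:
      source: example-db           # must match an externalClusters entry
      recoveryTarget:
        targetTime: "2026-02-05T01:30:00Z"   # placeholder recovery point
  externalClusters:
    - name: example-db             # name of the original cluster
      barmanObjectStore:
        destinationPath: s3://example-cnpg-backups/    # placeholder bucket
        endpointURL: http://s3.example.internal:9000   # placeholder endpoint
        s3Credentials:
          accessKeyId:
            name: example-s3-creds                     # hypothetical Secret
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: example-s3-creds
            key: ACCESS_SECRET_KEY
```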

## Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |
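
The thresholds above could be encoded roughly as follows, assuming the Prometheus Operator is installed and Velero's metrics are being scraped. The rule name, alert names, severities, and the 1h failure window are assumptions. The duration rule uses the `velero_backup_duration_seconds` histogram and measures the mean duration over the window, which approximates the last run only when a single backup runs per day.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts       # hypothetical
  namespace: velero
spec:
  groups:
    - name: velero-backup
      rules:
        - alert: VeleroBackupMissing
          # No successful nightly backup within 25 hours
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) < 1
          labels:
            severity: critical
          annotations:
            summary: No successful nightly Velero backup in the last 25 hours
        - alert: VeleroBackupFailed
          # Any new failure in the last hour (window is an assumption)
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: A nightly Velero backup failed
        - alert: VeleroBackupSlow
          # Mean backup duration over the window above 4 hours (14400 s)
          expr: |
            increase(velero_backup_duration_seconds_sum{schedule="nightly-cluster-backup"}[25h])
              / increase(velero_backup_duration_seconds_count{schedule="nightly-cluster-backup"}[25h])
              > 14400
          labels:
            severity: warning
          annotations:
            summary: Nightly Velero backup is taking longer than 4 hours
```

Note that `VeleroBackupMissing` cannot fire if the series is absent altogether (for example, when Velero itself is down), so a separate `absent()` or target-down alert may be worth adding.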

## Links

* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy