Velero Backup and Disaster Recovery Strategy

  • Status: accepted
  • Date: 2026-02-05
  • Deciders: Billy
  • Technical Story: Establish cluster backup and disaster recovery capabilities

Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

Decision Drivers

  • Full cluster state backup - resources, secrets, PVCs
  • Application-consistent backups for databases
  • S3-compatible storage for off-cluster backups
  • Scheduled automated backups
  • Selective restore capability
  • GitOps compatibility

Considered Options

  1. Velero with Node Agent (Kopia)
  2. Kasten K10
  3. Longhorn snapshots only
  4. etcd snapshots + manual PVC backups

Decision Outcome

Chosen option: Option 1 - Velero with Node Agent (Kopia)

Velero provides comprehensive Kubernetes backup and restore, with file-level PVC backups handled by the Node Agent, which uses Kopia as its uploader (earlier Velero releases used Restic). Backups are stored off-cluster on the external NAS through its S3-compatible endpoint.

Positive Consequences

  • Full cluster state captured (deployments, secrets, configmaps)
  • PVC data backed up via file-level (file system) backups
  • S3 backend on NAS for off-cluster storage
  • Scheduled daily backups with retention
  • Selective namespace/label restore
  • Active CNCF project with strong community

Negative Consequences

  • Node Agent runs as DaemonSet (14 pods on current cluster)
  • File-level backup slower than volume snapshots
  • Full cluster restore requires careful ordering
  • Some CRDs may need special handling

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Velero Server                             │
│                        (velero namespace)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
       ┌───────────┐  ┌───────────┐  ┌───────────┐
       │   Node    │  │   Node    │  │   Node    │
       │   Agent   │  │   Agent   │  │   Agent   │
       │ (per node)│  │ (per node)│  │ (per node)│
       └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
             │              │              │
             └──────────────┼──────────────┘
                            │
                            ▼
              ┌───────────────────────────┐
              │  BackupStorageLocation    │
              │  (S3 on NAS - candlekeep) │
              │  /backups/velero          │
              └───────────────────────────┘

Configuration

Schedule

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
      - flux-system  # rebuilt from Git (see Excluded under Backup Scope)
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true
    ttl: 720h  # 30 days retention

Backup Storage Location

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
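
The BackupStorageLocation above does not show how Velero authenticates to the S3 endpoint. As a minimal sketch, assuming the credentials live in a Secret in the velero namespace, the location can point at it through `spec.credential`. The Secret name `velero-s3-credentials` and the placeholder key values are hypothetical and not taken from the cluster's actual configuration.

```yaml
# Hypothetical Secret holding S3 credentials in AWS shared-credentials format.
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credentials   # assumed name, not from the original ADR
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id = <ACCESS_KEY_ID>
    aws_secret_access_key = <SECRET_ACCESS_KEY>
---
# Same location as above, with an explicit credential reference added.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  credential:
    name: velero-s3-credentials  # Secret name (assumption)
    key: cloud                   # key inside the Secret
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

In a GitOps repository the Secret would normally be stored encrypted (for example with SOPS) rather than in plain text.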

Backup Scope

Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |

Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |
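
Because the schedule sets `defaultVolumesToFsBackup: true`, every pod volume is backed up unless it is opted out. One way to keep ephemeral data out of a backup is Velero's `backup.velero.io/backup-volumes-excludes` pod annotation. The sketch below is illustrative only: the Deployment, namespace, image, and volume names are all hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: example       # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Comma-separated volume names to skip during file system backup.
        backup.velero.io/backup-volumes-excludes: scratch
    spec:
      containers:
        - name: app
          image: busybox:1.36
          command: ["sleep", "3600"]
          volumeMounts:
            - name: data
              mountPath: /data
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: data               # backed up by the Node Agent
          persistentVolumeClaim:
            claimName: example-data
        - name: scratch            # ephemeral, excluded by the annotation
          emptyDir: {}
```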

Recovery Procedures

Full Cluster Recovery

  1. Bootstrap new Talos cluster
  2. Install Velero with same BSL configuration
  3. velero restore create --from-backup nightly-cluster-backup-YYYYMMDDhhmmss
  4. Re-bootstrap Flux for GitOps reconciliation

Selective Namespace Recovery

velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
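
Since GitOps compatibility is one of the decision drivers, the same selective restore can also be expressed declaratively as a Velero `Restore` resource. This is a sketch under assumptions: the resource name `restore-ai-ml` is hypothetical, and the backup name simply reuses the example timestamp from the command above.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-ai-ml      # hypothetical name
  namespace: velero
spec:
  # Backup created by the nightly schedule (example timestamp).
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  # Restore persistent volumes along with the namespaced resources.
  restorePVs: true
```

A Restore is a one-shot operation, so a manifest like this is usually applied by hand during an incident rather than reconciled continuously by Flux.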

Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR). CNPG handles its own WAL archiving to S3, so Velero serves only as a secondary backup layer for these databases.
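
A PITR restore with CNPG is typically done by bootstrapping a new cluster from the archived base backups and WAL. The following is a rough sketch, assuming the in-tree `barmanObjectStore` configuration (newer CNPG releases are moving this to the Barman Cloud plugin). The cluster names, bucket path, Secret name, sizes, and target time are all hypothetical, and the endpoint is assumed to be the same NAS used for Velero.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restored            # hypothetical new cluster
  namespace: example               # hypothetical namespace
spec:
  instances: 1
  storage:
    size: 10Gi                     # assumed size
  bootstrap:
    recovery:
      source: app-db               # must match an externalClusters entry
      recoveryTarget:
        # Recover to just before the incident (example timestamp).
        targetTime: "2026-02-05T01:30:00Z"
  externalClusters:
    - name: app-db                 # name of the original cluster in the archive
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/            # assumed bucket
        endpointURL: http://candlekeep.lab.daviestechlabs.io:9000  # assumption
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials                  # hypothetical Secret
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
```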

Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| velero_backup_success_total | No increase in 25h |
| velero_backup_failure_total | Any increase |
| Backup duration | > 4 hours |
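
The first two thresholds could be expressed as Prometheus alerting rules. This sketch assumes the Prometheus Operator is installed (the `PrometheusRule` CRD) and that Velero metrics are already being scraped. The rule names, severities, and the 1h failure window are illustrative choices, not part of the original decision. The duration alert is left out because its expression depends on how the backup duration histogram is exported.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts       # hypothetical name
  namespace: velero
spec:
  groups:
    - name: velero.backups
      rules:
        # No successful nightly backup within 25 hours.
        - alert: VeleroBackupStale
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) == 0
          labels:
            severity: critical
          annotations:
            summary: No successful nightly Velero backup in the last 25h
        # Any failed backup for the nightly schedule.
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: A nightly Velero backup failed
```

Depending on the Prometheus `ruleSelector`, the resource may also need matching labels before the rules are picked up.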