daviestechlabs/homelab-design


Billy D. 80fb911e22 updating to match everything in my homelab.

2026-02-05 16:13:53 -05:00


Velero Backup and Disaster Recovery Strategy

Status: accepted
Date: 2026-02-05
Deciders: Billy
Technical Story: Establish cluster backup and disaster recovery capabilities

Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

Decision Drivers

Full cluster state backup - resources, secrets, PVCs
Application-consistent backups for databases
S3-compatible storage for off-cluster backups
Scheduled automated backups
Selective restore capability
GitOps compatibility

Considered Options

1. Velero with Node Agent (Kopia)
2. Kasten K10
3. Longhorn snapshots only
4. etcd snapshots + manual PVC backups

Decision Outcome

Chosen option: Option 1 - Velero with Node Agent (Kopia)

Velero provides comprehensive Kubernetes backup/restore with file-level PVC backups via the Node Agent, which uses Kopia as its uploader (earlier Velero releases used Restic). Backups are written to an S3-compatible bucket on the external NAS (candlekeep), so they live off the cluster.

Positive Consequences

Full cluster state captured (deployments, secrets, configmaps)
PVC data backed up via file-level snapshots
S3 backend on NAS for off-cluster storage
Scheduled daily backups with retention
Selective namespace/label restore
Active CNCF project with strong community

Negative Consequences

Node Agent runs as DaemonSet (14 pods on current cluster)
File-level backup slower than volume snapshots
Full cluster restore requires careful ordering
Some CRDs may need special handling

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Velero Server                             │
│                        (velero namespace)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
       ┌───────────┐  ┌───────────┐  ┌───────────┐
       │   Node    │  │   Node    │  │   Node    │
       │   Agent   │  │   Agent   │  │   Agent   │
       │ (per node)│  │ (per node)│  │ (per node)│
       └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
             │              │              │
             └──────────────┼──────────────┘
                            │
                            ▼
              ┌───────────────────────────┐
              │  BackupStorageLocation    │
              │  (S3 on NAS - candlekeep) │
              │  /backups/velero          │
              └───────────────────────────┘

Configuration

Schedule

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
      - flux-system  # rebuilt from Git, listed as excluded under Backup Scope
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true  # back up all pod volumes with the Node Agent unless a pod opts out
    ttl: 720h  # 30 days retention

Backup Storage Location

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000

Backup Scope

Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |
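
File-level backups of a running database are not guaranteed to be application-consistent unless writes are quiesced first. One possible way to approach the "application-consistent backups" driver for databases that lack their own backup tooling is Velero's backup hook annotations. The sketch below assumes a hypothetical workload `example-db` in a hypothetical namespace `example`, whose image ships hypothetical `/scripts/quiesce.sh` and `/scripts/resume.sh` helpers. Only the annotation keys come from Velero.

```yaml
# Sketch only: workload, namespace, image, and script paths are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-db
  namespace: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
      annotations:
        # Runs in the "db" container before Velero backs up this pod's volumes.
        pre.hook.backup.velero.io/container: db
        pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "/scripts/quiesce.sh"]'
        pre.hook.backup.velero.io/timeout: 2m
        # Mark the backup as failed if quiescing does not succeed.
        pre.hook.backup.velero.io/on-error: Fail
        # Runs after the pod's volumes have been backed up.
        post.hook.backup.velero.io/container: db
        post.hook.backup.velero.io/command: '["/bin/sh", "-c", "/scripts/resume.sh"]'
    spec:
      containers:
        - name: db
          image: example/db:latest
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-db-data
```

Annotations keep the hook next to the workload it protects, which fits a GitOps layout. The same hooks can alternatively be declared under `hooks` in the Schedule template.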

Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |
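
Because the schedule sets `defaultVolumesToFsBackup: true`, every pod volume is backed up unless the pod opts out. Ephemeral scratch volumes can be skipped with Velero's `backup.velero.io/backup-volumes-excludes` annotation. A minimal sketch, assuming a hypothetical pod `example-worker` with two scratch volumes named `cache` and `tmp`:

```yaml
# Sketch only: pod, namespace, image, and volume names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: example-worker
  namespace: example
  annotations:
    # Comma-separated volume names the Node Agent should skip.
    backup.velero.io/backup-volumes-excludes: cache,tmp
spec:
  containers:
    - name: worker
      image: example/worker:latest
      volumeMounts:
        - name: data
          mountPath: /data
        - name: cache
          mountPath: /cache
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: data            # still backed up by the Node Agent
      persistentVolumeClaim:
        claimName: example-worker-data
    - name: cache           # excluded by the annotation above
      emptyDir: {}
    - name: tmp             # excluded by the annotation above
      emptyDir: {}
```

For workloads managed by a Deployment or StatefulSet, the annotation belongs on the pod template rather than on the controller object.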

Recovery Procedures

Full Cluster Recovery

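1. Bootstrap a new Talos cluster.
2. Install Velero with the same BSL configuration so the existing backups in the bucket are discovered.
3. Restore the most recent nightly backup (scheduled backups are named after the schedule plus a YYYYMMDDhhmmss timestamp): velero restore create --from-backup nightly-cluster-backup-YYYYMMDDhhmmss
4. Re-bootstrap Flux for GitOps reconciliation.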
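
During a full restore it can help to stop the freshly installed Velero from writing to or expiring objects in the bucket while the cluster is being rebuilt. One way to do that, sketched here, is to apply the BackupStorageLocation from the Configuration section with `accessMode: ReadOnly`. `accessMode` is a standard BackupStorageLocation field; all other values are copied from the configuration above.

```yaml
# Sketch: same BSL as in Configuration, temporarily read-only for the restore.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  accessMode: ReadOnly   # switch back to ReadWrite once the restore has finished
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```

While the location is read-only, new backups to it are rejected, so the nightly schedule only resumes after the mode is reverted.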

Selective Namespace Recovery

velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes
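
The same selective restore can also be expressed declaratively as a Restore resource. This sketch mirrors the command above. The resource name and the `namespaceMapping` entry, which restores into a hypothetical scratch namespace `ai-ml-restore-test` instead of over the live one, are additions for illustration and can be dropped.

```yaml
# Sketch only: the Restore name and the target namespace are hypothetical.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ai-ml-restore-20260205
  namespace: velero
spec:
  backupName: nightly-cluster-backup-20260205020000
  includedNamespaces:
    - ai-ml
  restorePVs: true
  # Optional: restore into a scratch namespace to verify the data first.
  namespaceMapping:
    ai-ml: ai-ml-restore-test
```

Restoring into a separate namespace is a low-risk way to test that a backup is usable without touching the running workloads.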

Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR). CNPG handles its own WAL archiving to S3, so Velero provides only a secondary backup layer for these databases.
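
A PITR bootstrap might look like the sketch below. It assumes the in-tree `barmanObjectStore` configuration (newer CNPG releases favor the Barman Cloud plugin), a hypothetical source cluster `app-db` in a hypothetical namespace `example`, a hypothetical bucket `cnpg-backups` on the same NAS endpoint, and a hypothetical Secret `cnpg-s3-creds`. None of these names come from the actual cluster.

```yaml
# Sketch only: cluster, namespace, bucket, secret names, and target time are hypothetical.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore
  namespace: example
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: app-db                 # refers to the externalClusters entry below
      recoveryTarget:
        targetTime: "2026-02-05 01:30:00+00"   # replay WAL up to this point
  externalClusters:
    - name: app-db
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/
        endpointURL: http://candlekeep.lab.daviestechlabs.io:9000
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-creds
            key: ACCESS_SECRET_KEY
```

Recovering into a new cluster name leaves the original object store untouched, so a failed attempt can simply be deleted and retried with a different target time.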

Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |
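
These thresholds could be encoded as alert rules along the following lines. The sketch assumes the Prometheus Operator is installed (for the `PrometheusRule` CRD) and that Velero's metrics are already being scraped. The rule names, the severity labels, the 1h failure window, and the use of `velero_backup_duration_seconds` for the duration check are illustrative choices rather than settings taken from the cluster.

```yaml
# Sketch only: rule names, severities, and the 1h failure window are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts
  namespace: velero
spec:
  groups:
    - name: velero-backups
      rules:
        - alert: VeleroBackupMissing
          # No successful nightly backup recorded in the last 25 hours.
          expr: increase(velero_backup_success_total{schedule="nightly-cluster-backup"}[25h]) == 0
          labels:
            severity: critical
          annotations:
            summary: No successful nightly Velero backup in the last 25 hours
        - alert: VeleroBackupFailed
          # Any failed backup for the nightly schedule within the last hour.
          expr: increase(velero_backup_failure_total{schedule="nightly-cluster-backup"}[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: A nightly Velero backup failed
        - alert: VeleroBackupSlow
          # Average duration of recent nightly backups exceeds 4 hours.
          expr: |
            increase(velero_backup_duration_seconds_sum{schedule="nightly-cluster-backup"}[25h])
              / increase(velero_backup_duration_seconds_count{schedule="nightly-cluster-backup"}[25h])
              > 4 * 60 * 60
          labels:
            severity: warning
          annotations:
            summary: Nightly Velero backup is taking longer than 4 hours
```

The 25h window on the first rule gives a nightly backup one hour of slack before the alert fires. Note that the rule only evaluates when the metric series exists, so a Velero pod that is not running at all needs a separate availability alert.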
