Database Strategy with CloudNativePG

  • Status: accepted
  • Date: 2026-02-04
  • Deciders: Billy
  • Technical Story: Standardize PostgreSQL deployment for stateful applications

Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?

Decision Drivers

  • Operational simplicity - single operator to learn and manage
  • High availability - automatic failover for critical databases
  • Backup integration - consistent backup strategy across all databases
  • GitOps compatibility - declarative database provisioning
  • Resource efficiency - don't over-provision for homelab scale

Considered Options

  1. CloudNativePG for PostgreSQL
  2. Helm charts per application (Bitnami PostgreSQL)
  3. External managed database (RDS-style)
  4. SQLite where possible + single shared PostgreSQL

Decision Outcome

Chosen option: Option 1 - CloudNativePG for PostgreSQL

CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.

Positive Consequences

  • Single operator manages all PostgreSQL instances
  • Declarative Cluster CRD for GitOps deployment
  • Automatic failover with minimal data loss
  • PgBouncer connection pooling managed by the operator (Pooler resource)
  • Prometheus metrics and Grafana dashboards included
  • CNPG is a CNCF Sandbox project and actively maintained

Negative Consequences

  • PostgreSQL only (no MySQL/MariaDB support)
  • Operator adds resource overhead
  • Learning curve for CNPG-specific features

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        CNPG Operator                             │
│                     (cnpg-system namespace)                      │
└────────────────────────────┬────────────────────────────────────┘
                             │ Manages
                             ▼
┌──────────────────┬─────────────────┬─────────────────────────────┐
│                  │                 │                             │
▼                  ▼                 ▼                             ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  gitea-pg    │  │ authentik-db │  │companions-db │  │  mlflow-db   │
│ (3 instances)│  │ (3 instances)│  │ (3 instances)│  │ (1 instance) │
│              │  │              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │ └──────────┘ │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │              │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │              │
│ │ PgBouncer│ │  │ │ PgBouncer│ │  │ │ PgBouncer│ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
      │                  │                 │                │
      └──────────────────┼─────────────────┼────────────────┘
                         │                 │
                   ┌─────▼─────┐     ┌─────▼─────┐
                   │  Longhorn │     │ Longhorn  │
                   │   PVCs    │     │  Backups  │
                   └───────────┘     └───────────┘

Cluster Configuration Template

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3

  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
      
  # Connection pooling: PgBouncer is not configured in the Cluster spec.
  # CNPG deploys it through a separate Pooler resource that references
  # this cluster (poolMode: transaction, default_pool_size: "25").

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring
  monitoring:
    # Ask the operator to create a PodMonitor for the Prometheus Operator
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      # For MinIO or another non-AWS S3-compatible store, also set endpointURL
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
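
In CNPG, PgBouncer runs as its own resource (kind Pooler) that points at a cluster. The following is a minimal sketch of a read-write pooler for the template above, carrying the same transaction pool mode and pool size of 25. The resource name app-db-pooler-rw, the instance count, and max_client_conn are assumptions chosen for illustration, not values from this ADR.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw        # assumed name; the Service is named after it
spec:
  cluster:
    name: app-db                # the Cluster this pooler fronts
  instances: 2                  # assumption: two PgBouncer pods
  type: rw                      # route to the primary; use "ro" for replicas
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "1000"   # assumption, size to the application's needs
```

A second Pooler with type: ro would provide the read-only endpoint. Clusters listed without PgBouncer simply have no Pooler resource.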

Database Instances

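| Cluster       | Instances | Storage | PgBouncer | Purpose                 |
|---------------|-----------|---------|-----------|-------------------------|
| gitea-pg      | 3         | 10Gi    | Yes       | Git repository metadata |
| authentik-db  | 3         | 5Gi     | Yes       | Identity/SSO data       |
| companions-db | 3         | 10Gi    | Yes       | Chat app data           |
| mlflow-db     | 1         | 5Gi     | No        | Experiment tracking     |
| kubeflow-db   | 1         | 10Gi    | No        | Pipeline metadata       |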

Connection Patterns

Service Discovery

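CNPG creates the -rw, -ro, and -r services for each cluster. The pooler services exist only where a Pooler resource is deployed and are named after that Pooler; the names below assume Poolers called <cluster>-pooler-rw and <cluster>-pooler-ro.

| Service             | Purpose                               |
|---------------------|---------------------------------------|
| <cluster>-rw        | Read-write (primary only)             |
| <cluster>-ro        | Read-only (replicas only)             |
| <cluster>-r         | Read (any instance, primary included) |
| <cluster>-pooler-rw | PgBouncer read-write                  |
| <cluster>-pooler-ro | PgBouncer read-only                   |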

Application Configuration

# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
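
To keep the password out of the manifest, the URL can be assembled from a Secret at pod start. The sketch below assumes the application user's credentials live in the gitea-pg-app Secret, which is the name CNPG gives the application-user Secret it generates by default; a cluster bootstrapped with a custom Secret would reference that one instead. The variable names DB_USER and DB_PASSWORD are hypothetical.

```yaml
# Fragment of a Deployment's container spec (not a complete manifest)
env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: gitea-pg-app      # assumed: CNPG-generated application Secret
        key: username
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: gitea-pg-app
        key: password
  # Kubernetes expands $(VAR) from variables defined earlier in this list
  - name: DATABASE_URL
    value: "postgresql://$(DB_USER):$(DB_PASSWORD)@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

The expansion is a plain string substitution, so a password containing URL-reserved characters would need to be stored URL-encoded or passed to the application as separate fields.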

Credentials via External Secrets

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
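
One way to have CNPG use these credentials is to reference the resulting Secret when the cluster is bootstrapped. The fragments below are a sketch under assumptions: the database and owner name app are hypothetical, the username stored in Vault must match the owner, and the Secret is rendered as kubernetes.io/basic-auth because that is the type used in CNPG's documentation for this purpose.

```yaml
# ExternalSecret fragment: render the target as a basic-auth Secret
spec:
  target:
    name: app-db-credentials
    template:
      type: kubernetes.io/basic-auth
---
# Cluster fragment: create the application database with those credentials
spec:
  bootstrap:
    initdb:
      database: app             # hypothetical database name
      owner: app                # must equal the "username" key in the Secret
      secret:
        name: app-db-credentials
```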

High Availability

Automatic Failover

  • CNPG monitors primary health continuously
  • If the primary fails, CNPG promotes a replica automatically
  • Applications reconnect through the -rw service, which is repointed to the new primary
  • Typical failover time: 10-30 seconds

Replica Synchronization

  • Streaming replication from primary to replicas
  • Synchronous replication available for zero data loss (trade-off: higher commit latency)
  • Default: asynchronous, so a failover can lose transactions the promoted replica had not yet received
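
For a cluster where the asynchronous default is not acceptable, synchronous replication is a Cluster-level setting. This is a minimal sketch, assuming a recent CNPG release that supports the postgresql.synchronous stanza (older releases use minSyncReplicas and maxSyncReplicas instead); requiring one synchronous replica is an illustrative choice, not a decision recorded in this ADR.

```yaml
# Cluster fragment: every commit waits for at least one replica
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any               # quorum-based: any replica may acknowledge
      number: 1                 # number of synchronous replicas required
```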

Backup Strategy

Continuous WAL Archiving

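  • Write-Ahead Log segments archived to S3 as they are completed
  • Point-in-time recovery capability
  • RPO: bounded by the last archived WAL segment (seconds under steady write load; with CNPG's default archive_timeout of 5 minutes, up to about 5 minutes when idle)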

Base Backups

  • Frequency: Daily
  • Retention: 7 days
  • Destination: S3-compatible (MinIO/Backblaze)
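
The daily base backup can be declared with a ScheduledBackup resource. The sketch below assumes the app-db cluster from the template; the resource name and the 03:00 start time are illustrative. Note that CNPG schedules use six cron fields, with seconds first.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily            # hypothetical name
spec:
  schedule: "0 0 3 * * *"       # seconds minutes hours day month weekday
  backupOwnerReference: self    # Backup objects are owned by this schedule
  cluster:
    name: app-db
```

Retention is not set here; it comes from retentionPolicy in the cluster's backup section.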

Recovery Testing

  • Periodic restore to test cluster
  • Validate backup integrity
  • Document recovery procedure
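
A restore test can be expressed as a throwaway Cluster that bootstraps from the object store instead of running initdb. This is a sketch under assumptions: the name app-db-restore-test and the single instance are illustrative, and the credentials Secret is the one from the template above.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test     # hypothetical test cluster
spec:
  instances: 1
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db            # refers to the externalClusters entry below
  externalClusters:
    - name: app-db              # must match the original cluster's name
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

The test cluster deliberately has no backup section, so it cannot write into the production archive. Adding a recoveryTarget under bootstrap.recovery would exercise point-in-time recovery rather than a restore to the latest available WAL.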

Monitoring

Prometheus Metrics

  • Connection count and pool utilization
  • Transaction rate and latency
  • Replication lag
  • Disk usage and WAL generation

Grafana Dashboard

CNPG provides an official Grafana dashboard covering:

  • Cluster health overview
  • Per-instance metrics
  • Replication status
  • Backup job history

Alerts

- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

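# Metric names below follow CNPG's default monitoring queries
- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag > 30   # lag is reported in seconds
  for: 5m
  labels:
    severity: warning

- alert: PostgreSQLConnectionsHigh
  # namespace/pod labels are assumed to be added by the PodMonitor scrape
  expr: |
    sum by (namespace, pod) (cnpg_backends_total)
      / on (namespace, pod) cnpg_pg_settings_setting{name="max_connections"} > 0.8
  for: 5m
  labels:
    severity: warning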
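
With the Prometheus Operator, rules like these are delivered as a PrometheusRule resource. The following sketch wraps the first alert only; the resource name, namespace, group name, and annotation text are assumptions, and the metadata may need labels matching the Prometheus instance's rule selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alerts             # hypothetical name
  namespace: cnpg-system        # assumption: kept next to the operator
spec:
  groups:
    - name: cloudnative-pg
      rules:
        - alert: PostgreSQLDown
          expr: cnpg_collector_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "CNPG instance {{ $labels.pod }} is not reporting as up"
```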

When NOT to Use CloudNativePG

| Scenario                 | Alternative                |
|--------------------------|----------------------------|
| Simple app, no HA needed | Embedded SQLite            |
| MySQL/MariaDB required   | Application-specific chart |
| Massive scale            | External managed database  |
| Non-relational data      | Redis/Valkey, MongoDB      |

PostgreSQL Version Policy

  • Use latest stable major version (currently 17)
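  • Minor version updates: applied by bumping the pinned image tag (e.g. 17.2); the rollout then proceeds without manual approval, including the primary (primaryUpdateStrategy: unsupervised)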
  • Major version upgrades: manual with testing

Future Enhancements

  1. Cross-cluster replication - DR site replica
  2. Logical replication - Selective table sync between clusters
  3. TimescaleDB extension - Time-series optimization for metrics
  4. PgVector extension - Vector storage alternative to Milvus

References