# Database Strategy with CloudNativePG

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications

## Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?

## Decision Drivers

* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale

## Considered Options

1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**

## Decision Outcome

Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.

### Positive Consequences

* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* PgBouncer connection pooling through the operator's `Pooler` resource
* Prometheus metrics and Grafana dashboards included
* CNPG is a CNCF Sandbox project and actively maintained

### Negative Consequences

* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          CNPG Operator                          │
│                     (cnpg-system namespace)                     │
└────────────────────────────┬────────────────────────────────────┘
                             │ Manages
                             ▼
┌──────────────────┬─────────────────┬─────────────────────────────┐
│                  │                 │                             │
▼                  ▼                 ▼                             ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   gitea-pg   │ │ authentik-db │ │companions-db │ │  mlflow-db   │
│ (3 instances)│ │ (3 instances)│ │ (3 instances)│ │ (1 instance) │
│              │ │              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │ └──────────┘ │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │              │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │              │
│ │ PgBouncer│ │ │ │ PgBouncer│ │ │ │ PgBouncer│ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
       │                  │                 │                │
       └──────────────────┼─────────────────┼────────────────┘
                          │                 │
                    ┌─────▼─────┐     ┌─────▼─────┐
                    │ Longhorn  │     │ Longhorn  │
                    │   PVCs    │     │  Backups  │
                    └───────────┘     └───────────┘
```

## Cluster Configuration Template

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName:
    ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3
  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring (has the operator create a PodMonitor for this cluster)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
---
# PgBouncer is deployed through a separate Pooler resource;
# the Cluster spec has no PgBouncer switch of its own.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```

## Database Instances

| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |

## Connection Patterns

### Service Discovery

CNPG creates the `-rw`, `-ro`, and `-r` services for each cluster. The pooler services exist only for clusters that have a `Pooler` resource and take that resource's name:

| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |

### Application Configuration

```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

### Credentials via External Secrets

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key:
          kv/data/app-db
        property: password
```

## High Availability

### Automatic Failover

- CNPG monitors primary health continuously
- If the primary fails, a replica is promoted automatically
- Applications reconnect through the `-rw` service, which follows the new primary
- Typical failover time: 10-30 seconds

### Replica Synchronization

- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: higher write latency)
- Default: asynchronous, so a failover can lose transactions the promoted replica had not yet received; this RPO is accepted for these workloads

## Backup Strategy

### Continuous WAL Archiving

- Write-Ahead Log segments archived to S3-compatible object storage
- Point-in-time recovery capability
- RPO: up to one unarchived WAL segment; CNPG's default `archive_timeout` of 5 minutes bounds how long a segment stays open

### Base Backups

- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)

### Recovery Testing

- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure

## Monitoring

### Prometheus Metrics

- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation

### Grafana Dashboard

CNPG provides an official dashboard covering:

- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history

### Alerts

```yaml
# Metric names follow CNPG's default monitoring queries; verify them against your exporter.
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning

- alert: PostgreSQLConnectionsHigh
  expr: sum by (namespace, pod) (cnpg_backends_total) / max by (namespace, pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning
```

## When NOT to Use CloudNativePG

| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |

## PostgreSQL Version Policy

- Use latest stable major version (currently 17)
- Minor version updates: rolled out automatically once `imageName` is bumped (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual with testing

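For the manual major-version path, one option is CNPG's database import, which bootstraps a new cluster on the target major version from a logical dump of the old one. The manifest below is a minimal sketch under stated assumptions, not a tested procedure: the cluster names `app-db-new` and `app-db-old`, the database name `app`, and the `app-db-old-superuser` Secret are hypothetical, and it assumes superuser access is enabled on the source cluster.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-new            # hypothetical name for the upgraded cluster
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2   # target major version
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    initdb:
      import:
        type: microservice    # import a single database via pg_dump/pg_restore
        databases:
          - app               # hypothetical database name
        source:
          externalCluster: app-db-old
  externalClusters:
    - name: app-db-old        # hypothetical source cluster on the older major version
      connectionParameters:
        host: app-db-old-rw   # read-write service of the source cluster
        user: postgres
        dbname: app
      password:
        name: app-db-old-superuser   # assumes superuser access is enabled on the source
        key: password
```

Because the import is a logical copy, writes to the old cluster should be stopped before it starts, and applications are repointed at the new cluster's services once the restore has been validated.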
## Future Enhancements

1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **pgvector extension** - Vector storage alternative to Milvus

## References

* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)