Database Strategy with CloudNativePG
- Status: accepted
- Date: 2026-02-04
- Deciders: Billy
- Technical Story: Standardize PostgreSQL deployment for stateful applications
Context and Problem Statement
Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use different database solutions, creating operational complexity.
How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
Decision Drivers
- Operational simplicity - single operator to learn and manage
- High availability - automatic failover for critical databases
- Backup integration - consistent backup strategy across all databases
- GitOps compatibility - declarative database provisioning
- Resource efficiency - don't over-provision for homelab scale
Considered Options
1. CloudNativePG for PostgreSQL
2. Helm charts per application (Bitnami PostgreSQL)
3. External managed database (RDS-style)
4. SQLite where possible + single shared PostgreSQL
Decision Outcome
Chosen option: Option 1 - CloudNativePG for PostgreSQL
CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
Positive Consequences
- Single operator manages all PostgreSQL instances
- Declarative Cluster CRD for GitOps deployment
- Automatic failover with minimal data loss
- Built-in PgBouncer for connection pooling
- Prometheus metrics and Grafana dashboards included
- CNPG is a CNCF Sandbox project and actively maintained
Negative Consequences
- PostgreSQL only (no MySQL/MariaDB support)
- Operator adds resource overhead
- Learning curve for CNPG-specific features
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          CNPG Operator                          │
│                     (cnpg-system namespace)                     │
└────────────────────────────────┬────────────────────────────────┘
                                 │ Manages
       ┌────────────────┬────────┴───────┬────────────────┐
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   gitea-pg   │ │ authentik-db │ │companions-db │ │  mlflow-db   │
│(3 instances) │ │(3 instances) │ │(3 instances) │ │ (1 instance) │
│              │ │              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │ └──────────┘ │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │              │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │              │
│ │ PgBouncer│ │ │ │ PgBouncer│ │ │ │ PgBouncer│ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       └────────────────┼────────────────┼────────────────┘
                        │                │
                  ┌─────▼─────┐    ┌─────▼─────┐
                  │ Longhorn  │    │ Longhorn  │
                  │   PVCs    │    │  Backups  │
                  └───────────┘    └───────────┘
Cluster Configuration Template
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn
  # Monitoring (enablePodMonitor requires the Prometheus Operator PodMonitor CRD)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries
  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
---
# PgBouncer for connection pooling. CNPG deploys PgBouncer through a
# separate Pooler resource, not through fields on the Cluster spec.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  instances: 1   # pooler pod count; adjust per cluster
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
Database Instances
| Cluster | Instances | Storage | PgBouncer | Purpose |
|---|---|---|---|---|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
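For the single-instance rows, a minimal sketch of what a cluster such as `mlflow-db` might look like. It reuses only fields from the template above; the sizing comes from the table, and everything else is left at CNPG defaults. With one instance there is no replica to promote, so recovery depends entirely on backups.

```yaml
# Sketch only: single-instance cluster without a PgBouncer pooler.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: mlflow-db
spec:
  instances: 1          # primary only; no automatic failover target
  storage:
    size: 5Gi           # from the Database Instances table
    storageClass: longhorn
```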
Connection Patterns
Service Discovery
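CNPG creates the `-rw`, `-ro`, and `-r` services automatically for each cluster; the pooler services exist only where a PgBouncer pooler is deployed:
| Service | Purpose |
|---|---|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (replicas only) |
| `<cluster>-r` | Read (any instance, primary included) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |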
Application Configuration
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
Credentials via External Secrets
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
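One way an application could consume the resulting `app-db-credentials` Secret without embedding the password in its manifest is sketched below. This is a container `env` fragment for a hypothetical Deployment; the `default` namespace and the `app` database name are placeholders, and the host follows the `<cluster>-pooler-rw` naming from the service table.

```yaml
# Sketch only: env section of a hypothetical application container.
env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: app-db-credentials
        key: username
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: app-db-credentials
        key: password
  # Kubernetes expands $(VAR) from variables defined earlier in this list.
  - name: DATABASE_URL
    value: "postgresql://$(DB_USER):$(DB_PASSWORD)@app-db-pooler-rw.default.svc:5432/app"
```

The expansion is plain string substitution, so a password containing URL-reserved characters would need to be stored percent-encoded or passed to the application as separate variables.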
High Availability
Automatic Failover
- CNPG monitors primary health continuously
- If the primary fails, CNPG promotes a replica automatically
- Applications reconnect through the `-rw` service, which is repointed at the new primary
- Typical failover time: 10-30 seconds
Replica Synchronization
- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous; a failover can lose transactions the promoted replica had not yet received
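If a cluster ever needs the synchronous option, a sketch of how it might be enabled follows. It assumes a recent CNPG release that supports the `.spec.postgresql.synchronous` stanza; older releases used `minSyncReplicas` and `maxSyncReplicas` instead, so check the installed operator version first.

```yaml
# Sketch only: quorum-based synchronous replication on the template cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # quorum: any one standby may acknowledge the commit
      number: 1     # commits wait for one synchronous standby
  storage:
    size: 10Gi
    storageClass: longhorn
```

Each commit then waits for a standby to acknowledge, which is the latency trade-off noted above.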
Backup Strategy
Continuous WAL Archiving
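- WAL segments archived continuously to S3-compatible object storage
- Point-in-time recovery capability
- RPO: up to the last archived WAL segment; CNPG's default `archive_timeout` of 5 minutes bounds the gap on a quiet database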
Base Backups
- Frequency: Daily
- Retention: 7 days
- Destination: S3-compatible (MinIO/Backblaze)
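The daily base backup could be declared with a `ScheduledBackup` resource along these lines. The resource name and the 02:00 start time are illustrative choices, not part of the decision; note that CNPG's schedule uses a six-field cron expression with seconds first.

```yaml
# Sketch only: daily base backup for the template cluster.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily            # hypothetical name
spec:
  schedule: "0 0 2 * * *"       # sec min hour day month weekday: daily at 02:00:00
  backupOwnerReference: self    # Backup objects are owned by this ScheduledBackup
  cluster:
    name: app-db
```

The 7-day retention is enforced by `retentionPolicy` on the cluster's backup configuration, not by this resource.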
Recovery Testing
- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
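A restore test could take the form of a throwaway cluster bootstrapped from the object store, as sketched here. The cluster name is hypothetical, the bucket path and credentials are copied from the template, and an `endpointURL` would also be needed for MinIO or another non-AWS target.

```yaml
# Sketch only: restore app-db's backups into a separate test cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test     # hypothetical name
spec:
  instances: 1
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db            # refers to the externalClusters entry below
      # Optional point-in-time target (illustrative timestamp):
      # recoveryTarget:
      #   targetTime: "2026-02-04 03:00:00+00"
  externalClusters:
    - name: app-db              # matches the original cluster's backup folder
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

Once the test cluster reports healthy, a few application-level queries against it confirm the data is usable before the cluster is deleted.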
Monitoring
Prometheus Metrics
- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation
Grafana Dashboard
CNPG provides an official dashboard covering:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history
Alerts
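- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical
# Lag in seconds, as exported by CNPG's default monitoring queries
- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning
# Backends in use vs. max_connections; verify metric names against the deployed queries
- alert: PostgreSQLConnectionsHigh
  expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning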
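If the Prometheus side of the stack runs the Prometheus Operator (an assumption; this record does not say), the rules above would be delivered as a `PrometheusRule` resource. The sketch below wraps the first alert; the resource name, namespace, and annotation text are illustrative.

```yaml
# Sketch only: assumes the Prometheus Operator CRDs are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alerts             # hypothetical name
  namespace: monitoring         # hypothetical namespace
spec:
  groups:
    - name: cnpg
      rules:
        - alert: PostgreSQLDown
          expr: cnpg_collector_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "CNPG instance {{ $labels.pod }} is not reporting as up"
```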
When NOT to Use CloudNativePG
| Scenario | Alternative |
|---|---|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |
PostgreSQL Version Policy
- Use the latest stable major version (currently 17)
- Minor version updates: automatic (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual with testing
Future Enhancements
- Cross-cluster replication - DR site replica
- Logical replication - Selective table sync between clusters
- TimescaleDB extension - Time-series optimization for metrics
- pgvector extension - Vector storage alternative to Milvus