# Database Strategy with CloudNativePG
* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications
## Context and Problem Statement
Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use a different database solution, creating operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?
## Decision Drivers
* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale
## Considered Options
1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**
## Decision Outcome
Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) provides a Kubernetes-native PostgreSQL operator with HA, automatic failover, connection pooling (PgBouncer), and integrated backups.
### Positive Consequences
* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* Built-in PgBouncer support for connection pooling (via the `Pooler` resource)
* Prometheus metrics and Grafana dashboards included
* CNPG is a CNCF Sandbox project and actively maintained
### Negative Consequences
* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features
## Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                           CNPG Operator                            │
│                      (cnpg-system namespace)                       │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │ Manages
       ┌─────────────────┬────────┴────────┬─────────────────┐
       │                 │                 │                 │
       ▼                 ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   gitea-pg   │  │ authentik-db │  │companions-db │  │  mlflow-db   │
│(3 instances) │  │(3 instances) │  │(3 instances) │  │ (1 instance) │
│              │  │              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │ └──────────┘ │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │              │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │              │
│ │ PgBouncer│ │  │ │ PgBouncer│ │  │ │ PgBouncer│ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
       │                 │                 │                 │
       └─────────────────┼─────────────────┼─────────────────┘
                         │                 │
                   ┌─────▼─────┐     ┌─────▼─────┐
                   │ Longhorn  │     │ Longhorn  │
                   │   PVCs    │     │  Backups  │
                   └───────────┘     └───────────┘
```
## Cluster Configuration Template
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
  # PgBouncer pooling is not a field of the Cluster resource; it is
  # configured through a separate Pooler resource that references this
  # cluster (poolMode: transaction, default_pool_size: "25").
  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn
  # Monitoring (PodMonitor creation is toggled with enablePodMonitor)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries
  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      # Non-AWS targets such as MinIO also need endpointURL set here
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    # retentionPolicy belongs to backup, not to barmanObjectStore
    retentionPolicy: "7d"
```
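
CloudNativePG deploys PgBouncer through a dedicated `Pooler` resource that points at a cluster, rather than through a field on the `Cluster` itself. The following is a minimal sketch of a read-write pooler for the template's `app-db` cluster, carrying over the transaction pool mode and pool size of 25. The resource name `app-db-pooler-rw` and the pod count of 2 are assumptions chosen for illustration.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  # Assumed name; the Service exposed by the pooler takes the same name
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db          # Cluster this pooler fronts
  instances: 2            # Assumed number of PgBouncer pods
  type: rw                # Route to the primary; use "ro" for replicas
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```

Applications would then connect to the `app-db-pooler-rw` Service on port 5432 instead of `app-db-rw`. A second `Pooler` with `type: ro` would be needed for pooled read-only traffic.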
## Database Instances
| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |
## Connection Patterns
### Service Discovery
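CNPG creates the `-rw`, `-ro`, and `-r` services for each cluster. The pooler services exist only for clusters with a `Pooler` resource, and each takes the name of its `Pooler`; the names below follow the `<cluster>-pooler-<type>` convention used in this homelab:
| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (hot standby replicas only) |
| `<cluster>-r` | Read (any instance, including the primary) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |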
### Application Configuration
```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```
### Credentials via External Secrets
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```
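
To keep the password out of the application manifest, the synced Secret can be referenced from the container's environment. The fragment below is a sketch of a container `env` list for a hypothetical application Deployment; the namespace `app`, the database name `app`, and the pooler Service name `app-db-pooler-rw` are assumptions, not values taken from a real deployment.

```yaml
# Fragment of spec.template.spec.containers[].env in a hypothetical Deployment
env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: app-db-credentials   # Secret created by the ExternalSecret
        key: username
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: app-db-credentials
        key: password
  # $(VAR) expands variables defined earlier in the same env list
  - name: DATABASE_URL
    value: "postgresql://$(DB_USER):$(DB_PASSWORD)@app-db-pooler-rw.app.svc:5432/app"
```

Kubernetes substitutes the values verbatim, so a password containing URL-reserved characters such as `@` or `/` would need to be stored URL-encoded for the connection string to parse.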
## High Availability
### Automatic Failover
- CNPG monitors primary health continuously
- If primary fails, automatic promotion of replica
- Application reconnection via service abstraction
- Typical failover time: 10-30 seconds
### Replica Synchronization
- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous with acceptable RPO
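
For a database where zero data loss matters more than write latency, synchronous replication can be enabled per cluster. This sketch assumes a CNPG release that supports the `postgresql.synchronous` stanza (1.24 or later); the choice of one quorum-based synchronous standby is illustrative.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # Quorum-based: any standby may acknowledge the commit
      number: 1     # Commits wait for one standby, leaving one spare
  storage:
    size: 10Gi
    storageClass: longhorn
```

With three instances and `number: 1`, the cluster can lose one replica and still accept writes, while every acknowledged commit exists on at least two nodes.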
## Backup Strategy
### Continuous WAL Archiving
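- Write-Ahead Log segments archived to S3-compatible object storage as they are completed
- Point-in-time recovery capability
- RPO: bounded by `archive_timeout` (5 minutes by default in CNPG); typically lower under steady write load, since segments fill and ship sooner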
### Base Backups
- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
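
The daily base backup can be declared with a `ScheduledBackup` resource. The sketch below assumes the `app-db` cluster from the template; the resource name and the 02:00 run time are illustrative choices. Note that CNPG schedules use six cron fields, with seconds first.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily          # Assumed name
spec:
  schedule: "0 0 2 * * *"     # sec min hour day month weekday: 02:00:00 daily
  backupOwnerReference: self  # Backup objects are owned by this schedule
  cluster:
    name: app-db
```

The 7-day retention is enforced by the cluster's `retentionPolicy`, not by the schedule.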
### Recovery Testing
- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
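
A restore test can be expressed as a throwaway cluster that bootstraps from the object store. This is a sketch under the template's assumptions (bucket path and credentials Secret as in the template); the name `app-db-restore-test` and the commented-out target time are hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test   # Assumed name for the temporary cluster
spec:
  instances: 1
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db          # Refers to the externalClusters entry below
      # Uncomment for point-in-time recovery (hypothetical timestamp):
      # recoveryTarget:
      #   targetTime: "2026-02-04 08:00:00+00"
  externalClusters:
    - name: app-db            # Matches the origin cluster's name in the store
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

The test cluster deliberately has no `backup` section, so it cannot write into the production backup path. Once the restore has been validated, deleting the resource removes the test instance.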
## Monitoring
### Prometheus Metrics
- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation
### Grafana Dashboard
CNPG provides an official Grafana dashboard covering:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history
### Alerts
```yaml
# Metric names assume CNPG's default monitoring queries
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical
- alert: PostgreSQLReplicationLag
  # Lag is reported in seconds
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning
- alert: PostgreSQLConnectionsHigh
  expr: >-
    sum by (pod) (cnpg_backends_total)
    / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning
```
## When NOT to Use CloudNativePG
| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |
## PostgreSQL Version Policy
- Use latest stable major version (currently 17)
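- Minor version updates: triggered by bumping the `imageName` tag; the rolling update, including the primary, then proceeds automatically (`primaryUpdateStrategy: unsupervised`)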
- Major version upgrades: manual with testing
## Future Enhancements
1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **PgVector extension** - Vector storage alternative to Milvus
## References
* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)