# Database Strategy with CloudNativePG

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications

## Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Each could use a different database solution, which would add operational complexity.

How do we standardize database deployment while providing production-grade reliability with minimal operational overhead?

## Decision Drivers

* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale

## Considered Options

1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**

## Decision Outcome

Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) is a Kubernetes-native PostgreSQL operator that provides HA with automatic failover, connection pooling (PgBouncer), and integrated backups.

### Positive Consequences

* Single operator manages all PostgreSQL instances
* Declarative `Cluster` CRD for GitOps deployment
* Automatic failover with minimal data loss
* PgBouncer connection pooling managed by the operator through the `Pooler` resource
* Prometheus metrics and Grafana dashboards included
* CNPG is a CNCF Sandbox project and actively maintained

### Negative Consequences

* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          CNPG Operator                          │
│                     (cnpg-system namespace)                     │
└────────────────────────────────┬────────────────────────────────┘
                                 │ Manages
                                 ▼
       ┌────────────────┬────────────────┬────────────────┐
       │                │                │                │
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   gitea-pg   │ │ authentik-db │ │companions-db │ │  mlflow-db   │
│(3 instances) │ │(3 instances) │ │(3 instances) │ │ (1 instance) │
│              │ │              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │ └──────────┘ │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │              │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │              │
│ │ PgBouncer│ │ │ │ PgBouncer│ │ │ │ PgBouncer│ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
       │                │                │                │
       └────────────────┼────────────────┼────────────────┘
                        │                │
                  ┌─────▼─────┐    ┌─────▼─────┐
                  │ Longhorn  │    │ Longhorn  │
                  │   PVCs    │    │  Backups  │
                  └───────────┘    └───────────┘
```

## Cluster Configuration Template

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3

  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"

  # Connection pooling: PgBouncer is not part of the Cluster spec.
  # CNPG deploys it through a separate Pooler resource that references
  # this cluster (transaction pool mode, default pool size 25).

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```
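
PgBouncer pooling for the template cluster can be sketched as a `Pooler` resource, which CNPG manages separately from the `Cluster`. The resource name `app-db-pooler-rw` and the instance count are assumptions chosen to match the service naming used elsewhere in this ADR; the pool mode and pool size carry over the values intended for the template.

```yaml
# Minimal sketch of a read-write PgBouncer pooler for the app-db cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-rw   # assumed name; the Service is named after the Pooler
spec:
  cluster:
    name: app-db           # Cluster this pooler fronts
  instances: 2             # assumed; number of PgBouncer pods
  type: rw                 # route to the primary
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```

A second `Pooler` with `type: ro` would be needed for a pooled read-only endpoint.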

## Database Instances

| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |

## Connection Patterns

### Service Discovery

CNPG creates the `-rw`, `-ro`, and `-r` services for each cluster. The pooler services exist only where a `Pooler` resource is deployed and take the name of that resource:

| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |

### Application Configuration

```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

### Credentials via External Secrets

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```
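
One way to consume the synced secret is to hand it to CNPG when the database is first created, so the application owner gets the password stored in Vault. The fragment below is a sketch: the database and owner name `app` are placeholders, and the owner is assumed to match the `username` value in the secret.

```yaml
# Sketch: bootstrap the application database with externally managed credentials.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    initdb:
      database: app              # placeholder database name
      owner: app                 # placeholder; should match the secret's username
      secret:
        name: app-db-credentials # Secret produced by the ExternalSecret
```

CNPG's documentation shows this secret with type `kubernetes.io/basic-auth`, so the `ExternalSecret` may need `target.template.type` set accordingly.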

## High Availability

### Automatic Failover

- CNPG monitors primary health continuously
- If the primary fails, a replica is promoted automatically
- Applications reconnect through the `-rw` service, which follows the current primary
- Typical failover time: 10-30 seconds

### Replica Synchronization

- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: higher commit latency)
- Default: asynchronous, accepting the loss of transactions not yet replicated at failover
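
For a cluster where zero data loss matters more than commit latency, synchronous replication could be enabled with a fragment like the one below. This is a sketch that assumes a CNPG release supporting the `.spec.postgresql.synchronous` stanza; older releases used different fields.

```yaml
# Fragment of a Cluster spec: quorum-based synchronous replication.
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # any N standbys may confirm (quorum), as opposed to "first"
      number: 1     # commits wait for one standby to acknowledge
```

With three instances and `number: 1`, one replica can be down without blocking writes.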
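
## Backup Strategy

### Continuous WAL Archiving

- Write-Ahead Log segments archived continuously to S3-compatible storage
- Point-in-time recovery capability
- RPO: at most the last unarchived WAL segment; `archive_timeout` (5 minutes by default in CNPG) caps how long a partially filled segment waits before being archived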

### Base Backups

- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
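
The daily base backup is not triggered by the `Cluster` resource itself; in CNPG it is requested through a `ScheduledBackup`. A minimal sketch follows, where the resource name and the 02:00 run time are assumptions:

```yaml
# Sketch: daily base backup of the app-db cluster.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily        # assumed name
spec:
  # Six-field cron expression (seconds first): every day at 02:00:00
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: app-db
```

Retention stays governed by `retentionPolicy` on the cluster's backup configuration.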

### Recovery Testing

- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
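
A restore test can be sketched as a throwaway single-instance cluster bootstrapped from the backup object store. The cluster name `app-db-restore-test` and the target timestamp are placeholders; the bucket path and credential secret are taken from the configuration template in this ADR.

```yaml
# Sketch: restore app-db backups into a disposable test cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test   # placeholder name
spec:
  instances: 1
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db
      # Optional point-in-time target (placeholder); omit to replay all archived WAL
      recoveryTarget:
        targetTime: "2026-02-01 03:00:00+00"
  externalClusters:
    - name: app-db            # matches the original cluster's name in the object store
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

Once the test cluster reports healthy and a few sanity queries pass, it can be deleted.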

## Monitoring

### Prometheus Metrics

- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation

### Grafana Dashboard

CNPG provides an official dashboard covering:

- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history

### Alerts

```yaml
# Metric names assume CNPG's default monitoring queries are loaded;
# verify them against the exporter's /metrics output.
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

- alert: PostgreSQLReplicationLag
  # Lag is reported in seconds
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning

- alert: PostgreSQLConnectionsHigh
  # Backends in use as a fraction of max_connections, per instance
  expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning
```

## When NOT to Use CloudNativePG

| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |

## PostgreSQL Version Policy

- Use latest stable major version (currently 17)
- Minor version updates: rolled out automatically once `imageName` is bumped (`primaryUpdateStrategy: unsupervised` lets the operator update the primary without manual approval)
- Major version upgrades: manual with testing

## Future Enhancements

1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **pgvector extension** - Vector storage alternative to Milvus

## References

* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)