docs: add ADRs 0025-0028 for infrastructure patterns

- 0025: Observability stack (Prometheus + ClickStack + OTEL)
- 0026: Tiered storage strategy (Longhorn + NFS)
- 0027: Database strategy (CloudNativePG for PostgreSQL)
- 0028: Authentik SSO strategy (OIDC/SAML identity provider)

decisions/0027-database-strategy.md
@@ -0,0 +1,294 @@
# Database Strategy with CloudNativePG

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications

## Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, and potentially more. Left to its own defaults, each application could bring a different database solution, which multiplies operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?

## Decision Drivers

* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale

## Considered Options

1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**

## Decision Outcome

Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) is a Kubernetes-native PostgreSQL operator that provides high availability with automatic failover, connection pooling (PgBouncer), and integrated backups.

### Positive Consequences

* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* Built-in PgBouncer support for connection pooling (via the `Pooler` resource)
* Built-in Prometheus metrics exporter and an official Grafana dashboard
* CNPG is a CNCF Sandbox project and actively maintained

### Negative Consequences

* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          CNPG Operator                          │
│                     (cnpg-system namespace)                     │
└────────────────────────────┬────────────────────────────────────┘
                             │ Manages
       ┌────────────────┬────┴───────────┬────────────────┐
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   gitea-pg   │ │ authentik-db │ │companions-db │ │  mlflow-db   │
│(3 instances) │ │(3 instances) │ │(3 instances) │ │ (1 instance) │
│              │ │              │ │              │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │ │ │ Primary  │ │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │ └──────────┘ │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │ │              │
│ │ Replica  │ │ │ │ Replica  │ │ │ │ Replica  │ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │              │
│ │ PgBouncer│ │ │ │ PgBouncer│ │ │ │ PgBouncer│ │ │              │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │              │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
       │                │                │                │
       └────────────────┼────────────────┼────────────────┘
                        │                │
                  ┌─────▼─────┐    ┌─────▼─────┐
                  │ Longhorn  │    │ Longhorn  │
                  │   PVCs    │    │  Backups  │
                  └───────────┘    └───────────┘
```

## Cluster Configuration Template

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3

  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"

  # Connection pooling (PgBouncer) is not part of the Cluster spec.
  # CNPG configures it through a separate Pooler resource
  # (poolMode: transaction, default_pool_size: "25").

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```

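CNPG configures PgBouncer through its own `Pooler` resource that points at a cluster, rather than through fields on the `Cluster` itself. The following is a minimal sketch of a read-write pooler for the `app-db` template, carrying over the transaction pool mode and pool size of 25 from this ADR. The resource name `app-db-pooler-rw` and the `instances: 2` count are assumptions chosen for illustration.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  # Assumed name; the Service that applications connect to takes this name
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db          # the Cluster this pooler fronts
  instances: 2            # assumed PgBouncer pod count, not specified in this ADR
  type: rw                # route to the primary; use "ro" for replicas
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
```

A second `Pooler` with `type: ro` would provide the read-only pooled endpoint.
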
## Database Instances

| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |

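For the single-instance rows, the same template shrinks considerably. As a rough sketch, `mlflow-db` with the values from the table could look like the manifest below. The image tag is copied from the template, and monitoring and backup settings are left out here only for brevity.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: mlflow-db
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  # One instance means no standby to promote: a node failure causes
  # downtime until the pod is rescheduled.
  instances: 1
  storage:
    size: 5Gi
    storageClass: longhorn
```
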
## Connection Patterns

### Service Discovery

CNPG creates the `-rw`, `-ro`, and `-r` services for each cluster. Pooler services are named after their `Pooler` resource; the `<cluster>-pooler-*` names below are the convention used in this homelab:

| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (any replica) |
| `<cluster>-r` | Read (any instance) |
| `<cluster>-pooler-rw` | PgBouncer read-write |
| `<cluster>-pooler-ro` | PgBouncer read-only |

### Application Configuration

```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

### Credentials via External Secrets

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```

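One way to make the cluster use these credentials is to reference the synced secret when the database is first initialized. The sketch below assumes a database and owner both named `app` (hypothetical names), and assumes the `username` value stored in Vault matches that owner. CNPG documents this secret as type `kubernetes.io/basic-auth`, so the ExternalSecret target may need a template that sets that type.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    initdb:
      database: app       # hypothetical database name
      owner: app          # should match the `username` key in the secret
      secret:
        name: app-db-credentials   # Secret produced by the ExternalSecret
```
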
## High Availability

### Automatic Failover

- CNPG monitors primary health continuously
- If the primary fails, the operator automatically promotes a replica
- Applications reconnect through the `-rw` service, which is repointed at the new primary
- Typical failover time: 10-30 seconds

### Replica Synchronization

- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: higher write latency)
- Default: asynchronous replication, so transactions not yet received by a replica can be lost on failover (accepted RPO trade-off)

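If a particular database later needs the zero-data-loss option, synchronous replication is enabled on the `Cluster` itself. The following is a sketch using the quorum-based `synchronous` stanza found in recent CNPG releases. Older releases use `minSyncReplicas` and `maxSyncReplicas` instead, so check the operator version before relying on it.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: longhorn
  postgresql:
    synchronous:
      method: any   # quorum-based: any standby may acknowledge
      number: 1     # commits wait for at least one synchronous standby
```

With three instances and `number: 1`, one replica can be down without blocking writes.
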
## Backup Strategy

### Continuous WAL Archiving

- Write-Ahead Log (WAL) segments are archived continuously to S3-compatible object storage
- Point-in-time recovery capability
- RPO: bounded by the most recently archived WAL segment (seconds under steady write load; capped by `archive_timeout`, which CNPG sets to 5 minutes by default)

### Base Backups

- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)

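The daily base backup is driven by a `ScheduledBackup` resource that targets the cluster. The sketch below is one possible definition; the resource name and the 02:00 run time are assumptions. Note that CNPG schedules use a six-field cron expression whose first field is seconds.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily        # assumed name
spec:
  # Six fields: second minute hour day-of-month month day-of-week
  schedule: "0 0 2 * * *"   # every day at 02:00:00 (assumed time)
  backupOwnerReference: self
  cluster:
    name: app-db
```

Retention is handled by the `retentionPolicy` on the cluster's backup configuration, not by this resource.
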
### Recovery Testing

- Periodically restore a backup into a separate test cluster
- Validate backup integrity
- Document the recovery procedure

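A restore test can be expressed as a throwaway cluster that bootstraps from the object store instead of running `initdb`. The following sketch assumes the backup location and credentials from the configuration template; the cluster name `app-db-restore-test` is hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test   # hypothetical name for the test cluster
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2   # same major version as the source
  instances: 1
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db          # refers to the externalClusters entry below
  externalClusters:
    - name: app-db            # matches the name of the backed-up cluster
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

Adding a `recoveryTarget` under `bootstrap.recovery` would exercise point-in-time recovery rather than a restore to the latest available WAL.
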
## Monitoring

### Prometheus Metrics

- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation

### Grafana Dashboard

CNPG provides an official dashboard:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history

### Alerts

```yaml
# Metric names follow CNPG's default monitoring queries;
# verify them against the exporter output of the deployed operator version.
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

# Replication lag in seconds, reported by each replica
- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning

# Fraction of max_connections in use, per instance pod
- alert: PostgreSQLConnectionsHigh
  expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning
```

## When NOT to Use CloudNativePG

| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |

## PostgreSQL Version Policy

- Use latest stable major version (currently 17)
- Minor version updates: rolled out automatically, primary included, once the image tag in `imageName` changes (`primaryUpdateStrategy: unsupervised`)
- Major version upgrades: manual with testing

## Future Enhancements

1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **pgvector extension** - Vector storage alternative to Milvus

## References

* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)