# Database Strategy with CloudNativePG

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Standardize PostgreSQL deployment for stateful applications

## Context and Problem Statement

Multiple applications in the homelab require relational databases: Gitea, Authentik, Companions, MLflow, Kubeflow, and potentially more. Letting each application bring its own database solution would create operational complexity.

How do we standardize database deployment while providing production-grade reliability and minimal operational overhead?

## Decision Drivers

* Operational simplicity - single operator to learn and manage
* High availability - automatic failover for critical databases
* Backup integration - consistent backup strategy across all databases
* GitOps compatibility - declarative database provisioning
* Resource efficiency - don't over-provision for homelab scale

## Considered Options

1. **CloudNativePG for PostgreSQL**
2. **Helm charts per application (Bitnami PostgreSQL)**
3. **External managed database (RDS-style)**
4. **SQLite where possible + single shared PostgreSQL**

## Decision Outcome

Chosen option: **Option 1 - CloudNativePG for PostgreSQL**

CloudNativePG (CNPG) is a Kubernetes-native PostgreSQL operator that provides HA with automatic failover, PgBouncer connection pooling through its `Pooler` resource, and integrated backups.

### Positive Consequences

* Single operator manages all PostgreSQL instances
* Declarative Cluster CRD for GitOps deployment
* Automatic failover with minimal data loss
* PgBouncer connection pooling managed by the operator through the `Pooler` resource
* Built-in Prometheus metrics exporter and an official Grafana dashboard
* CNPG is a CNCF Sandbox project and actively maintained

### Negative Consequences

* PostgreSQL only (no MySQL/MariaDB support)
* Operator adds resource overhead
* Learning curve for CNPG-specific features

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        CNPG Operator                             │
│                     (cnpg-system namespace)                      │
└────────────────────────────┬────────────────────────────────────┘
                             │ Manages
                             ▼
┌──────────────────┬─────────────────┬─────────────────────────────┐
│                  │                 │                             │
▼                  ▼                 ▼                             ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  gitea-pg    │  │ authentik-db │  │companions-db │  │  mlflow-db   │
│ (3 instances)│  │ (3 instances)│  │ (3 instances)│  │ (1 instance) │
│              │  │              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │  │ │ Primary  │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │ └──────────┘ │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │  │              │
│ │ Replica  │ │  │ │ Replica  │ │  │ │ Replica  │ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │  │              │
│ │ PgBouncer│ │  │ │ PgBouncer│ │  │ │ PgBouncer│ │  │              │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │  │              │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
      │                  │                 │                │
      └──────────────────┼─────────────────┼────────────────┘
                         │                 │
                   ┌─────▼─────┐     ┌─────▼─────┐
                   │  Longhorn │     │ Longhorn  │
                   │   PVCs    │     │  Backups  │
                   └───────────┘     └───────────┘
```

## Cluster Configuration Template

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  description: "Application PostgreSQL Cluster"
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2
  instances: 3

  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: "256MB"
      effective_cache_size: "768MB"
      work_mem: "16MB"
      max_connections: "200"
      
  # Connection pooling is not configured on the Cluster itself.
  # CNPG deploys PgBouncer through a separate Pooler resource that
  # references this cluster by name; pool mode and pool size are set there.

  # Storage on Longhorn
  storage:
    size: 10Gi
    storageClass: longhorn

  # Monitoring (creates a PodMonitor; requires the Prometheus Operator CRDs)
  monitoring:
    enablePodMonitor: true
    customQueriesConfigMap:
      - name: cnpg-default-monitoring
        key: queries

  # Backup configuration
  backup:
    barmanObjectStore:
      destinationPath: "s3://backups/postgres/"
      s3Credentials:
        accessKeyId:
          name: postgres-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "7d"
```
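
CNPG configures PgBouncer through a `Pooler` resource that points at an existing `Cluster`, not through fields on the `Cluster` spec. The manifest below is a minimal sketch of a read-write pooler for the template above. The name `app-db-pooler-rw`, the instance count, and the `max_client_conn` value are illustrative assumptions, not settings taken from the homelab.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  # Assumed name; the Service that applications connect to gets the same name
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db          # must match the Cluster's metadata.name
  instances: 2            # assumed; number of PgBouncer pods
  type: rw                # route to the primary; use "ro" for replicas
  pgbouncer:
    poolMode: transaction
    parameters:
      # PgBouncer settings are passed as strings
      default_pool_size: "25"
      max_client_conn: "500"   # assumed value
```

A second `Pooler` with `type: ro` would be needed for a pooled read-only endpoint.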

## Database Instances

| Cluster | Instances | Storage | PgBouncer | Purpose |
|---------|-----------|---------|-----------|---------|
| `gitea-pg` | 3 | 10Gi | Yes | Git repository metadata |
| `authentik-db` | 3 | 5Gi | Yes | Identity/SSO data |
| `companions-db` | 3 | 10Gi | Yes | Chat app data |
| `mlflow-db` | 1 | 5Gi | No | Experiment tracking |
| `kubeflow-db` | 1 | 10Gi | No | Pipeline metadata |

## Connection Patterns

### Service Discovery

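CNPG creates three services for each cluster. Pooler services exist only when a `Pooler` resource is defined, and each takes the name of its `Pooler`; the `<cluster>-pooler-*` names below are this homelab's naming convention.

| Service | Purpose |
|---------|---------|
| `<cluster>-rw` | Read-write (primary only) |
| `<cluster>-ro` | Read-only (hot standby replicas only) |
| `<cluster>-r` | Read (any instance, including the primary) |
| `<cluster>-pooler-rw` | PgBouncer read-write (from a `Pooler` with `type: rw`) |
| `<cluster>-pooler-ro` | PgBouncer read-only (from a `Pooler` with `type: ro`) |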

### Application Configuration

```yaml
# Application config using CNPG service
DATABASE_URL: "postgresql://user:password@gitea-pg-pooler-rw.gitea.svc:5432/giteadb"
```

### Credentials via External Secrets

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: app-db-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: app-db-credentials
  data:
    - secretKey: username
      remoteRef:
        key: kv/data/app-db
        property: username
    - secretKey: password
      remoteRef:
        key: kv/data/app-db
        property: password
```
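
One way to consume the synced secret is to hand it to CNPG when the cluster is first initialized, so the application owner role is created with the Vault-managed password. The fragment below is a sketch under assumptions: the database and owner name `app` are hypothetical, and the `username` key in the secret has to match `owner`. The CNPG documentation describes this secret as type `kubernetes.io/basic-auth`, so the `ExternalSecret` may need a `target.template.type` to produce that type.

```yaml
# Fragment of a Cluster spec (not a complete manifest)
spec:
  bootstrap:
    initdb:
      database: app        # hypothetical database name
      owner: app           # hypothetical role; must equal the secret's username
      secret:
        name: app-db-credentials   # Secret produced by the ExternalSecret above
```

`bootstrap.initdb` only applies when the cluster is created; it does not rotate the password of an existing role.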

## High Availability

### Automatic Failover

- CNPG monitors primary health continuously
- If the primary fails, the operator promotes a replica automatically
- Applications reconnect through the `<cluster>-rw` service, which is repointed at the new primary
- Typical failover time: 10-30 seconds

### Replica Synchronization

- Streaming replication from primary to replicas
- Synchronous replication available for zero data loss (trade-off: latency)
- Default: asynchronous; a failover can lose transactions that were committed on the primary but had not yet reached the promoted replica
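
For a database where that trade-off should go the other way, synchronous replication can be switched on per cluster. The fragment below is a sketch that assumes a CNPG release with the `postgresql.synchronous` stanza (1.24 or later, as far as the documentation indicates); older releases used `minSyncReplicas` and `maxSyncReplicas` instead.

```yaml
# Fragment of a Cluster spec (not a complete manifest)
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any    # quorum-based: any one standby may acknowledge
      number: 1      # commits wait for one synchronous standby
```

With `number: 1` on a three-instance cluster, one replica can be down without blocking writes.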

## Backup Strategy

### Continuous WAL Archiving

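- Write-Ahead Log segments archived to S3 as each segment is completed
- Point-in-time recovery capability
- RPO: bounded by `archive_timeout`, which CNPG sets to 5 minutes by default; lower under steady write load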

### Base Backups

- **Frequency:** Daily
- **Retention:** 7 days
- **Destination:** S3-compatible (MinIO/Backblaze)
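
The daily base backup is not implied by the `backup` section of the `Cluster`; it needs a `ScheduledBackup` resource. The sketch below assumes a 02:00 run time and the name `app-db-daily`, both chosen for illustration. Note that CNPG's schedule format has six fields, with seconds first.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-daily           # assumed name
spec:
  # Six-field cron: second minute hour day-of-month month day-of-week
  schedule: "0 0 2 * * *"      # assumed time: every day at 02:00:00
  backupOwnerReference: self   # Backup objects are owned by this resource
  cluster:
    name: app-db
```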

### Recovery Testing

- Periodic restore to test cluster
- Validate backup integrity
- Document recovery procedure
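
A restore test can be expressed as a throwaway cluster that bootstraps from the same object store. This is a sketch under assumptions: the name `app-db-restore-test` is hypothetical, and the bucket path and credentials are copied from the template earlier in this document.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test    # hypothetical test cluster
spec:
  instances: 1                 # a single instance is enough to validate a restore
  storage:
    size: 10Gi
    storageClass: longhorn
  bootstrap:
    recovery:
      source: app-db           # refers to the externalClusters entry below
      # Optional point-in-time target; omit to replay all archived WAL
      # recoveryTarget:
      #   targetTime: "2026-02-01 00:00:00+00"
  externalClusters:
    - name: app-db
      barmanObjectStore:
        destinationPath: "s3://backups/postgres/"
        serverName: app-db     # folder of the original cluster in the bucket
        s3Credentials:
          accessKeyId:
            name: postgres-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-backup-creds
            key: SECRET_ACCESS_KEY
```

Deleting the test cluster afterwards leaves the backups untouched, because the test cluster defines no `backup` section of its own and never writes to the bucket.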

## Monitoring

### Prometheus Metrics

- Connection count and pool utilization
- Transaction rate and latency
- Replication lag
- Disk usage and WAL generation

### Grafana Dashboard

CNPG provides an official dashboard covering:
- Cluster health overview
- Per-instance metrics
- Replication status
- Backup job history

### Alerts

```yaml
- alert: PostgreSQLDown
  expr: cnpg_collector_up == 0
  for: 5m
  labels:
    severity: critical

- alert: PostgreSQLReplicationLag
  expr: cnpg_pg_replication_lag > 30
  for: 5m
  labels:
    severity: warning

- alert: PostgreSQLConnectionsHigh
  expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
  for: 5m
  labels:
    severity: warning
```

## When NOT to Use CloudNativePG

| Scenario | Alternative |
|----------|-------------|
| Simple app, no HA needed | Embedded SQLite |
| MySQL/MariaDB required | Application-specific chart |
| Massive scale | External managed database |
| Non-relational data | Redis/Valkey, MongoDB |

## PostgreSQL Version Policy

- Use latest stable major version (currently 17)
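- Minor version updates: bump the pinned `imageName` tag in Git; the operator updates replicas first and, with `primaryUpdateStrategy: unsupervised`, updates the primary without manual approval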
- Major version upgrades: manual with testing

## Future Enhancements

1. **Cross-cluster replication** - DR site replica
2. **Logical replication** - Selective table sync between clusters
3. **TimescaleDB extension** - Time-series optimization for metrics
4. **pgvector extension** - Vector storage alternative to Milvus

## References

* [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
* [CNPG GitHub](https://github.com/cloudnative-pg/cloudnative-pg)
* [PostgreSQL High Availability](https://www.postgresql.org/docs/current/high-availability.html)