Files
homelab-design/decisions/0033-data-analytics-platform.md

268 lines
9.2 KiB
Markdown

# Data Analytics Platform Architecture
* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering
## Context and Problem Statement
The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:
- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights
How do we build a scalable analytics platform that supports both batch and real-time processing?
## Decision Drivers
* Modern lakehouse architecture (SQL + streaming)
* Real-time and batch processing capabilities
* Cost-effective on homelab hardware
* Integration with existing observability stack
* Support for ML feature pipelines
* Open table formats for interoperability
## Considered Options
1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave**
2. **Traditional DWH: ClickHouse only**
3. **Cloud-native: Databricks/Snowflake (SaaS)**
4. **Minimal: PostgreSQL with TimescaleDB**
## Decision Outcome
Chosen option: **Option 1 - Modern Lakehouse Architecture**
A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.
### Positive Consequences
* Unified batch and streaming on same data
* Git-like versioning of tables via Nessie
* Standard SQL across all engines
* Decoupled compute and storage
* Open formats prevent vendor lock-in
* ML feature engineering support
### Negative Consequences
* Complex multi-component architecture
* Higher resource requirements
* Steeper learning curve
* Multiple operators to maintain
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Envoy Logs │ │ Application │ │ Inference │ │ Prometheus │ │
│ │ (HTTPRoute) │ │ Telemetry │ │ Metrics │ │ Metrics │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
│ │ │ │
└─────────────────┼─────────────────┼─────────────────┘
▼ │
┌───────────────────────┐ │
│ NATS JetStream │◄──────┘
│ (Event Streaming) │
└───────────┬───────────┘
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
│ Apache Flink │ │ RisingWave│ │ Apache Spark │
│ (Streaming ETL)│ │ (Stream │ │ (Batch ETL) │
│ │ │ SQL) │ │ │
└────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
│ │ │
└────────────────┼─────────────────┘
│ Write Iceberg Tables
┌───────────────────────┐
│ Nessie │
│ (Iceberg Catalog) │
│ Git-like versioning │
└───────────┬───────────┘
┌───────────────────────┐
│ NFS Storage │
│ (candlekeep:/lakehouse)│
└───────────────────────┘
┌───────────────────────┐
│ Trino │
│ (Interactive Query) │
│ + Grafana Dashboards │
└───────────────────────┘
```
## Component Details
### Apache Nessie (Iceberg Catalog)
**Purpose:** Git-like version control for data tables
```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
versionStoreType: ROCKSDB # Embedded storage
catalog:
iceberg:
configDefaults:
warehouse: s3://lakehouse/
```
**Features:**
- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
### Apache Spark (Batch Processing)
**Purpose:** Large-scale batch ETL and ML feature engineering
```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
type: Python
mode: cluster
sparkConf:
spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```
**Use Cases:**
- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
### Apache Flink (Stream Processing)
**Purpose:** Real-time event processing
```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
job:
jarURI: local:///opt/flink/jobs/httproute-analytics.jar
parallelism: 2
```
**Use Cases:**
- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
### RisingWave (Streaming SQL)
**Purpose:** Simplified streaming SQL for real-time dashboards
```sql
-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
window_start,
route_name,
COUNT(*) as request_count,
AVG(response_time_ms) as avg_latency
FROM httproute_events
GROUP BY
TUMBLE(event_time, INTERVAL '5 MINUTES'),
route_name;
```
**Use Cases:**
- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
### Trino (Interactive Query)
**Purpose:** Fast SQL queries across Iceberg tables
```yaml
# Trino coordinator + 2 workers
catalogs:
iceberg: |
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie.uri=http://nessie:19120/api/v1
```
**Use Cases:**
- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
## Data Flow: HTTPRoute Analytics
```
Envoy Gateway
▼ (access logs via OTEL)
NATS JetStream
├─► Flink Job (streaming)
│ │
│ ▼
│ Iceberg Table: httproute_raw
└─► Spark Job (nightly batch)
Iceberg Table: httproute_daily_agg
Trino ─► Grafana Dashboard
```
## Storage Layout
```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│ └── analytics/
│ ├── httproute_raw/ # Raw events (partitioned by date)
│ ├── httproute_daily_agg/ # Daily aggregates
│ ├── inference_metrics/ # ML inference stats
│ └── feature_store/ # ML features
└── checkpoints/
├── flink/ # Flink savepoints
└── spark/ # Spark checkpoints
```
## Resource Allocation
| Component | Replicas | CPU | Memory |
|-----------|----------|-----|--------|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |
## Links
* [Apache Iceberg](https://iceberg.apache.org/)
* [Project Nessie](https://projectnessie.org/)
* [Apache Flink](https://flink.apache.org/)
* [RisingWave](https://risingwave.com/)
* [Trino](https://trino.io/)
* Related: [ADR-0025](0025-observability-stack.md) - Observability Stack