Data Analytics Platform Architecture
- Status: accepted
- Date: 2026-02-05
- Deciders: Billy
- Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering
Context and Problem Statement
The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:
- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights
How do we build a scalable analytics platform that supports both batch and real-time processing?
Decision Drivers
- Modern lakehouse architecture (SQL + streaming)
- Real-time and batch processing capabilities
- Cost-effective on homelab hardware
- Integration with existing observability stack
- Support for ML feature pipelines
- Open table formats for interoperability
Considered Options
- Lakehouse: Nessie + Spark + Flink + Trino + RisingWave
- Traditional DWH: ClickHouse only
- Cloud-native: Databricks/Snowflake (SaaS)
- Minimal: PostgreSQL with TimescaleDB
Decision Outcome
Chosen option: Option 1 - Modern Lakehouse Architecture
A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.
Positive Consequences
- Unified batch and streaming on same data
- Git-like versioning of tables via Nessie
- Standard SQL across all engines
- Decoupled compute and storage
- Open formats prevent vendor lock-in
- ML feature engineering support
Negative Consequences
- Complex multi-component architecture
- Higher resource requirements
- Steeper learning curve
- Multiple operators to maintain
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Envoy Logs │ │ Application │ │ Inference │ │ Prometheus │ │
│ │ (HTTPRoute) │ │ Telemetry │ │ Metrics │ │ Metrics │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
│ │ │ │
└─────────────────┼─────────────────┼─────────────────┘
▼ │
┌───────────────────────┐ │
│ NATS JetStream │◄──────┘
│ (Event Streaming) │
└───────────┬───────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
│ Apache Flink │ │ RisingWave│ │ Apache Spark │
│ (Streaming ETL)│ │ (Stream │ │ (Batch ETL) │
│ │ │ SQL) │ │ │
└────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
│ │ │
└────────────────┼─────────────────┘
│ Write Iceberg Tables
▼
┌───────────────────────┐
│ Nessie │
│ (Iceberg Catalog) │
│ Git-like versioning │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ NFS Storage │
│ (candlekeep:/lakehouse)│
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Trino │
│ (Interactive Query) │
│ + Grafana Dashboards │
└───────────────────────┘
Component Details
Apache Nessie (Iceberg Catalog)
Purpose: Git-like version control for data tables
```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/
```
Features:
- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
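The branch semantics can be exercised directly from Spark SQL when the Nessie SQL extensions are enabled; a sketch, where the `etl_dev` branch and the staging table are illustrative names, not part of this config:

```sql
-- Create an isolated branch, write into it, then merge back (names illustrative)
CREATE BRANCH IF NOT EXISTS etl_dev IN nessie;
USE REFERENCE etl_dev IN nessie;

-- Writes in this session now land on etl_dev, not main
INSERT INTO nessie.analytics.httproute_raw
SELECT * FROM staging_events;

-- Publish all changes atomically once validated
MERGE BRANCH etl_dev INTO main IN nessie;
```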
Apache Spark (Batch Processing)
Purpose: Large-scale batch ETL and ML feature engineering
```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```
Use Cases:
- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
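The daily aggregation can be expressed in Spark SQL against the `nessie` catalog configured above; a sketch, with table and column names that are illustrative rather than taken from this repo:

```sql
-- Hypothetical nightly rollup: previous day's raw events into daily aggregates
INSERT INTO nessie.analytics.httproute_daily_agg
SELECT
  DATE(event_time)                          AS day,
  route_name,
  COUNT(*)                                  AS request_count,
  AVG(response_time_ms)                     AS avg_latency_ms,
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95_latency_ms
FROM nessie.analytics.httproute_raw
WHERE DATE(event_time) = DATE_SUB(CURRENT_DATE(), 1)
GROUP BY DATE(event_time), route_name;
```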
Apache Flink (Stream Processing)
Purpose: Real-time event processing
```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2
```
Use Cases:
- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
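The streaming ETL path can also be written in Flink SQL rather than a JAR job; a sketch of an Iceberg sink through the Nessie catalog, assuming the Iceberg/Nessie Flink runtime is on the classpath (source table and columns are illustrative):

```sql
-- Register the Nessie-backed Iceberg catalog in Flink SQL
CREATE CATALOG nessie WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.nessie.NessieCatalog',
  'uri' = 'http://nessie:19120/api/v1',
  'warehouse' = 's3://lakehouse/'
);

-- Continuous insert from the upstream source (fed from NATS JetStream)
INSERT INTO nessie.analytics.httproute_raw
SELECT event_time, route_name, status_code, response_time_ms
FROM httproute_events;
```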
RisingWave (Streaming SQL)
Purpose: Simplified streaming SQL for real-time dashboards
```sql
-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
  window_start,
  route_name,
  COUNT(*) AS request_count,
  AVG(response_time_ms) AS avg_latency
FROM TUMBLE(httproute_events, event_time, INTERVAL '5 MINUTES')
GROUP BY window_start, route_name;
```
Use Cases:
- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
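The `httproute_events` relation behind the materialized view would be backed by a NATS JetStream source; a sketch, with connector options that are illustrative and should be checked against RisingWave's NATS connector documentation:

```sql
-- Sketch: NATS JetStream source feeding httproute_events (options illustrative)
CREATE SOURCE httproute_events (
  event_time       TIMESTAMP,
  route_name       VARCHAR,
  response_time_ms DOUBLE PRECISION
) WITH (
  connector  = 'nats',
  server_url = 'nats://nats:4222',
  subject    = 'httproute.events',
  stream     = 'httproute'
) FORMAT PLAIN ENCODE JSON;
```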
Trino (Interactive Query)
Purpose: Fast SQL queries across Iceberg tables
```yaml
# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie-catalog.uri=http://nessie:19120/api/v1
```
Use Cases:
- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
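For example, a cross-table join that the shared Iceberg tables make cheap in Trino (catalog and column names illustrative):

```sql
-- Ad-hoc: correlate route traffic with inference latency per day
SELECT
  a.day,
  a.route_name,
  a.request_count,
  i.avg_inference_ms
FROM iceberg.analytics.httproute_daily_agg AS a
JOIN iceberg.analytics.inference_metrics AS i
  ON a.day = i.day
ORDER BY a.request_count DESC
LIMIT 20;
```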
Data Flow: HTTPRoute Analytics
```
Envoy Gateway
      │
      ▼  (access logs via OTEL)
NATS JetStream
      │
      ├─► Flink Job (streaming)
      │         │
      │         ▼
      │   Iceberg Table: httproute_raw
      │
      └─► Spark Job (nightly batch)
                │
                ▼
          Iceberg Table: httproute_daily_agg
                │
                ▼
          Trino ─► Grafana Dashboard
```
Storage Layout
```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/          # Raw events (partitioned by date)
│       ├── httproute_daily_agg/    # Daily aggregates
│       ├── inference_metrics/      # ML inference stats
│       └── feature_store/          # ML features
└── checkpoints/
    ├── flink/                      # Flink savepoints
    └── spark/                      # Spark checkpoints
```
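The date partitioning noted for `httproute_raw` might be declared like this in Spark SQL, using Iceberg's hidden partitioning (schema illustrative):

```sql
CREATE TABLE IF NOT EXISTS nessie.analytics.httproute_raw (
  event_time       TIMESTAMP,
  route_name       STRING,
  status_code      INT,
  response_time_ms DOUBLE
)
USING iceberg
PARTITIONED BY (days(event_time));  -- hidden partitioning by day
```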
Resource Allocation
| Component | Replicas | CPU (cores) | Memory |
|---|---|---|---|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |
Links
- Apache Iceberg
- Project Nessie
- Apache Flink
- RisingWave
- Trino
- Related: ADR-0025 - Observability Stack