# Data Analytics Platform Architecture * Status: accepted * Date: 2026-02-05 * Deciders: Billy * Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering ## Context and Problem Statement The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for: - Traffic pattern analysis - Security anomaly detection - ML feature engineering - Cost optimization insights How do we build a scalable analytics platform that supports both batch and real-time processing? ## Decision Drivers * Modern lakehouse architecture (SQL + streaming) * Real-time and batch processing capabilities * Cost-effective on homelab hardware * Integration with existing observability stack * Support for ML feature pipelines * Open table formats for interoperability ## Considered Options 1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave** 2. **Traditional DWH: ClickHouse only** 3. **Cloud-native: Databricks/Snowflake (SaaS)** 4. **Minimal: PostgreSQL with TimescaleDB** ## Decision Outcome Chosen option: **Option 1 - Modern Lakehouse Architecture** A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL. ### Positive Consequences * Unified batch and streaming on same data * Git-like versioning of tables via Nessie * Standard SQL across all engines * Decoupled compute and storage * Open formats prevent vendor lock-in * ML feature engineering support ### Negative Consequences * Complex multi-component architecture * Higher resource requirements * Steeper learning curve * Multiple operators to maintain ## Architecture ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ DATA SOURCES │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Envoy Logs │ │ Application │ │ Inference │ │ Prometheus │ │ │ │ (HTTPRoute) │ │ Telemetry │ │ Metrics │ │ Metrics │ │ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ └─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘ │ │ │ │ └─────────────────┼─────────────────┼─────────────────┘ ▼ │ ┌───────────────────────┐ │ │ NATS JetStream │◄──────┘ │ (Event Streaming) │ └───────────┬───────────┘ │ ┌───────────────┼───────────────┐ │ │ │ ▼ ▼ ▼ ┌─────────────────┐ ┌───────────┐ ┌───────────────────┐ │ Apache Flink │ │ RisingWave│ │ Apache Spark │ │ (Streaming ETL)│ │ (Stream │ │ (Batch ETL) │ │ │ │ SQL) │ │ │ └────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘ │ │ │ └────────────────┼─────────────────┘ │ Write Iceberg Tables ▼ ┌───────────────────────┐ │ Nessie │ │ (Iceberg Catalog) │ │ Git-like versioning │ └───────────┬───────────┘ │ ▼ ┌───────────────────────┐ │ NFS Storage │ │ (candlekeep:/lakehouse)│ └───────────────────────┘ │ ▼ ┌───────────────────────┐ │ Trino │ │ (Interactive Query) │ │ + Grafana Dashboards │ └───────────────────────┘ ``` ## Component Details ### Apache Nessie (Iceberg Catalog) **Purpose:** Git-like version control for data tables ```yaml # HelmRelease: nessie # Version: 0.107.1 spec: versionStoreType: ROCKSDB # Embedded storage catalog: iceberg: configDefaults: warehouse: s3://lakehouse/ ``` **Features:** - Branch/tag data versions - Time travel queries - Multi-table transactions - Cross-engine compatibility ### Apache Spark (Batch Processing) **Purpose:** Large-scale batch ETL and ML feature engineering ```yaml # SparkApplication for HTTPRoute analytics apiVersion: sparkoperator.k8s.io/v1beta2 kind: SparkApplication spec: type: Python mode: cluster sparkConf: spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1 ``` **Use Cases:** - Daily HTTPRoute log aggregation - Feature engineering for ML - Historical data compaction ### Apache Flink (Stream Processing) **Purpose:** Real-time event processing ```yaml # Flink Kubernetes Operator # Version: 1.13.0 spec: job: jarURI: local:///opt/flink/jobs/httproute-analytics.jar parallelism: 2 ``` **Use Cases:** - Real-time traffic anomaly detection - Streaming ETL to Iceberg - Session windowing for user analytics ### RisingWave (Streaming SQL) **Purpose:** Simplified streaming SQL for real-time dashboards ```sql -- Materialized view for real-time traffic CREATE MATERIALIZED VIEW traffic_5min AS SELECT window_start, route_name, COUNT(*) as request_count, AVG(response_time_ms) as avg_latency FROM httproute_events GROUP BY TUMBLE(event_time, INTERVAL '5 MINUTES'), route_name; ``` **Use Cases:** - Real-time Grafana dashboards - Streaming aggregations - Alerting triggers ### Trino (Interactive Query) **Purpose:** Fast SQL queries across Iceberg tables ```yaml # Trino coordinator + 2 workers catalogs: iceberg: | connector.name=iceberg iceberg.catalog.type=nessie iceberg.nessie.uri=http://nessie:19120/api/v1 ``` **Use Cases:** - Ad-hoc analytics queries - Grafana data source for dashboards - Cross-table JOINs ## Data Flow: HTTPRoute Analytics ``` Envoy Gateway │ ▼ (access logs via OTEL) NATS JetStream │ ├─► Flink Job (streaming) │ │ │ ▼ │ Iceberg Table: httproute_raw │ └─► Spark Job (nightly batch) │ ▼ Iceberg Table: httproute_daily_agg │ ▼ Trino ─► Grafana Dashboard ``` ## Storage Layout ``` candlekeep:/kubernetes/lakehouse/ ├── warehouse/ │ └── analytics/ │ ├── httproute_raw/ # Raw events (partitioned by date) │ ├── httproute_daily_agg/ # Daily aggregates │ ├── inference_metrics/ # ML inference stats │ └── feature_store/ # ML features └── checkpoints/ ├── flink/ # Flink savepoints └── spark/ # Spark checkpoints ``` ## Resource Allocation | Component | Replicas | CPU | Memory | |-----------|----------|-----|--------| | Nessie | 1 | 0.5 | 512Mi | | Spark Operator | 1 | 0.2 | 256Mi | | Flink Operator | 1 | 0.2 | 256Mi | | Flink JobManager | 1 | 1 | 2Gi | | Flink TaskManager | 2 | 2 | 4Gi | | RisingWave | 1 | 2 | 4Gi | | Trino Coordinator | 1 | 1 | 2Gi | | Trino Worker | 2 | 2 | 4Gi | ## Links * [Apache Iceberg](https://iceberg.apache.org/) * [Project Nessie](https://projectnessie.org/) * [Apache Flink](https://flink.apache.org/) * [RisingWave](https://risingwave.com/) * [Trino](https://trino.io/) * Related: [ADR-0025](0025-observability-stack.md) - Observability Stack