homelab-design/decisions/0033-data-analytics-platform.md

Data Analytics Platform Architecture

  • Status: accepted
  • Date: 2026-02-05
  • Deciders: Billy
  • Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering

Context and Problem Statement

The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:

  • Traffic pattern analysis
  • Security anomaly detection
  • ML feature engineering
  • Cost optimization insights

How do we build a scalable analytics platform that supports both batch and real-time processing?

Decision Drivers

  • Modern lakehouse architecture (SQL + streaming)
  • Real-time and batch processing capabilities
  • Cost-effective on homelab hardware
  • Integration with existing observability stack
  • Support for ML feature pipelines
  • Open table formats for interoperability

Considered Options

  1. Lakehouse: Nessie + Spark + Flink + Trino + RisingWave
  2. Traditional DWH: ClickHouse only
  3. Cloud-native: Databricks/Snowflake (SaaS)
  4. Minimal: PostgreSQL with TimescaleDB

Decision Outcome

Chosen option: Option 1 - Modern Lakehouse Architecture

A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.

Positive Consequences

  • Unified batch and streaming on same data
  • Git-like versioning of tables via Nessie
  • Standard SQL across all engines
  • Decoupled compute and storage
  • Open formats prevent vendor lock-in
  • ML feature engineering support

Negative Consequences

  • Complex multi-component architecture
  • Higher resource requirements
  • Steeper learning curve
  • Multiple operators to maintain

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA SOURCES                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Envoy Logs   │  │ Application  │  │ Inference    │  │ Prometheus   │     │
│  │ (HTTPRoute)  │  │ Telemetry    │  │ Metrics      │  │ Metrics      │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
          │                 │                 │                 │
          └─────────────────┼─────────────────┼─────────────────┘
                            ▼                 │
              ┌───────────────────────┐       │
              │   NATS JetStream      │◄──────┘
              │   (Event Streaming)   │
              └───────────┬───────────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
│  Apache Flink   │ │ RisingWave│ │   Apache Spark    │
│  (Streaming ETL)│ │ (Stream   │ │   (Batch ETL)     │
│                 │ │  SQL)     │ │                   │
└────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
         │                │                 │
         └────────────────┼─────────────────┘
                          │ Write Iceberg Tables
                          ▼
              ┌───────────────────────┐
              │       Nessie          │
              │   (Iceberg Catalog)   │
              │   Git-like versioning │
              └───────────┬───────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │    NFS Storage        │
              │(candlekeep:/lakehouse)│
              └───────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │        Trino          │
              │  (Interactive Query)  │
              │  + Grafana Dashboards │
              └───────────────────────┘

Component Details

Apache Nessie (Iceberg Catalog)

Purpose: Git-like version control for data tables

# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/

Features:

  • Branch/tag data versions
  • Time travel queries
  • Multi-table transactions
  • Cross-engine compatibility
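
The branch-and-merge workflow can be sketched in Spark SQL, assuming the Nessie Spark SQL extensions are on the classpath and the `nessie` catalog is configured as in the Spark section below; the branch and table names are illustrative:

```sql
-- Create an isolated branch for an ETL run, write into it, then merge.
CREATE BRANCH IF NOT EXISTS etl_dev IN nessie FROM main;
USE REFERENCE etl_dev IN nessie;

-- Writes on this session now land on etl_dev, not main.
INSERT INTO nessie.analytics.httproute_daily_agg
SELECT * FROM staging_daily_agg;

-- Atomically publish the run's table changes to main.
MERGE BRANCH etl_dev INTO main IN nessie;
```

Consumers reading `main` never see partially written data; a failed run simply abandons its branch.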

Apache Spark (Batch Processing)

Purpose: Large-scale batch ETL and ML feature engineering

# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1

Use Cases:

  • Daily HTTPRoute log aggregation
  • Feature engineering for ML
  • Historical data compaction
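
The nightly aggregation can be expressed as Spark SQL against the `nessie` catalog; the schema and column names below are illustrative, not the actual table definitions:

```sql
-- Hypothetical daily rollup of raw HTTPRoute events into the aggregate table.
INSERT OVERWRITE nessie.analytics.httproute_daily_agg
SELECT
  DATE(event_time)                          AS day,
  route_name,
  COUNT(*)                                  AS request_count,
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95_latency_ms
FROM nessie.analytics.httproute_raw
WHERE DATE(event_time) = DATE_SUB(CURRENT_DATE, 1)
GROUP BY DATE(event_time), route_name;
```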

Apache Flink (Stream Processing)

Purpose: Real-time event processing

# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2

Use Cases:

  • Real-time traffic anomaly detection
  • Streaming ETL to Iceberg
  • Session windowing for user analytics
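
The streaming ETL leg can be sketched in Flink SQL, assuming a source table `httproute_events` has already been registered over NATS JetStream (connector and column names are illustrative):

```sql
-- Continuous insert from the registered stream into the Iceberg raw table.
-- Source-table registration (NATS connector, watermarks) is assumed upstream.
INSERT INTO nessie.analytics.httproute_raw
SELECT event_time, route_name, status_code, response_time_ms
FROM httproute_events;
```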

RisingWave (Streaming SQL)

Purpose: Simplified streaming SQL for real-time dashboards

-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT 
  window_start,
  route_name,
  COUNT(*) as request_count,
  AVG(response_time_ms) as avg_latency
FROM httproute_events
GROUP BY 
  TUMBLE(event_time, INTERVAL '5 MINUTES'),
  route_name;

Use Cases:

  • Real-time Grafana dashboards
  • Streaming aggregations
  • Alerting triggers

Trino (Interactive Query)

Purpose: Fast SQL queries across Iceberg tables

# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie.uri=http://nessie:19120/api/v1

Use Cases:

  • Ad-hoc analytics queries
  • Grafana data source for dashboards
  • Cross-table JOINs
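
An ad-hoc cross-table query through the `iceberg` catalog above might look like the following; the `inference_metrics` columns are illustrative:

```sql
-- Which routes drive the most traffic, and what does inference cost them?
SELECT a.day, a.route_name, a.request_count, m.avg_inference_ms
FROM iceberg.analytics.httproute_daily_agg AS a
JOIN iceberg.analytics.inference_metrics   AS m
  ON a.day = m.day AND a.route_name = m.route_name
ORDER BY a.request_count DESC
LIMIT 20;
```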

Data Flow: HTTPRoute Analytics

Envoy Gateway
    │
    ▼ (access logs via OTEL)
NATS JetStream
    │
    ├─► Flink Job (streaming)
    │       │
    │       ▼
    │   Iceberg Table: httproute_raw
    │
    └─► Spark Job (nightly batch)
            │
            ▼
        Iceberg Table: httproute_daily_agg
            │
            ▼
        Trino ─► Grafana Dashboard

Storage Layout

candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/        # Raw events (partitioned by date)
│       ├── httproute_daily_agg/  # Daily aggregates
│       ├── inference_metrics/    # ML inference stats
│       └── feature_store/        # ML features
└── checkpoints/
    ├── flink/                    # Flink savepoints
    └── spark/                    # Spark checkpoints
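
The date partitioning of `httproute_raw` noted above can be declared at table creation, e.g. in Spark SQL using Iceberg's hidden partition transforms (the column list is illustrative):

```sql
CREATE TABLE IF NOT EXISTS nessie.analytics.httproute_raw (
  event_time       TIMESTAMP,
  route_name       STRING,
  status_code      INT,
  response_time_ms DOUBLE
)
USING iceberg
PARTITIONED BY (days(event_time));  -- one partition per day, derived from event_time
```

With a transform rather than an explicit date column, queries filtering on `event_time` get partition pruning without writers having to maintain a separate partition key.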

Resource Allocation

Component          Replicas  CPU (cores)  Memory
Nessie             1         0.5          512Mi
Spark Operator     1         0.2          256Mi
Flink Operator     1         0.2          256Mi
Flink JobManager   1         1            2Gi
Flink TaskManager  2         2            4Gi
RisingWave         1         2            4Gi
Trino Coordinator  1         1            2Gi
Trino Worker       2         2            4Gi