Data Analytics Platform Architecture
- Status: accepted
- Date: 2026-02-05
- Deciders: Billy
- Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering
Context and Problem Statement
The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:
- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights
How do we build a scalable analytics platform that supports both batch and real-time processing?
Decision Drivers
- Modern lakehouse architecture (SQL + streaming)
- Real-time and batch processing capabilities
- Cost-effective on homelab hardware
- Integration with existing observability stack
- Support for ML feature pipelines
- Open table formats for interoperability
Considered Options
- Lakehouse: Nessie + Spark + Flink + Trino + RisingWave
- Traditional DWH: ClickHouse only
- Cloud-native: Databricks/Snowflake (SaaS)
- Minimal: PostgreSQL with TimescaleDB
Decision Outcome
Chosen option: Option 1 - Modern Lakehouse Architecture
A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.
Positive Consequences
- Unified batch and streaming on same data
- Git-like versioning of tables via Nessie
- Standard SQL across all engines
- Decoupled compute and storage
- Open formats prevent vendor lock-in
- ML feature engineering support
Negative Consequences
- Complex multi-component architecture
- Higher resource requirements
- Steeper learning curve
- Multiple operators to maintain
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Envoy Logs │ │ Application │ │ Inference │ │ Prometheus │ │
│ │ (HTTPRoute) │ │ Telemetry │ │ Metrics │ │ Metrics │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
│ │ │ │
└─────────────────┼─────────────────┼─────────────────┘
▼ │
┌───────────────────────┐ │
│ NATS JetStream │◄──────┘
│ (Event Streaming) │
└───────────┬───────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
│ Apache Flink │ │ RisingWave│ │ Apache Spark │
│ (Streaming ETL)│ │ (Stream │ │ (Batch ETL) │
│ │ │ SQL) │ │ │
└────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
│ │ │
└────────────────┼─────────────────┘
│ Write Iceberg Tables
▼
┌───────────────────────┐
│ Nessie │
│ (Iceberg Catalog) │
│ Git-like versioning │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ NFS Storage │
│ (candlekeep:/lakehouse)│
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Trino │
│ (Interactive Query) │
│ + Grafana Dashboards │
└───────────────────────┘
Component Details
Apache Nessie (Iceberg Catalog)
Purpose: Git-like version control for data tables
```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/
```
Features:
- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
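The branch semantics can be exercised directly from Spark SQL when the Nessie SQL extensions are enabled; a sketch, where the `etl_dev` branch and the staging table are illustrative names, not part of this config:

```sql
-- Create an isolated branch, write into it, then merge back (names illustrative)
CREATE BRANCH IF NOT EXISTS etl_dev IN nessie;
USE REFERENCE etl_dev IN nessie;

-- Writes in this session now land on etl_dev, not main
INSERT INTO nessie.analytics.httproute_raw
SELECT * FROM staging_events;

-- Publish all changes atomically once validated
MERGE BRANCH etl_dev INTO main IN nessie;
```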
Apache Spark (Batch Processing)
Purpose: Large-scale batch ETL and ML feature engineering
```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```
Use Cases:
- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
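The daily aggregation can be expressed in Spark SQL against the `nessie` catalog configured above; a sketch, with table and column names that are illustrative rather than taken from this repo:

```sql
-- Hypothetical nightly rollup: previous day's raw events into daily aggregates
INSERT INTO nessie.analytics.httproute_daily_agg
SELECT
  DATE(event_time)                          AS day,
  route_name,
  COUNT(*)                                  AS request_count,
  AVG(response_time_ms)                     AS avg_latency_ms,
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95_latency_ms
FROM nessie.analytics.httproute_raw
WHERE DATE(event_time) = DATE_SUB(CURRENT_DATE(), 1)
GROUP BY DATE(event_time), route_name;
```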
Apache Flink (Stream Processing)
Purpose: Real-time event processing
```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2
```
Use Cases:
- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
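The streaming ETL path can also be written in Flink SQL rather than a JAR job; a sketch of an Iceberg sink through the Nessie catalog, assuming the Iceberg/Nessie Flink runtime is on the classpath (source table and columns are illustrative):

```sql
-- Register the Nessie-backed Iceberg catalog in Flink SQL
CREATE CATALOG nessie WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.nessie.NessieCatalog',
  'uri' = 'http://nessie:19120/api/v1',
  'warehouse' = 's3://lakehouse/'
);

-- Continuous insert from the upstream source (fed from NATS JetStream)
INSERT INTO nessie.analytics.httproute_raw
SELECT event_time, route_name, status_code, response_time_ms
FROM httproute_events;
```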
RisingWave (Streaming SQL)
Purpose: Simplified streaming SQL for real-time dashboards
```sql
-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
  window_start,
  route_name,
  COUNT(*) AS request_count,
  AVG(response_time_ms) AS avg_latency
FROM TUMBLE(httproute_events, event_time, INTERVAL '5 MINUTES')
GROUP BY window_start, route_name;
```
Use Cases:
- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
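The `httproute_events` relation behind the materialized view would be backed by a NATS JetStream source; a sketch, with connector options that are illustrative and should be checked against RisingWave's NATS connector documentation:

```sql
-- Sketch: NATS JetStream source feeding httproute_events (options illustrative)
CREATE SOURCE httproute_events (
  event_time       TIMESTAMP,
  route_name       VARCHAR,
  response_time_ms DOUBLE PRECISION
) WITH (
  connector  = 'nats',
  server_url = 'nats://nats:4222',
  subject    = 'httproute.events',
  stream     = 'httproute'
) FORMAT PLAIN ENCODE JSON;
```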
Trino (Interactive Query)
Purpose: Fast SQL queries across Iceberg tables
```yaml
# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie-catalog.uri=http://nessie:19120/api/v1
```
Use Cases:
- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
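For example, a cross-table join that the shared Iceberg tables make cheap in Trino (catalog and column names illustrative):

```sql
-- Ad-hoc: correlate route traffic with inference latency per day
SELECT
  a.day,
  a.route_name,
  a.request_count,
  i.avg_inference_ms
FROM iceberg.analytics.httproute_daily_agg AS a
JOIN iceberg.analytics.inference_metrics AS i
  ON a.day = i.day
ORDER BY a.request_count DESC
LIMIT 20;
```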
Data Flow: HTTPRoute Analytics
```
Envoy Gateway
      │
      ▼  (access logs via OTEL)
NATS JetStream
      │
      ├─► Flink Job (streaming)
      │         │
      │         ▼
      │   Iceberg Table: httproute_raw
      │
      └─► Spark Job (nightly batch)
                │
                ▼
          Iceberg Table: httproute_daily_agg
                │
                ▼
          Trino ─► Grafana Dashboard
```
Storage Layout
```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/          # Raw events (partitioned by date)
│       ├── httproute_daily_agg/    # Daily aggregates
│       ├── inference_metrics/      # ML inference stats
│       └── feature_store/          # ML features
└── checkpoints/
    ├── flink/                      # Flink savepoints
    └── spark/                      # Spark checkpoints
```
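The date partitioning noted for `httproute_raw` might be declared like this in Spark SQL, using Iceberg's hidden partitioning (schema illustrative):

```sql
CREATE TABLE IF NOT EXISTS nessie.analytics.httproute_raw (
  event_time       TIMESTAMP,
  route_name       STRING,
  status_code      INT,
  response_time_ms DOUBLE
)
USING iceberg
PARTITIONED BY (days(event_time));  -- hidden partitioning by day
```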
Resource Allocation
| Component | Replicas | CPU (cores) | Memory |
|---|---|---|---|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |
Links
- Apache Iceberg
- Project Nessie
- Apache Flink
- RisingWave
- Trino
- Related: ADR-0025 - Observability Stack