# Data Analytics Platform Architecture

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering

## Context and Problem Statement

The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:

- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights

How do we build a scalable analytics platform that supports both batch and real-time processing?

## Decision Drivers

* Modern lakehouse architecture (SQL + streaming)
* Real-time and batch processing capabilities
* Cost-effective on homelab hardware
* Integration with existing observability stack
* Support for ML feature pipelines
* Open table formats for interoperability

## Considered Options

1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave**
2. **Traditional DWH: ClickHouse only**
3. **Cloud-native: Databricks/Snowflake (SaaS)**
4. **Minimal: PostgreSQL with TimescaleDB**

## Decision Outcome

Chosen option: **Option 1 - Modern Lakehouse Architecture**

A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.

### Positive Consequences

* Unified batch and streaming on same data
* Git-like versioning of tables via Nessie
* Standard SQL across all engines
* Decoupled compute and storage
* Open formats prevent vendor lock-in
* ML feature engineering support

### Negative Consequences

* Complex multi-component architecture
* Higher resource requirements
* Steeper learning curve
* Multiple operators to maintain

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA SOURCES                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Envoy Logs   │  │ Application  │  │ Inference    │  │ Prometheus   │     │
│  │ (HTTPRoute)  │  │ Telemetry    │  │ Metrics      │  │ Metrics      │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
          │                 │                 │                 │
          └─────────────────┼─────────────────┼─────────────────┘
                            ▼                 │
                ┌───────────────────────┐     │
                │    NATS JetStream     │◄────┘
                │   (Event Streaming)   │
                └───────────┬───────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
            ▼               ▼               ▼
  ┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
  │  Apache Flink   │ │ RisingWave│ │   Apache Spark    │
  │ (Streaming ETL) │ │ (Stream   │ │   (Batch ETL)     │
  │                 │ │  SQL)     │ │                   │
  └────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
           │                │                 │
           └────────────────┼─────────────────┘
                            │ Write Iceberg Tables
                            ▼
                ┌────────────────────────┐
                │         Nessie         │
                │   (Iceberg Catalog)    │
                │   Git-like versioning  │
                └───────────┬────────────┘
                            │
                            ▼
                ┌────────────────────────┐
                │      NFS Storage       │
                │ (candlekeep:/lakehouse)│
                └────────────────────────┘
                            │
                            ▼
                ┌────────────────────────┐
                │         Trino          │
                │  (Interactive Query)   │
                │  + Grafana Dashboards  │
                └────────────────────────┘
```


## Component Details

### Apache Nessie (Iceberg Catalog)

**Purpose:** Git-like version control for data tables

```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/
```

**Features:**
- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
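
The git-like workflow can be sketched with Nessie's Spark SQL extensions; the branch name and table here are illustrative, and the `nessie` catalog is the one configured for Spark elsewhere in this document:

```sql
-- Sketch only: branch, write, and merge via the Nessie Spark SQL extensions.
-- Writes on the branch are invisible to readers of main until merged.
CREATE BRANCH IF NOT EXISTS etl_experiment IN nessie;
USE REFERENCE etl_experiment IN nessie;

-- ... run ETL writes against nessie.analytics.* tables on the branch ...

-- Validate results, then publish atomically
MERGE BRANCH etl_experiment INTO main IN nessie;
```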

### Apache Spark (Batch Processing)

**Purpose:** Large-scale batch ETL and ML feature engineering

```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```

**Use Cases:**
- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
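
The daily aggregation might look like the following Spark SQL sketch; the column names are assumptions, not a fixed schema:

```sql
-- Sketch: nightly rollup of raw HTTPRoute events into daily aggregates
INSERT INTO nessie.analytics.httproute_daily_agg
SELECT
    date_trunc('day', event_time)             AS day,
    route_name,
    COUNT(*)                                  AS request_count,
    approx_percentile(response_time_ms, 0.95) AS p95_latency_ms
FROM nessie.analytics.httproute_raw
WHERE event_time >= date_sub(current_date(), 1)
GROUP BY date_trunc('day', event_time), route_name;
```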

### Apache Flink (Stream Processing)

**Purpose:** Real-time event processing

```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2
```

**Use Cases:**
- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
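
Session windowing might look like this in Flink SQL (legacy group-window syntax; `httproute_events` is an assumed source table registered over the NATS stream):

```sql
-- Sketch: per-route sessions that close after 10 minutes of inactivity
SELECT
    route_name,
    SESSION_START(event_time, INTERVAL '10' MINUTE) AS session_start,
    SESSION_END(event_time, INTERVAL '10' MINUTE)   AS session_end,
    COUNT(*)                                        AS requests
FROM httproute_events
GROUP BY route_name, SESSION(event_time, INTERVAL '10' MINUTE);
```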

### RisingWave (Streaming SQL)

**Purpose:** Simplified streaming SQL for real-time dashboards

```sql
-- Materialized view for real-time traffic
-- (RisingWave uses the TUMBLE table function in the FROM clause)
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
    window_start,
    route_name,
    COUNT(*) AS request_count,
    AVG(response_time_ms) AS avg_latency
FROM TUMBLE(httproute_events, event_time, INTERVAL '5 MINUTES')
GROUP BY
    window_start,
    route_name;
```

**Use Cases:**
- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
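
Dashboards and alert rules can then read the materialized view with plain Postgres-compatible SQL, for example:

```sql
-- Illustrative dashboard query against the traffic_5min view
SELECT window_start, route_name, request_count, avg_latency
FROM traffic_5min
WHERE window_start > NOW() - INTERVAL '1 hour'
ORDER BY request_count DESC
LIMIT 10;
```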

### Trino (Interactive Query)

**Purpose:** Fast SQL queries across Iceberg tables

```yaml
# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie.uri=http://nessie:19120/api/v1
```

**Use Cases:**
- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
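
A cross-table JOIN of the kind listed above might look like the following sketch; table schemas and column names are illustrative:

```sql
-- Sketch: correlate HTTP traffic with inference load per day
SELECT
    a.day,
    a.route_name,
    a.request_count,
    m.avg_inference_ms
FROM iceberg.analytics.httproute_daily_agg AS a
JOIN iceberg.analytics.inference_metrics   AS m
  ON a.day = m.day
ORDER BY a.day DESC
LIMIT 30;
```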

## Data Flow: HTTPRoute Analytics

```
Envoy Gateway
      │
      ▼ (access logs via OTEL)
NATS JetStream
      │
      ├─► Flink Job (streaming)
      │         │
      │         ▼
      │   Iceberg Table: httproute_raw
      │
      └─► Spark Job (nightly batch)
                │
                ▼
          Iceberg Table: httproute_daily_agg
                │
                ▼
          Trino ─► Grafana Dashboard
```

## Storage Layout

```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/         # Raw events (partitioned by date)
│       ├── httproute_daily_agg/   # Daily aggregates
│       ├── inference_metrics/     # ML inference stats
│       └── feature_store/         # ML features
└── checkpoints/
    ├── flink/                     # Flink savepoints
    └── spark/                     # Spark checkpoints
```

## Resource Allocation

| Component | Replicas | CPU | Memory |
|-----------|----------|-----|--------|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |

## Links

* [Apache Iceberg](https://iceberg.apache.org/)
* [Project Nessie](https://projectnessie.org/)
* [Apache Flink](https://flink.apache.org/)
* [RisingWave](https://risingwave.com/)
* [Trino](https://trino.io/)
* Related: [ADR-0025](0025-observability-stack.md) - Observability Stack