updating to match everything in my homelab.
This commit is contained in:
267
decisions/0033-data-analytics-platform.md
Normal file
267
decisions/0033-data-analytics-platform.md
Normal file
@@ -0,0 +1,267 @@
|
||||
# Data Analytics Platform Architecture
|
||||
|
||||
* Status: accepted
|
||||
* Date: 2026-02-05
|
||||
* Deciders: Billy
|
||||
* Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:
|
||||
- Traffic pattern analysis
|
||||
- Security anomaly detection
|
||||
- ML feature engineering
|
||||
- Cost optimization insights
|
||||
|
||||
How do we build a scalable analytics platform that supports both batch and real-time processing?
|
||||
|
||||
## Decision Drivers
|
||||
|
||||
* Modern lakehouse architecture (SQL + streaming)
|
||||
* Real-time and batch processing capabilities
|
||||
* Cost-effective on homelab hardware
|
||||
* Integration with existing observability stack
|
||||
* Support for ML feature pipelines
|
||||
* Open table formats for interoperability
|
||||
|
||||
## Considered Options
|
||||
|
||||
1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave**
|
||||
2. **Traditional DWH: ClickHouse only**
|
||||
3. **Cloud-native: Databricks/Snowflake (SaaS)**
|
||||
4. **Minimal: PostgreSQL with TimescaleDB**
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: **Option 1 - Modern Lakehouse Architecture**
|
||||
|
||||
A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.
|
||||
|
||||
### Positive Consequences
|
||||
|
||||
* Unified batch and streaming on same data
|
||||
* Git-like versioning of tables via Nessie
|
||||
* Standard SQL across all engines
|
||||
* Decoupled compute and storage
|
||||
* Open formats prevent vendor lock-in
|
||||
* ML feature engineering support
|
||||
|
||||
### Negative Consequences
|
||||
|
||||
* Complex multi-component architecture
|
||||
* Higher resource requirements
|
||||
* Steeper learning curve
|
||||
* Multiple operators to maintain
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ DATA SOURCES │
|
||||
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
||||
│ │ Envoy Logs │ │ Application │ │ Inference │ │ Prometheus │ │
|
||||
│ │ (HTTPRoute) │ │ Telemetry │ │ Metrics │ │ Metrics │ │
|
||||
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
|
||||
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
|
||||
│ │ │ │
|
||||
└─────────────────┼─────────────────┼─────────────────┘
|
||||
▼ │
|
||||
┌───────────────────────┐ │
|
||||
│ NATS JetStream │◄──────┘
|
||||
│ (Event Streaming) │
|
||||
└───────────┬───────────┘
|
||||
│
|
||||
┌───────────────┼───────────────┐
|
||||
│ │ │
|
||||
▼ ▼ ▼
|
||||
┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
|
||||
│ Apache Flink │ │ RisingWave│ │ Apache Spark │
|
||||
│ (Streaming ETL)│ │ (Stream │ │ (Batch ETL) │
|
||||
│ │ │ SQL) │ │ │
|
||||
└────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
|
||||
│ │ │
|
||||
└────────────────┼─────────────────┘
|
||||
│ Write Iceberg Tables
|
||||
▼
|
||||
┌───────────────────────┐
|
||||
│ Nessie │
|
||||
│ (Iceberg Catalog) │
|
||||
│ Git-like versioning │
|
||||
└───────────┬───────────┘
|
||||
│
|
||||
▼
|
||||
┌───────────────────────┐
|
||||
│ NFS Storage │
|
||||
│ (candlekeep:/lakehouse)│
|
||||
└───────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌───────────────────────┐
|
||||
│ Trino │
|
||||
│ (Interactive Query) │
|
||||
│ + Grafana Dashboards │
|
||||
└───────────────────────┘
|
||||
```
|
||||
|
||||
## Component Details
|
||||
|
||||
### Apache Nessie (Iceberg Catalog)
|
||||
|
||||
**Purpose:** Git-like version control for data tables
|
||||
|
||||
```yaml
|
||||
# HelmRelease: nessie
|
||||
# Version: 0.107.1
|
||||
spec:
|
||||
versionStoreType: ROCKSDB # Embedded storage
|
||||
catalog:
|
||||
iceberg:
|
||||
configDefaults:
|
||||
warehouse: s3://lakehouse/
|
||||
```
|
||||
|
||||
**Features:**
|
||||
- Branch/tag data versions
|
||||
- Time travel queries
|
||||
- Multi-table transactions
|
||||
- Cross-engine compatibility
|
||||
|
||||
### Apache Spark (Batch Processing)
|
||||
|
||||
**Purpose:** Large-scale batch ETL and ML feature engineering
|
||||
|
||||
```yaml
|
||||
# SparkApplication for HTTPRoute analytics
|
||||
apiVersion: sparkoperator.k8s.io/v1beta2
|
||||
kind: SparkApplication
|
||||
spec:
|
||||
type: Python
|
||||
mode: cluster
|
||||
sparkConf:
|
||||
spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
|
||||
spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
|
||||
spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- Daily HTTPRoute log aggregation
|
||||
- Feature engineering for ML
|
||||
- Historical data compaction
|
||||
|
||||
### Apache Flink (Stream Processing)
|
||||
|
||||
**Purpose:** Real-time event processing
|
||||
|
||||
```yaml
|
||||
# Flink Kubernetes Operator
|
||||
# Version: 1.13.0
|
||||
spec:
|
||||
job:
|
||||
jarURI: local:///opt/flink/jobs/httproute-analytics.jar
|
||||
parallelism: 2
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- Real-time traffic anomaly detection
|
||||
- Streaming ETL to Iceberg
|
||||
- Session windowing for user analytics
|
||||
|
||||
### RisingWave (Streaming SQL)
|
||||
|
||||
**Purpose:** Simplified streaming SQL for real-time dashboards
|
||||
|
||||
```sql
|
||||
-- Materialized view for real-time traffic
|
||||
CREATE MATERIALIZED VIEW traffic_5min AS
|
||||
SELECT
|
||||
window_start,
|
||||
route_name,
|
||||
COUNT(*) as request_count,
|
||||
AVG(response_time_ms) as avg_latency
|
||||
FROM httproute_events
|
||||
GROUP BY
|
||||
TUMBLE(event_time, INTERVAL '5 MINUTES'),
|
||||
route_name;
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- Real-time Grafana dashboards
|
||||
- Streaming aggregations
|
||||
- Alerting triggers
|
||||
|
||||
### Trino (Interactive Query)
|
||||
|
||||
**Purpose:** Fast SQL queries across Iceberg tables
|
||||
|
||||
```yaml
|
||||
# Trino coordinator + 2 workers
|
||||
catalogs:
|
||||
iceberg: |
|
||||
connector.name=iceberg
|
||||
iceberg.catalog.type=nessie
|
||||
iceberg.nessie.uri=http://nessie:19120/api/v1
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- Ad-hoc analytics queries
|
||||
- Grafana data source for dashboards
|
||||
- Cross-table JOINs
|
||||
|
||||
## Data Flow: HTTPRoute Analytics
|
||||
|
||||
```
|
||||
Envoy Gateway
|
||||
│
|
||||
▼ (access logs via OTEL)
|
||||
NATS JetStream
|
||||
│
|
||||
├─► Flink Job (streaming)
|
||||
│ │
|
||||
│ ▼
|
||||
│ Iceberg Table: httproute_raw
|
||||
│
|
||||
└─► Spark Job (nightly batch)
|
||||
│
|
||||
▼
|
||||
Iceberg Table: httproute_daily_agg
|
||||
│
|
||||
▼
|
||||
Trino ─► Grafana Dashboard
|
||||
```
|
||||
|
||||
## Storage Layout
|
||||
|
||||
```
|
||||
candlekeep:/kubernetes/lakehouse/
|
||||
├── warehouse/
|
||||
│ └── analytics/
|
||||
│ ├── httproute_raw/ # Raw events (partitioned by date)
|
||||
│ ├── httproute_daily_agg/ # Daily aggregates
|
||||
│ ├── inference_metrics/ # ML inference stats
|
||||
│ └── feature_store/ # ML features
|
||||
└── checkpoints/
|
||||
├── flink/ # Flink savepoints
|
||||
└── spark/ # Spark checkpoints
|
||||
```
|
||||
|
||||
## Resource Allocation
|
||||
|
||||
| Component | Replicas | CPU | Memory |
|
||||
|-----------|----------|-----|--------|
|
||||
| Nessie | 1 | 0.5 | 512Mi |
|
||||
| Spark Operator | 1 | 0.2 | 256Mi |
|
||||
| Flink Operator | 1 | 0.2 | 256Mi |
|
||||
| Flink JobManager | 1 | 1 | 2Gi |
|
||||
| Flink TaskManager | 2 | 2 | 4Gi |
|
||||
| RisingWave | 1 | 2 | 4Gi |
|
||||
| Trino Coordinator | 1 | 1 | 2Gi |
|
||||
| Trino Worker | 2 | 2 | 4Gi |
|
||||
|
||||
## Links
|
||||
|
||||
* [Apache Iceberg](https://iceberg.apache.org/)
|
||||
* [Project Nessie](https://projectnessie.org/)
|
||||
* [Apache Flink](https://flink.apache.org/)
|
||||
* [RisingWave](https://risingwave.com/)
|
||||
* [Trino](https://trino.io/)
|
||||
* Related: [ADR-0025](0025-observability-stack.md) - Observability Stack
|
||||
Reference in New Issue
Block a user