updating to match everything in my homelab.
@@ -52,8 +52,8 @@ WebAuthn provides the best security (phishing-resistant) and user experience (to
 | Application | WebAuthn Support | Current Status | Action Required |
 |-------------|------------------|----------------|-----------------|
-| Authentik | ✅ Native | ✅ Working | Configure enforcement policies |
-| Vaultwarden | ✅ Native | ⚠️ Partial | Enable in admin settings |
+| Authentik | ✅ Native | ⚠️ In Progress | Configure enforcement policies |
+| Vaultwarden | ✅ Native | ✅ Implemented | None - WebAuthn enrolled |
 
 ## Authentik Configuration
301 decisions/0031-gitea-cicd-strategy.md Normal file
@@ -0,0 +1,301 @@
# Gitea CI/CD Pipeline Strategy

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish CI/CD patterns for building and publishing container images via Gitea Actions

## Context and Problem Statement

The homelab uses Gitea as the Git hosting platform. Applications need automated CI/CD pipelines to build container images, run tests, and publish artifacts. Gitea Actions provides GitHub Actions-compatible workflow execution.

How do we configure CI/CD pipelines that work reliably with the homelab's self-hosted infrastructure, including a private container registry, rootless Docker-in-Docker runners, and internal services?

## Decision Drivers

* Self-hosted - no external CI/CD dependencies
* Container registry integration - push to Gitea's built-in registry
* Rootless security - runners don't require privileged containers
* Internal networking - leverage cluster service discovery
* Semantic versioning - automated version bumps based on commit messages

## Considered Options

1. **Gitea Actions with rootless DinD runners**
2. **External CI/CD (GitHub Actions, GitLab CI)**
3. **Self-hosted Jenkins/Drone**
4. **Tekton Pipelines**
## Decision Outcome

Chosen option: **Option 1 - Gitea Actions with rootless DinD runners**

Gitea Actions provides GitHub Actions compatibility, runs inside the cluster with access to internal services, and supports rootless Docker-in-Docker for secure container builds.

### Positive Consequences

* GitHub Actions syntax familiarity
* In-cluster access to internal services
* Built-in container registry integration
* No external dependencies
* Rootless execution for security

### Negative Consequences

* Some GitHub Actions may not work (org-specific actions)
* Rootless DinD has some limitations
* Self-hosted maintenance burden

## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Developer Push                                │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                Gitea Server                                 │
│                          (git.daviestechlabs.io)                            │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Actions Trigger                                                    │    │
│  │  • Push to main branch                                              │    │
│  │  • Pull request                                                     │    │
│  │  • Tag creation                                                     │    │
│  │  • workflow_dispatch                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            Gitea Actions Runner                             │
│                         (rootless Docker-in-Docker)                         │
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│   │  Checkout   │───▶│   Buildx    │───▶│    Push     │                     │
│   │             │    │   Build     │    │  Registry   │                     │
│   └─────────────┘    └─────────────┘    └──────┬──────┘                     │
│                                                │                            │
└────────────────────────────────────────────────┼────────────────────────────┘
                                                 │
                                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Gitea Container Registry                           │
│                  (gitea-http.gitea.svc.cluster.local:3000)                  │
│                                                                             │
│  Images:                                                                    │
│  • daviestechlabs/ray-worker-nvidia:v1.0.1                                  │
│  • daviestechlabs/ray-worker-rdna2:v1.0.1                                   │
│  • daviestechlabs/ray-worker-strixhalo:v1.0.1                               │
│  • daviestechlabs/ray-worker-intel:v1.0.1                                   │
│  • daviestechlabs/ntfy-discord:latest                                       │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Runner Configuration

### Rootless Docker-in-Docker

The runner uses rootless Docker for security:

```yaml
# Runner deployment uses rootless DinD
# No privileged containers required
# No sudo access in workflows
```
### Runner Registration

Runners must be registered with **project-scoped tokens**, not instance tokens:

1. Go to **Repository → Settings → Actions → Runners**
2. Create a new runner registration token at the project level
3. Use that token when registering the runner

**Common mistake:** Registering with an instance-level token causes jobs never to be picked up.
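For reference, a registration with a project-scoped token might look like the following sketch (the instance URL, runner name, and label are illustrative values, not taken from this document):

```bash
# Hypothetical invocation of Gitea's act_runner with a project-scoped token
act_runner register --no-interactive \
  --instance https://git.daviestechlabs.io \
  --token <project-scoped-token> \
  --name homelab-runner \
  --labels ubuntu-latest:docker://node:20
```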
## Registry Authentication

### Internal HTTP Endpoint

Use the internal cluster DNS name for registry access. This avoids:

- The Cloudflare tunnel's 100MB upload limit
- TLS certificate issues
- External network latency

```yaml
env:
  REGISTRY: gitea-http.gitea.svc.cluster.local:3000/daviestechlabs
  REGISTRY_HOST: gitea-http.gitea.svc.cluster.local:3000
```
### Buildx Configuration

Configure buildx to use HTTP for the internal registry:

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    buildkitd-config-inline: |
      [registry."gitea-http.gitea.svc.cluster.local:3000"]
        http = true
        insecure = true
```
### Credential Configuration

For rootless DinD, create the Docker config directly (no `docker login` - it defaults to HTTPS):

```yaml
- name: Configure Gitea Registry Auth
  if: github.event_name != 'pull_request'
  run: |
    AUTH=$(echo -n "${{ secrets.REGISTRY_USER }}:${{ secrets.REGISTRY_TOKEN }}" | base64 -w0)
    mkdir -p ~/.docker
    cat > ~/.docker/config.json << EOF
    {
      "auths": {
        "${{ env.REGISTRY_HOST }}": {
          "auth": "$AUTH"
        }
      }
    }
    EOF
```

**Important:** Buildx reads `~/.docker/config.json` for authentication during push. Do NOT use `docker login` for HTTP registries as it defaults to HTTPS.
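The `auth` entry written by the heredoc above is just the base64 encoding of `user:token`. A small Python sketch makes the shape of the file explicit (the username and token here are placeholder values, not real credentials):

```python
import base64
import json

def docker_auth_config(registry: str, user: str, token: str) -> dict:
    # Same structure the workflow step writes to ~/.docker/config.json
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    return {"auths": {registry: {"auth": auth}}}

cfg = docker_auth_config("gitea-http.gitea.svc.cluster.local:3000", "ci-bot", "s3cret")
print(json.dumps(cfg, indent=2))
```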
### Required Secrets

Configure in **Repository → Settings → Actions → Secrets**:

| Secret | Purpose |
|--------|---------|
| `REGISTRY_USER` | Gitea username with package write access |
| `REGISTRY_TOKEN` | Gitea access token with `write:package` scope |
| `DOCKERHUB_TOKEN` | (Optional) Docker Hub token for rate limit bypass |
## Semantic Versioning

### Commit Message Conventions

Version bumps are determined from commit message prefixes:

| Prefix | Bump Type | Example |
|--------|-----------|---------|
| `major:` or `BREAKING CHANGE` | Major (x.0.0) | `major: Remove deprecated API` |
| `minor:`, `feat:`, `feature:` | Minor (0.x.0) | `feat: Add new endpoint` |
| (anything else) | Patch (0.0.x) | `fix: Correct typo` |
### Version Calculation

```yaml
- name: Calculate semantic version
  id: version
  run: |
    LATEST=$(git describe --tags --abbrev=0 2>/dev/null || echo "v0.0.0")
    VERSION=${LATEST#v}
    IFS='.' read -r MAJOR MINOR PATCH <<< "$VERSION"

    MSG="${{ github.event.head_commit.message }}"
    if echo "$MSG" | grep -qiE "^major:|BREAKING CHANGE"; then
      MAJOR=$((MAJOR + 1)); MINOR=0; PATCH=0
    elif echo "$MSG" | grep -qiE "^(minor:|feat:|feature:)"; then
      MINOR=$((MINOR + 1)); PATCH=0
    else
      PATCH=$((PATCH + 1))
    fi

    echo "version=v${MAJOR}.${MINOR}.${PATCH}" >> $GITHUB_OUTPUT
```
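The bump logic in that shell step can be sketched in Python for clarity (a simplified mirror, not part of the pipeline; `next_version` is a hypothetical helper name):

```python
import re

def next_version(latest_tag: str, commit_msg: str) -> str:
    # Mirrors the shell step: major on "major:"/"BREAKING CHANGE",
    # minor on "minor:"/"feat:"/"feature:", patch otherwise.
    major, minor, patch = (int(p) for p in latest_tag.lstrip("v").split("."))
    if re.search(r"^major:|BREAKING CHANGE", commit_msg, re.IGNORECASE | re.MULTILINE):
        major, minor, patch = major + 1, 0, 0
    elif re.match(r"(minor:|feat:|feature:)", commit_msg, re.IGNORECASE):
        minor, patch = minor + 1, 0
    else:
        patch += 1
    return f"v{major}.{minor}.{patch}"
```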
### Automatic Tagging

After a successful build, create and push a git tag:

```yaml
- name: Create and push tag
  run: |
    # $VERSION and $BUMP are assumed to be exported from the version-calculation step
    git config user.name "gitea-actions[bot]"
    git config user.email "actions@git.daviestechlabs.io"
    git tag -a "$VERSION" -m "Release $VERSION ($BUMP)"
    git push origin "$VERSION"
```
## Notifications

### ntfy Integration

Send build status to ntfy for notifications:

```yaml
- name: Notify on success
  run: |
    curl -s \
      -H "Title: ✅ Images Built: ${{ gitea.repository }}" \
      -H "Priority: default" \
      -H "Tags: white_check_mark,docker" \
      -d "Version: ${{ needs.determine-version.outputs.version }}" \
      http://ntfy.observability.svc.cluster.local:80/gitea-ci
```
## Skip Patterns

### Commit Message Skip Flags

| Flag | Effect |
|------|--------|
| `[skip images]` | Skip all image builds |
| `[ray-serve only]` | Skip worker images |
| `[skip ci]` | Skip entire workflow |
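Gitea typically honors `[skip ci]` itself before a workflow starts; the custom flags can be implemented with job-level conditions along these lines (a sketch; the job name is illustrative):

```yaml
jobs:
  build-worker-images:
    # Custom flags are checked explicitly; `[skip ci]` is handled by Gitea
    if: >-
      !contains(github.event.head_commit.message, '[skip images]') &&
      !contains(github.event.head_commit.message, '[ray-serve only]')
```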
### Path-based Triggers

Only run on relevant file changes:

```yaml
on:
  push:
    paths:
      - 'dockerfiles/**'
      - '.gitea/workflows/build-push.yaml'
```
## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Jobs not picked up | Instance token instead of project token | Re-register with project-scoped token |
| 401 Unauthorized | Missing or wrong registry credentials | Check `REGISTRY_USER` and `REGISTRY_TOKEN` secrets |
| "http: server gave HTTP response to HTTPS client" | Using `docker login` with HTTP registry | Create `config.json` directly; don't use `docker login` |
| Cloudflare 100MB upload limit | Using external endpoint for large images | Use internal HTTP endpoint |
| TLS certificate error | Using HTTPS with self-signed cert | Use internal HTTP endpoint with buildkitd `http = true` |
| sudo not found | Rootless DinD has no sudo | Use user-space configuration methods |
| "must contain at least one job without dependencies" | All jobs have `needs` | Ensure at least one job has no `needs` clause |

### Debugging

1. Check runner logs in the Gitea Actions UI
2. Add debug output: `echo "::debug::Variable=$VAR"`
3. Set the `ACTIONS_STEP_DEBUG` secret to `true` for verbose step logging
## Workflow Template

See [kuberay-images/.gitea/workflows/build-push.yaml](https://git.daviestechlabs.io/daviestechlabs/kuberay-images/src/branch/main/.gitea/workflows/build-push.yaml) for a complete example.

## Future Enhancements

1. **Caching improvements** - Persistent layer cache across builds
2. **Multi-arch builds** - ARM64 support for Raspberry Pi
3. **Security scanning** - Trivy integration in CI
4. **Signed images** - Cosign for image signatures
5. **SLSA provenance** - Supply chain attestations

## References

* [Gitea Actions Documentation](https://docs.gitea.com/usage/actions/overview)
* [Docker Buildx Documentation](https://docs.docker.com/build/buildx/)
* [Semantic Versioning](https://semver.org/)
180 decisions/0032-velero-backup-strategy.md Normal file
@@ -0,0 +1,180 @@
# Velero Backup and Disaster Recovery Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities

## Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

## Decision Drivers

* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility

## Considered Options

1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**
## Decision Outcome

Chosen option: **Option 1 - Velero with Node Agent (Kopia)**

Velero provides comprehensive Kubernetes backup/restore with file-level PVC backups via the Node Agent (formerly Restic, now Kopia). Backups are stored on the external NAS via S3-compatible storage.

### Positive Consequences

* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up via file-level snapshots
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with retention
* Selective namespace/label restore
* Active CNCF project with strong community

### Negative Consequences

* Node Agent runs as DaemonSet (14 pods on current cluster)
* File-level backup slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling

## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                         Velero Server                         │
│                      (velero namespace)                       │
└────────────────────────────┬──────────────────────────────────┘
                             │
               ┌─────────────┼─────────────┐
               │             │             │
               ▼             ▼             ▼
         ┌───────────┐ ┌───────────┐ ┌───────────┐
         │   Node    │ │   Node    │ │   Node    │
         │   Agent   │ │   Agent   │ │   Agent   │
         │ (per node)│ │ (per node)│ │ (per node)│
         └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
               │             │             │
               └─────────────┼─────────────┘
                             │
                             ▼
              ┌─────────────────────────────┐
              │    BackupStorageLocation    │
              │  (S3 on NAS - candlekeep)   │
              │       /backups/velero       │
              └─────────────────────────────┘
```
## Configuration

### Schedule

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true
    ttl: 720h  # 30 days retention
```
### Backup Storage Location

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```
## Backup Scope

### Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |

### Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |
## Recovery Procedures

### Full Cluster Recovery

1. Bootstrap new Talos cluster
2. Install Velero with same BSL configuration
3. `velero restore create --from-backup nightly-cluster-backup-YYYYMMDD`
4. Re-bootstrap Flux for GitOps reconciliation

### Selective Namespace Recovery

```bash
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes=true
```
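Restore progress can then be inspected with the standard Velero CLI (a sketch; the restore name shown is illustrative, since Velero generates it from the backup name and a timestamp):

```bash
velero restore get
velero restore describe nightly-cluster-backup-20260205020000-20260206 --details
velero restore logs nightly-cluster-backup-20260205020000-20260206
```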
### Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR):

```bash
# CNPG handles its own WAL archiving to S3
# Velero provides a secondary backup layer
```
## Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |
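The thresholds above could be expressed as Prometheus alerting rules along these lines (a sketch; the group and alert names are illustrative, while the metric names are the ones listed in the table):

```yaml
groups:
  - name: velero
    rules:
      - alert: VeleroBackupFailed
        expr: increase(velero_backup_failure_total[1h]) > 0
      - alert: VeleroBackupStale
        # No successful backup counted in 25h: one missed nightly run plus slack
        expr: increase(velero_backup_success_total[25h]) == 0
```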
## Links

* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy
267 decisions/0033-data-analytics-platform.md Normal file
@@ -0,0 +1,267 @@
# Data Analytics Platform Architecture

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering

## Context and Problem Statement

The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:

- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights

How do we build a scalable analytics platform that supports both batch and real-time processing?

## Decision Drivers

* Modern lakehouse architecture (SQL + streaming)
* Real-time and batch processing capabilities
* Cost-effective on homelab hardware
* Integration with existing observability stack
* Support for ML feature pipelines
* Open table formats for interoperability

## Considered Options

1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave**
2. **Traditional DWH: ClickHouse only**
3. **Cloud-native: Databricks/Snowflake (SaaS)**
4. **Minimal: PostgreSQL with TimescaleDB**
## Decision Outcome

Chosen option: **Option 1 - Modern Lakehouse Architecture**

A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.

### Positive Consequences

* Unified batch and streaming on same data
* Git-like versioning of tables via Nessie
* Standard SQL across all engines
* Decoupled compute and storage
* Open formats prevent vendor lock-in
* ML feature engineering support

### Negative Consequences

* Complex multi-component architecture
* Higher resource requirements
* Steeper learning curve
* Multiple operators to maintain

## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                DATA SOURCES                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │  Envoy Logs  │  │ Application  │  │  Inference   │  │  Prometheus  │     │
│  │ (HTTPRoute)  │  │  Telemetry   │  │   Metrics    │  │   Metrics    │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
          │                 │                 │                 │
          └─────────────────┼─────────────────┘                 │
                            │                                   │
                            ▼                                   │
                ┌───────────────────────┐                       │
                │    NATS JetStream     │◄──────────────────────┘
                │   (Event Streaming)   │
                └───────────┬───────────┘
                            │
           ┌────────────────┼─────────────────┐
           │                │                 │
           ▼                ▼                 ▼
  ┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
  │  Apache Flink   │ │ RisingWave│ │   Apache Spark    │
  │ (Streaming ETL) │ │  (Stream  │ │    (Batch ETL)    │
  │                 │ │   SQL)    │ │                   │
  └────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
           │                │                 │
           └────────────────┼─────────────────┘
                            │  Write Iceberg Tables
                            ▼
                ┌───────────────────────┐
                │        Nessie         │
                │   (Iceberg Catalog)   │
                │  Git-like versioning  │
                └───────────┬───────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │      NFS Storage      │
                │(candlekeep:/lakehouse)│
                └───────────┬───────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │         Trino         │
                │  (Interactive Query)  │
                │ + Grafana Dashboards  │
                └───────────────────────┘
```
## Component Details

### Apache Nessie (Iceberg Catalog)

**Purpose:** Git-like version control for data tables

```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/
```
**Features:**

- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
### Apache Spark (Batch Processing)

**Purpose:** Large-scale batch ETL and ML feature engineering

```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```
**Use Cases:**

- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
### Apache Flink (Stream Processing)

**Purpose:** Real-time event processing

```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2
```
**Use Cases:**

- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
### RisingWave (Streaming SQL)

**Purpose:** Simplified streaming SQL for real-time dashboards

```sql
-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
    window_start,
    route_name,
    COUNT(*) AS request_count,
    AVG(response_time_ms) AS avg_latency
FROM TUMBLE(httproute_events, event_time, INTERVAL '5 MINUTES')
GROUP BY
    window_start,
    route_name;
```
**Use Cases:**

- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
### Trino (Interactive Query)

**Purpose:** Fast SQL queries across Iceberg tables

```yaml
# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie.uri=http://nessie:19120/api/v1
```
**Use Cases:**

- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
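As an illustration, an ad-hoc query over the daily aggregates might look like this (the `iceberg` catalog name matches the Trino configuration above, but the column names are assumptions about the table schema):

```sql
-- Top routes by request volume over the last 7 days (illustrative schema)
SELECT route_name, SUM(request_count) AS total_requests
FROM iceberg.analytics.httproute_daily_agg
WHERE day >= current_date - INTERVAL '7' DAY
GROUP BY route_name
ORDER BY total_requests DESC
LIMIT 10;
```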
## Data Flow: HTTPRoute Analytics

```
Envoy Gateway
      │
      ▼  (access logs via OTEL)
NATS JetStream
      │
      ├─► Flink Job (streaming)
      │         │
      │         ▼
      │   Iceberg Table: httproute_raw
      │
      └─► Spark Job (nightly batch)
                │
                ▼
          Iceberg Table: httproute_daily_agg
                │
                ▼
          Trino ─► Grafana Dashboard
```
## Storage Layout

```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/          # Raw events (partitioned by date)
│       ├── httproute_daily_agg/    # Daily aggregates
│       ├── inference_metrics/      # ML inference stats
│       └── feature_store/          # ML features
└── checkpoints/
    ├── flink/                      # Flink savepoints
    └── spark/                      # Spark checkpoints
```
## Resource Allocation

| Component | Replicas | CPU | Memory |
|-----------|----------|-----|--------|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |
## Links

* [Apache Iceberg](https://iceberg.apache.org/)
* [Project Nessie](https://projectnessie.org/)
* [Apache Flink](https://flink.apache.org/)
* [RisingWave](https://risingwave.com/)
* [Trino](https://trino.io/)
* Related: [ADR-0025](0025-observability-stack.md) - Observability Stack
206 decisions/0034-volcano-batch-scheduling.md Normal file
@@ -0,0 +1,206 @@
# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads

## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios

## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**
## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.

### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler

### Negative Consequences

* Additional scheduler components (admission, controller, scheduler)
* Learning curve for queue configuration
* Workloads must opt-in via scheduler name

## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                        Volcano System                         │
│                   (volcano-system namespace)                  │
│                                                               │
│ ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐ │
│ │    Admission    │  │    Controllers    │  │   Scheduler   │ │
│ │     Webhook     │  │  (Job lifecycle)  │  │  (Placement)  │ │
│ └─────────────────┘  └───────────────────┘  └───────────────┘ │
└───────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                            Queues                             │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ ml-training   │  analytics  │  inference  │   default   │  │
│  │ weight: 40    │  weight: 30 │  weight: 20 │  weight: 10 │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                           Workloads                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │ │
│  │ (analytics)  │  │ (analytics)  │  │    (ml-training)     │ │
│  └──────────────┘  └──────────────┘  └──────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
```
## Configuration

### Queue Definition

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```
### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```
### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptible |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |

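The weights above translate into proportional shares of divisible cluster resources. A minimal sketch of the standard weighted fair-share arithmetic (an illustration of the concept, not Volcano's exact algorithm):

```python
# Each queue's deserved share is its weight over the sum of all weights.
QUEUE_WEIGHTS = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}

def deserved_shares(weights: dict) -> dict:
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

shares = deserved_shares(QUEUE_WEIGHTS)
print(shares["ml-training"])  # 0.4 -- 40% of contended resources
```

Guarantees are honored first; the weighted split only applies to capacity beyond the guaranteed minimums, capped by each queue's `capability`.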
## Scheduler Selection

Workloads opt into Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.

## Preemption Policy

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```

## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |

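For dashboards, the queue metrics above can be turned into a saturation signal; a sketch (the `_milli_cpu` suffix and `queue_name` label are assumptions to verify against the actual metrics Volcano exposes):

```promql
# CPU demand the ml-training queue cannot yet satisfy
volcano_queue_pending_milli_cpu{queue_name="ml-training"}
```

A sustained non-zero value here while other queues sit below their guarantees is the signal that reclaim/preemption is not keeping up.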
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform

195
decisions/0035-arm64-worker-strategy.md
Normal file
@@ -0,0 +1,195 @@

# ARM64 Raspberry Pi Worker Node Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Integrate Raspberry Pi nodes into the Kubernetes cluster

## Context and Problem Statement

The homelab cluster includes 5 Raspberry Pi 4/5 nodes (ARM64 architecture) alongside x86_64 servers. These low-power nodes provide:

- Additional compute capacity for lightweight workloads
- Geographic distribution within the home network
- A learning platform for multi-architecture Kubernetes

However, ARM64 nodes have constraints:

- No GPU acceleration
- Lower CPU/memory than the x86_64 servers
- Some container images lack ARM64 support
- Limited local storage

How do we integrate ARM64 nodes effectively while avoiding scheduling failures?

## Decision Drivers

* Maximize utilization of ARM64 compute
* Prevent ARM-incompatible workloads from scheduling
* Maintain cluster stability
* Support multi-arch container images
* Minimize operational overhead

## Considered Options

1. **Node labels + affinity for workload placement**
2. **Separate ARM64-only namespace**
3. **Taints to exclude from general scheduling**
4. **ARM64 nodes for specific workload types only**

## Decision Outcome

Chosen option: **Option 1 + Option 4 hybrid** - Use node labels with affinity rules, and designate ARM64 nodes for specific workload categories.

ARM64 nodes handle:

- Lightweight control plane components (where multi-arch images exist)
- Velero node-agent (backup DaemonSet)
- Node-level monitoring (Prometheus node-exporter)
- Future: Edge/IoT workloads

### Positive Consequences

* Clear workload segmentation
* No scheduling failures from arch mismatch
* Efficient use of low-power nodes
* Room for future ARM-specific workloads
* Cost-effective cluster expansion

### Negative Consequences

* Some nodes may be underutilized
* Must maintain multi-arch image awareness
* Additional scheduling complexity

## Cluster Composition

| Node | Architecture | Role | Instance Type |
|------|--------------|------|---------------|
| bruenor | amd64 | control-plane | - |
| catti | amd64 | control-plane | - |
| storm | amd64 | control-plane | - |
| khelben | amd64 | GPU worker (Strix Halo) | - |
| elminster | amd64 | GPU worker (NVIDIA) | - |
| drizzt | amd64 | GPU worker (RDNA2) | - |
| danilo | amd64 | GPU worker (Intel Arc) | - |
| regis | amd64 | worker | - |
| wulfgar | amd64 | worker | - |
| **durnan** | **arm64** | worker | raspberry-pi |
| **elaith** | **arm64** | worker | raspberry-pi |
| **jarlaxle** | **arm64** | worker | raspberry-pi |
| **mirt** | **arm64** | worker | raspberry-pi |
| **volo** | **arm64** | worker | raspberry-pi |

## Node Labels

```yaml
# Applied via Talos machine config or kubectl
labels:
  kubernetes.io/arch: arm64
  kubernetes.io/os: linux
  node.kubernetes.io/instance-type: raspberry-pi
  kubernetes.io/storage: none  # No Longhorn on Pis
```

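Conversely, a workload that should run only on the Pi nodes can select these labels directly; a minimal pod-spec fragment using a `nodeSelector` (illustrative, assuming the labels shown above are applied):

```yaml
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    node.kubernetes.io/instance-type: raspberry-pi
```

A `nodeSelector` is a hard requirement; use `preferredDuringSchedulingIgnoredDuringExecution` affinity instead if the workload should merely prefer the Pis.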
## Workload Placement

### DaemonSets (Run Everywhere)

These run on all nodes, including ARM64:

| DaemonSet | Namespace | Multi-arch |
|-----------|-----------|------------|
| velero-node-agent | velero | ✅ |
| cilium-agent | kube-system | ✅ |
| node-exporter | observability | ✅ |

### ARM64-Excluded Workloads

These explicitly exclude ARM64 via node affinity:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - amd64
```

| Workload Type | Reason for Exclusion |
|---------------|----------------------|
| GPU workloads | No GPU on Pis |
| Longhorn | Pis have no storage label |
| Heavy databases | Insufficient resources |
| Most HelmReleases | Image compatibility |

### ARM64-Compatible Light Workloads

Potential future workloads for ARM64 nodes:

| Workload | Use Case |
|----------|----------|
| MQTT broker | IoT message routing |
| Pi-hole | DNS ad blocking |
| Home Assistant | Home automation |
| Lightweight proxies | Traffic routing |

## Storage Exclusion

ARM64 nodes are excluded from Longhorn:

```yaml
# Longhorn Helm values
defaultSettings:
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
```

Node label:

```yaml
kubernetes.io/storage: none
```

## Resource Constraints

| Node Type | CPU | Memory | Typical Available |
|-----------|-----|--------|-------------------|
| Raspberry Pi 4 | 4 cores | 4-8GB | 3 cores, 3GB |
| Raspberry Pi 5 | 4 cores | 8GB | 3.5 cores, 6GB |

## Multi-Architecture Image Strategy

For workloads that should run on ARM64:

1. **Use multi-arch base images** (e.g., `alpine`, `debian`)
2. **Build with Docker buildx**:

   ```bash
   docker buildx build --platform linux/amd64,linux/arm64 -t myimage:latest .
   ```

3. **Verify arch support** before deployment

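Step 3 can be scripted against `docker manifest inspect` output; a minimal sketch that checks an OCI image index for a `linux/arm64` entry (the sample index below is illustrative, not a real image):

```python
import json

# Decide whether an image index advertises a given platform variant.
def supports_platform(index: dict, arch: str = "arm64", os_name: str = "linux") -> bool:
    return any(
        m.get("platform", {}).get("architecture") == arch
        and m.get("platform", {}).get("os") == os_name
        for m in index.get("manifests", [])
    )

# Illustrative shape of `docker manifest inspect` output for a multi-arch image.
sample = json.loads("""
{
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux"}}
  ]
}
""")
print(supports_platform(sample))  # True
```

Wiring this into CI before a HelmRelease bump catches arch mismatches before they become `ImagePullBackOff` on the Pis.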
## Monitoring ARM64 Nodes

```promql
# Available memory by node and architecture
sum by (node, label_kubernetes_io_arch) (
  node_memory_MemAvailable_bytes
  * on(node) group_left(label_kubernetes_io_arch)
  kube_node_labels{label_kubernetes_io_arch!=""}
)
```

## Future Considerations

- **Edge workloads**: ARM64 nodes are ideal for edge compute patterns
- **IoT integration**: MQTT, sensor data collection
- **Scale-out**: Add more Pis for lightweight workload capacity
- **ARM64 ML inference**: Some models support ARM (TensorFlow Lite)

## Links

* [Kubernetes Multi-Architecture](https://kubernetes.io/docs/concepts/containers/images/#multi-architecture-images)
* [Talos on Raspberry Pi](https://talos.dev/v1.12/talos-guides/install/single-board-computers/rpi_generic/)
* Related: [ADR-0002](0002-use-talos-linux.md) - Use Talos Linux
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy

256
decisions/0036-renovate-dependency-updates.md
Normal file
@@ -0,0 +1,256 @@

# Automated Dependency Updates with Renovate

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Automate dependency updates across all homelab repositories

## Context and Problem Statement

The homelab consists of 20+ repositories containing:

- Kubernetes manifests with container image references
- Helm chart versions
- Python/Go dependencies
- GitHub Actions / Gitea Actions workflow versions

Manually tracking and updating dependencies is:

- Time-consuming
- Error-prone
- Often neglected until security issues arise

How do we automate dependency updates while maintaining control over what gets updated?

## Decision Drivers

* Automated detection of outdated dependencies
* PR-based update workflow for review
* Support for Kubernetes manifests, Helm, Python, Go, Docker
* Self-hosted on existing infrastructure
* Configurable grouping and scheduling
* Security update prioritization

## Considered Options

1. **Renovate (self-hosted)**
2. **Dependabot (GitHub-native)**
3. **Manual updates with version scripts**
4. **Flux image automation**

## Decision Outcome

Chosen option: **Option 1 - Renovate (self-hosted)**

Renovate runs as a CronJob in the cluster, scanning all repositories in the Gitea organization and creating PRs for outdated dependencies. It supports more package managers than Dependabot and works with Gitea.

### Positive Consequences

* Comprehensive manager support (40+ package managers)
* Works with self-hosted Gitea
* Configurable grouping (batch minor updates)
* Auto-merge for patch/minor updates
* Dashboard for update overview
* Reusable preset configurations

### Negative Consequences

* Additional CronJob to maintain
* Configuration complexity
* API token management for Gitea access

## Architecture

```
┌───────────────────────────────────────────────────────────────────┐
│                        Renovate CronJob                           │
│                        (ci-cd namespace)                          │
│                                                                   │
│  Schedule: Every 8 hours (0 */8 * * *)                            │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    Renovate Container                      │  │
│  │                                                            │  │
│  │  1. Fetch repositories from Gitea org                      │  │
│  │  2. Scan each repo for dependencies                        │  │
│  │  3. Compare versions with upstream registries              │  │
│  │  4. Create/update PRs for outdated deps                    │  │
│  │  5. Auto-merge approved patches                            │  │
│  └────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│                              Gitea                                │
│                                                                   │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐          │
│  │ homelab-k8s2  │  │ chat-handler  │  │ kuberay-images│          │
│  │               │  │               │  │               │          │
│  │ PR: Update    │  │ PR: Update    │  │ PR: Update    │          │
│  │ flux to 2.5.0 │  │ httpx to 0.28 │  │ ROCm to 6.4   │          │
│  └───────────────┘  └───────────────┘  └───────────────┘          │
└───────────────────────────────────────────────────────────────────┘
```

## Configuration

### CronJob

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: renovate
  namespace: ci-cd
spec:
  schedule: "0 */8 * * *"  # Every 8 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: renovate
              image: renovate/renovate:39
              env:
                - name: RENOVATE_PLATFORM
                  value: "gitea"
                - name: RENOVATE_ENDPOINT
                  value: "https://git.daviestechlabs.io/api/v1"
                - name: RENOVATE_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: renovate-github-token
                      key: token
                - name: RENOVATE_AUTODISCOVER
                  value: "true"
                - name: RENOVATE_AUTODISCOVER_FILTER
                  value: "daviestechlabs/*"
          restartPolicy: OnFailure
```

### Repository Config (renovate.json)

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": [
    "config:recommended",
    "group:allNonMajor",
    ":automergeMinor",
    ":automergePatch"
  ],
  "kubernetes": {
    "fileMatch": ["\\.ya?ml$"]
  },
  "packageRules": [
    {
      "matchManagers": ["helm-values", "helmv3"],
      "groupName": "helm charts"
    },
    {
      "matchPackagePatterns": ["^ghcr.io/"],
      "groupName": "GHCR images"
    },
    {
      "matchUpdateTypes": ["major"],
      "automerge": false,
      "labels": ["major-update"]
    }
  ],
  "schedule": ["before 6am on monday"]
}
```

## Supported Package Managers

| Manager | File Patterns | Examples |
|---------|---------------|----------|
| kubernetes | `*.yaml`, `*.yml` | Container images in Deployments |
| helm | `Chart.yaml`, `values.yaml` | Helm chart dependencies |
| helmv3 | HelmRelease CRDs | Flux HelmReleases |
| flux | Flux CRDs | GitRepository, OCIRepository |
| pip | `requirements.txt`, `pyproject.toml` | Python packages |
| gomod | `go.mod` | Go modules |
| dockerfile | `Dockerfile*` | Base images |
| github-actions | `.github/workflows/*.yml` | Action versions |
| gitea-actions | `.gitea/workflows/*.yml` | Action versions |

## Update Strategy

### Auto-merge Enabled

| Update Type | Auto-merge | Delay |
|-------------|------------|-------|
| Patch (x.x.1 → x.x.2) | ✅ Yes | Immediate |
| Minor (x.1.x → x.2.x) | ✅ Yes | 3 days stabilization |
| Major (1.x.x → 2.x.x) | ❌ No | Manual review |

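The table's classification is a plain semver comparison; a minimal sketch of the same rule (illustrative, not Renovate's internal logic):

```python
# Classify a version bump the way the auto-merge table treats it.
def update_type(current: str, new: str) -> str:
    cur = current.split(".")
    nxt = new.split(".")
    if cur[0] != nxt[0]:
        return "major"   # manual review
    if cur[1] != nxt[1]:
        return "minor"   # auto-merge after 3-day stabilization
    return "patch"       # auto-merge immediately

print(update_type("2.4.1", "2.4.2"))  # patch
print(update_type("2.4.1", "2.5.0"))  # minor
print(update_type("2.4.1", "3.0.0"))  # major
```

In Renovate itself this mapping is driven by `matchUpdateTypes` in `packageRules`, as in the repository config above.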
### Grouping Strategy

| Group | Contents | Frequency |
|-------|----------|-----------|
| `all-non-major` | All patch + minor updates | Weekly (Monday) |
| `helm-charts` | All Helm chart updates | Weekly |
| `container-images` | Docker image updates | Weekly |
| `security` | CVE fixes | Immediate |

## Security Updates

Renovate prioritizes security updates:

```json
{
  "vulnerabilityAlerts": {
    "enabled": true,
    "labels": ["security"]
  },
  "packageRules": [
    {
      "matchCategories": ["security"],
      "automerge": true,
      "schedule": ["at any time"],
      "prPriority": 10
    }
  ]
}
```

## Dashboard

Renovate creates a "Dependency Dashboard" issue in each repository:

```markdown
## Dependency Dashboard

### Open PRs
- [ ] Update httpx to 0.28.1 (#42)
- [x] Update pillow to 11.0.0 (#41) - merged

### Pending Approval
- [ ] Major: Update pydantic to v2 (#40)

### Rate Limited
- fastapi (waiting for next schedule window)
```

## Secrets

| Secret | Source | Purpose |
|--------|--------|---------|
| `renovate-github-token` | Vault | Gitea API access |
| `renovate-dockerhub` | Vault | Docker Hub rate limits |

## Monitoring

```promql
# Renovate job success rate
sum(kube_job_status_succeeded{job_name=~"renovate-.*"})
/
sum(kube_job_status_succeeded{job_name=~"renovate-.*"} + kube_job_status_failed{job_name=~"renovate-.*"})
```

## Links

* [Renovate Documentation](https://docs.renovatebot.com/)
* [Renovate Presets](https://docs.renovatebot.com/presets-default/)
* [Gitea Platform Support](https://docs.renovatebot.com/modules/platform/gitea/)
* Related: [ADR-0013](0013-gitea-actions-for-ci.md) - Gitea Actions for CI
* Related: [ADR-0031](0031-gitea-cicd-strategy.md) - Gitea CI/CD Strategy

187
decisions/0037-node-naming-conventions.md
Normal file
@@ -0,0 +1,187 @@

# Node Naming Conventions

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish memorable, role-based naming for cluster nodes

## Context and Problem Statement

The homelab cluster has grown to include:

- 14 Kubernetes nodes (control plane + workers)
- Multiple storage servers
- Development workstations

Generic names like `node-01` and `worker-gpu-1` are:

- Hard to remember
- Uninformative about node purpose
- Boring

How do we name nodes in a way that's memorable, fun, and indicates their role?

## Decision Drivers

* Names should indicate node role/capability
* Easy to remember and reference in conversation
* Consistent theme across the homelab
* Scalable as more nodes are added

## Decision Outcome

Chosen option: **Dungeons & Dragons character naming scheme**

All nodes are named after famous D&D characters from the Forgotten Realms, with character class mapping to node role.

## Naming Scheme

### Control Plane → Companions of the Hall

The control plane nodes are named after the legendary Companions of the Hall, Drizzt's closest allies.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `bruenor` | Bruenor Battlehammer | Intel N100 | Dwarf King of Mithral Hall |
| `catti` | Catti-brie | Intel N100 | Human ranger, Bruenor's adopted daughter |
| `storm` | Storm Silverhand | Intel N100 | Chosen of Mystra, Harper leader |

### Wizards → GPU Nodes (Spellcasters)

Wizards cast powerful spells, just as GPU nodes power AI/ML workloads.

| Node | Character | GPU | Notes |
|------|-----------|-----|-------|
| `khelben` | Khelben "Blackstaff" Arunsun | AMD Radeon 8060S 64GB | Primary AI inference, Strix Halo APU |
| `elminster` | Elminster Aumar | NVIDIA RTX 2070 8GB | CUDA workloads, Sage of Shadowdale |
| `drizzt` | Drizzt Do'Urden* | AMD Radeon 680M | ROCm backup node |
| `danilo` | Danilo Thann | Intel Arc A770 | Intel inference, bard/wizard multiclass |
| `regis` | Regis | NVIDIA GPU | Halfling with magical ruby, spellthief vibes |

*Drizzt is technically a ranger, but his magical scimitars and time in Menzoberranzan qualify him for the GPU tier.

### Rogues → ARM64 Edge Nodes

Rogues are nimble and work in the shadows, which suits lightweight edge compute on Raspberry Pi nodes.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `durnan` | Durnan | Raspberry Pi 4 8GB | Yawning Portal innkeeper, retired adventurer |
| `elaith` | Elaith Craulnober | Raspberry Pi 4 8GB | The Serpent, moon elf rogue |
| `jarlaxle` | Jarlaxle Baenre | Raspberry Pi 4 8GB | Drow mercenary leader |
| `mirt` | Mirt the Moneylender | Raspberry Pi 4 8GB | Harper agent, "Old Wolf" |
| `volo` | Volothamp Geddarm | Raspberry Pi 4 8GB | Famous author and traveler |

### Fighters → x86 CPU Workers

Fighters are the workhorses, handling general compute without magical (GPU) abilities.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `wulfgar` | Wulfgar | Intel x86_64 | Barbarian of Icewind Dale, Aegis-fang wielder |

### Infrastructure Nodes (Locations)

| Node | Character/Location | Role | Notes |
|------|-------------------|------|-------|
| `candlekeep` | Candlekeep | Primary NAS (Synology) | Library fortress, knowledge storage |
| `neverwinter` | Neverwinter | Fast NAS (TrueNAS Scale) | Jewel of the North, all-SSD, nfs-fast |
| `waterdeep` | Waterdeep | Mac Mini dev workstation | City of Splendors, primary city |

### Future Expansion

| Class | Role | Candidate Names |
|-------|------|-----------------|
| Clerics | Database/backup nodes | Cadderly, Dawnbringer |
| Fighters | High-CPU compute | Artemis Entreri, Obould |
| Druids | Monitoring/observability | Jaheira, Cernd |
| Bards | API gateways | Other Thann family members |
| Paladins | Security nodes | Ajantis, Keldorn |

## Architecture

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                     Homelab Cluster (14 Kubernetes Nodes)                     │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │              👑 Control Plane (Companions of the Hall)               │    │
│  │                                                                      │    │
│  │    bruenor            catti              storm                       │    │
│  │    Intel N100         Intel N100         Intel N100                  │    │
│  │    "Dwarf King"       "Catti-brie"       "Silverhand"                │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     🧙 Wizards (GPU Spellcasters)                    │    │
│  │                                                                      │    │
│  │    khelben        elminster     drizzt        danilo       regis     │    │
│  │    Radeon 8060S   RTX 2070      Radeon 680M   Arc A770     NVIDIA    │    │
│  │    64GB unified   8GB VRAM      iGPU          16GB         GPU       │    │
│  │    "Blackstaff"   "Sage"        "Ranger"      "Bard"       "Ruby"    │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     🗡️ Rogues (ARM64 Edge Nodes)                     │    │
│  │                                                                      │    │
│  │    durnan        elaith       jarlaxle      mirt          volo       │    │
│  │    Pi 4 8GB      Pi 4 8GB     Pi 4 8GB      Pi 4 8GB      Pi 4 8GB   │    │
│  │    "Innkeeper"   "Serpent"    "Mercenary"   "Old Wolf"    "Author"   │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     ⚔️ Fighters (x86 CPU Workers)                    │    │
│  │                                                                      │    │
│  │    wulfgar                                                           │    │
│  │    Intel x86_64                                                      │    │
│  │    "Barbarian of Icewind Dale"                                       │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────────┐
│                   🏰 Locations (Off-Cluster Infrastructure)                   │
│                                                                               │
│    📚 candlekeep           ❄️ neverwinter             🏙️ waterdeep           │
│    Synology NAS            TrueNAS Scale (SSD)        Mac Mini                │
│    nfs-default             nfs-fast                   Dev workstation         │
│    High capacity           High speed                 Primary dev box         │
│    "Library Fortress"      "Jewel of the North"       "City of Splendors"     │
└───────────────────────────────────────────────────────────────────────────────┘
```

## Storage Mapping

| Location | Storage Class | Speed | Capacity | Use Case |
|----------|---------------|-------|----------|----------|
| Candlekeep | `nfs-default` | HDD | High | Backups, archives, media |
| Neverwinter | `nfs-fast` | SSD | Medium | Database WAL, hot data |
| Longhorn | `longhorn` | Local SSD | Distributed | Replicated app data |

## Node Labels

```yaml
# GPU Wizard nodes
node.kubernetes.io/instance-type: gpu-wizard
homelab.daviestechlabs.io/character-class: wizard
homelab.daviestechlabs.io/character-name: khelben

# ARM64 Rogue nodes
node.kubernetes.io/instance-type: raspberry-pi
homelab.daviestechlabs.io/character-class: rogue
homelab.daviestechlabs.io/character-name: jarlaxle
```

## DNS/Hostname Resolution

All nodes are resolvable via:

- Kubernetes DNS: `<node>.node.kubernetes.io`
- Local DNS: `<node>.lab.daviestechlabs.io`
- mDNS: `<node>.local`

## References

* [Forgotten Realms Wiki](https://forgottenrealms.fandom.com/)
* [Khelben Arunsun](https://forgottenrealms.fandom.com/wiki/Khelben_Arunsun)
* [Elminster](https://forgottenrealms.fandom.com/wiki/Elminster_Aumar)
* [Candlekeep](https://forgottenrealms.fandom.com/wiki/Candlekeep)
* [Neverwinter](https://forgottenrealms.fandom.com/wiki/Neverwinter)
* Related: [ADR-0035](0035-arm64-worker-strategy.md) - ARM64 Worker Strategy
* Related: [ADR-0011](0011-kuberay-unified-serving.md) - KubeRay Unified Serving