🏠 DaviesTechLabs Homelab Architecture

Production-grade AI/ML platform running on bare-metal Kubernetes

📖 Quick Navigation

| Document | Purpose |
|----------|---------|
| AGENT-ONBOARDING.md | Start here if you're an AI agent |
| ARCHITECTURE.md | High-level system overview |
| TECH-STACK.md | Complete technology stack |
| DOMAIN-MODEL.md | Core entities and bounded contexts |
| GLOSSARY.md | Terminology reference |
| decisions/ | Architecture Decision Records (ADRs) |

🎯 What This Is

A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:

  • AI/ML Platform: KServe inference services, RAG pipelines, voice assistants
  • Multi-GPU Support: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
  • GitOps: Flux CD with SOPS encryption
  • Event-Driven: NATS JetStream for real-time messaging
  • ML Workflows: Kubeflow Pipelines + Argo Workflows

🖥️ Cluster Overview

| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |

🚀 Quick Start

View Current Cluster State

# Get node status
kubectl get nodes -o wide

# View AI/ML workloads
kubectl get pods -n ai-ml

# Check KServe inference services
kubectl get inferenceservices -n ai-ml

Key Endpoints

| Service | URL | Purpose |
|---------|-----|---------|
| Kubeflow | kubeflow.lab.daviestechlabs.io | ML Pipeline UI |
| Companions | companions-chat.lab.daviestechlabs.io | AI Chat Interface |
| Voice | voice.lab.daviestechlabs.io | Voice Assistant |
| Gitea | git.daviestechlabs.io | Self-hosted Git |

📂 Repository Structure

homelab-design/
├── README.md                          # This file
├── AGENT-ONBOARDING.md                # AI agent quick-start
├── ARCHITECTURE.md                    # High-level system overview
├── CONTEXT-DIAGRAM.mmd                # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd              # C4 Level 2
├── TECH-STACK.md                      # Complete tech stack
├── DOMAIN-MODEL.md                    # Core entities
├── CODING-CONVENTIONS.md              # Patterns & practices
├── GLOSSARY.md                        # Terminology
├── decisions/                         # Architecture Decision Records
├── specs/                             # Feature specifications
└── diagrams/                          # Additional diagrams

Architecture Decision Records

| # | Decision | Status | Date |
|---|----------|--------|------|
| 0001 | Record Architecture Decisions | accepted | 2025-11-30 |
| 0002 | Use Talos Linux for Kubernetes Nodes | accepted | 2025-11-30 |
| 0003 | Use NATS for AI/ML Messaging | accepted | 2025-12-01 |
| 0004 | Use MessagePack for NATS Messages | accepted | 2025-12-01 |
| 0005 | Multi-GPU Heterogeneous Strategy | accepted | 2025-12-01 |
| 0006 | GitOps with Flux CD | accepted | 2025-11-30 |
| 0007 | Use KServe for ML Model Serving | ♻️ superseded by ADR-0011 | 2025-12-15 (Updated: 2026-02-02) |
| 0008 | Use Milvus for Vector Storage | accepted | 2025-12-15 |
| 0009 | Dual Workflow Engine Strategy (Argo + Kubeflow) | accepted | 2026-01-15 |
| 0010 | Use Envoy Gateway for Ingress | accepted | 2025-12-01 |
| 0011 | Use KubeRay as Unified GPU Backend | accepted | 2026-02-02 |
| 0012 | Use uv for Python Development, pip for Docker Builds | accepted | 2026-02-02 |
| 0013 | Use Gitea Actions for CI/CD | accepted | 2026-02-02 |
| 0014 | Docker Build Best Practices | accepted | 2026-02-02 |
| 0015 | CI Notifications and Semantic Versioning | accepted | 2026-02-02 |
| 0016 | Affine Email Verification Strategy for Authentik OIDC | accepted | 2026-02-04 |
| 0017 | Secrets Management Strategy | accepted | 2026-02-04 |
| 0018 | Security Policy Enforcement | accepted | 2026-02-04 |
| 0019 | Python Module Deployment Strategy | accepted | 2026-02-02 |
| 0020 | Internal Registry URLs for CI/CD | accepted | 2026-02-02 |
| 0021 | Notification Architecture | accepted | 2026-02-04 |
| 0022 | ntfy-Discord Bridge Service | accepted | 2026-02-04 |
| 0023 | Valkey for ML Inference Caching | accepted | 2026-02-04 |
| 0024 | Ray Repository Structure | accepted | 2026-02-03 |
| 0025 | Observability Stack Architecture | accepted | 2026-02-04 |
| 0026 | Tiered Storage Strategy: Longhorn + NFS | accepted | 2026-02-04 |
| 0027 | Database Strategy with CloudNativePG | accepted | 2026-02-04 |
| 0028 | Authentik Single Sign-On Strategy | accepted | 2026-02-04 |
| 0029 | Authentik User Registration and Approval Workflow | accepted | 2026-02-04 |
| 0030 | MFA and Yubikey Strategy | accepted | 2026-02-04 |
| 0031 | Gitea CI/CD Pipeline Strategy | accepted | 2026-02-04 |
| 0032 | Velero Backup and Disaster Recovery Strategy | accepted | 2026-02-05 |
| 0033 | Data Analytics Platform Architecture | accepted | 2026-02-05 |
| 0034 | Volcano Batch Scheduling Strategy | accepted | 2026-02-05 |
| 0035 | ARM64 Raspberry Pi Worker Node Strategy | accepted | 2026-02-05 |
| 0036 | Automated Dependency Updates with Renovate | accepted | 2026-02-05 |
| 0037 | Node Naming Conventions | accepted | 2026-02-05 |
| 0038 | Infrastructure Metrics Collection Strategy | accepted | 2026-02-09 |
| 0039 | Alerting and Notification Pipeline | accepted | 2026-02-09 |
| 0040 | OPA Gatekeeper Policy Framework | accepted | 2026-02-09 |
| 0041 | Falco Runtime Threat Detection | accepted | 2026-02-09 |
| 0042 | Trivy Operator Vulnerability Scanning | accepted | 2026-02-09 |
| 0043 | Cilium CNI and Network Fabric | accepted | 2026-02-09 |
| 0044 | DNS and External Access Architecture | accepted | 2026-02-09 |
| 0045 | TLS Certificate Strategy | accepted | 2026-02-09 |
| 0046 | Companions Frontend Architecture | accepted | 2026-02-09 |
| 0047 | MLflow Experiment Tracking and Model Registry | accepted | 2026-02-09 |
| 0048 | Entertainment and Media Stack | accepted | 2026-02-09 |
| 0049 | Self-Hosted Productivity Suite | accepted | 2026-02-09 |
| 0050 | Argo Rollouts Progressive Delivery | accepted | 2026-02-09 |
| 0051 | KEDA Event-Driven Autoscaling | accepted | 2026-02-09 |
| 0052 | Cluster Utilities and Optimization | accepted | 2026-02-09 |
| 0053 | Vaultwarden Password Management | accepted | 2026-02-09 |
| 0054 | Kubeflow Pipeline CI/CD | accepted | 2026-02-13 |
| 0055 | Internal Python Package Publishing | accepted | 2026-02-13 |
| 0056 | Custom Trained Voice Support in TTS Module | accepted | 2026-02-13 |
| 0057 | Per-Repository Renovate Configurations | accepted | 2026-02-13 |
| 0058 | Training Strategy Distributed CPU Now, DGX Spark Later | accepted | 2026-02-14 |
| 0059 | Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker | 📝 proposed | 2026-02-16 |
| 0060 | Internal PKI with Vault and cert-manager | accepted | 2026-02-16 |
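
Keeping the accepted and proposed totals in step with this index by hand is error-prone, so a small script can help. The following is a minimal sketch, not part of this repository: `tally_statuses` is a hypothetical helper, and the `deprecated` and `rejected` keywords are assumed from common ADR practice rather than taken from this index. It accepts rows either as plain text or as pipe-delimited table rows.

```python
import re
from collections import Counter

# Status keywords. "accepted", "proposed" and "superseded" appear in the
# index above; "deprecated" and "rejected" are assumptions.
STATUS = re.compile(r"\b(accepted|proposed|superseded|deprecated|rejected)\b")
# An ADR row starts with a four-digit number, optionally after a table pipe.
ADR_ROW = re.compile(r"^\|?\s*(\d{4})\b")


def tally_statuses(index_text: str) -> Counter:
    """Count ADR rows per status keyword."""
    counts = Counter()
    for line in index_text.splitlines():
        if not ADR_ROW.match(line):
            continue
        found = STATUS.findall(line)
        if found:
            # Use the last keyword so a lowercase word in a title cannot win.
            counts[found[-1]] += 1
    return counts


sample = """\
0006 GitOps with Flux CD accepted 2025-11-30
0007 Use KServe for ML Model Serving ♻️ superseded by ADR-0011 2025-12-15 (Updated: 2026-02-02)
0059 Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker 📝 proposed 2026-02-16
| 0060 | Internal PKI with Vault and cert-manager | accepted | 2026-02-16 |
"""
print(dict(tally_statuses(sample)))
```

Matching on a status keyword rather than on column position keeps the sketch usable whether or not the index is laid out as a table.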

| Repository | Purpose |
|------------|---------|
| homelab-k8s2 | Kubernetes manifests, Flux GitOps |
| companions-frontend | Go web server, HTMX frontend |

AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

The former monolithic llm-workflows repo has been archived and decomposed into:

| Repository | Purpose |
|------------|---------|
| handler-base | Shared Python library for NATS handlers |
| chat-handler | Text chat with RAG pipeline |
| voice-assistant | Voice pipeline (STT → RAG → LLM → TTS) |
| pipeline-bridge | Bridge between pipelines and services |
| stt-module | Speech-to-text service |
| tts-module | Text-to-speech service |
| ray-serve | Ray Serve inference services |
| kuberay-images | GPU-specific Ray worker Docker images |
| argo | Argo Workflows (training, batch inference) |
| kubeflow | Kubeflow Pipeline definitions |
| mlflow | MLflow integration utilities |
| gradio-ui | Gradio demo apps (embeddings, STT, TTS) |
| ntfy-discord | ntfy → Discord notification bridge |

📝 Contributing

  1. For architecture changes, create an ADR in decisions/
  2. Update relevant documentation
  3. Submit a PR with context
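
Step 1 involves picking the next free ADR number. The sketch below shows one way to derive it, assuming ADR files in decisions/ are named `NNNN-slug.md`. That naming pattern is an assumption, not something this README states, and `next_adr_path` is a hypothetical helper rather than an existing tool in this repository.

```python
import re
import tempfile
from pathlib import Path


def next_adr_path(decisions_dir: Path, title: str) -> Path:
    """Return the path for the next ADR, assuming NNNN-slug.md filenames."""
    numbers = [
        int(m.group(1))
        for p in decisions_dir.glob("*.md")
        if (m := re.match(r"(\d{4})-", p.name))
    ]
    next_number = max(numbers, default=0) + 1
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return decisions_dir / f"{next_number:04d}-{slug}.md"


# Demonstration against a throwaway directory instead of the real decisions/.
with tempfile.TemporaryDirectory() as tmp:
    decisions = Path(tmp)
    (decisions / "0059-add-mac-mini.md").touch()
    (decisions / "0060-internal-pki.md").touch()
    print(next_adr_path(decisions, "Example Decision").name)
    # prints 0061-example-decision.md
```

The function only computes a path and never writes a file, so it is safe to run before deciding on the final title.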

Last updated: 2026-02-17
