daviestechlabs/homelab-design

Billy D. 100ba21eba

updates to adrs and fixing to reflect go refactor.

2026-02-23 06:14:30 -05:00

🏠 DaviesTechLabs Homelab Architecture

Production-grade AI/ML platform running on bare-metal Kubernetes

📖 Quick Navigation

Document	Purpose
AGENT-ONBOARDING.md	Start here if you're an AI agent
ARCHITECTURE.md	High-level system overview
TECH-STACK.md	Complete technology stack
DOMAIN-MODEL.md	Core entities and bounded contexts
GLOSSARY.md	Terminology reference
decisions/	Architecture Decision Records (ADRs)

🎯 What This Is

A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:

AI/ML Platform: KServe inference services, RAG pipelines, voice assistants
Multi-GPU Support: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
GitOps: Flux CD with SOPS encryption
Event-Driven: NATS JetStream for real-time messaging
ML Workflows: Kubeflow Pipelines + Argo Workflows

🖥️ Cluster Overview

Node	Role	Hardware	GPU
storm	Control Plane	Intel 13th Gen	Integrated
bruenor	Control Plane	Intel 13th Gen	Integrated
catti	Control Plane	Intel 13th Gen	Integrated
elminster	Worker	NVIDIA RTX 2070	8GB CUDA
khelben	Worker (vLLM)	AMD Strix Halo	64GB Unified
drizzt	Worker	AMD Radeon 680M	12GB RDNA2
danilo	Worker	Intel Core Ultra 9	Intel Arc
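
When scripting against the cluster, it can help to have the table above as data. The following is a minimal sketch that uses only the rows of that table. The `Node` class and the `workers_with` helper are hypothetical names chosen for illustration and do not come from any repository listed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str
    hardware: str
    gpu: str


# Rows copied verbatim from the Cluster Overview table.
NODES = [
    Node("storm", "Control Plane", "Intel 13th Gen", "Integrated"),
    Node("bruenor", "Control Plane", "Intel 13th Gen", "Integrated"),
    Node("catti", "Control Plane", "Intel 13th Gen", "Integrated"),
    Node("elminster", "Worker", "NVIDIA RTX 2070", "8GB CUDA"),
    Node("khelben", "Worker (vLLM)", "AMD Strix Halo", "64GB Unified"),
    Node("drizzt", "Worker", "AMD Radeon 680M", "12GB RDNA2"),
    Node("danilo", "Worker", "Intel Core Ultra 9", "Intel Arc"),
]


def workers_with(vendor: str) -> list[str]:
    """Return names of worker nodes whose Hardware or GPU column mentions vendor."""
    needle = vendor.lower()
    return [
        n.name
        for n in NODES
        if n.role.startswith("Worker")
        and (needle in n.hardware.lower() or needle in n.gpu.lower())
    ]
```

For example, `workers_with("AMD")` selects `khelben` and `drizzt`. Control-plane nodes are excluded even though their hardware column mentions Intel.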

🚀 Quick Start

View Current Cluster State

# Get node status
kubectl get nodes -o wide

# View AI/ML workloads
kubectl get pods -n ai-ml

# Check KServe inference services
kubectl get inferenceservices -n ai-ml
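
The node status command can also be consumed programmatically. The sketch below assumes `kubectl` is installed and pointed at the cluster. It reads the JSON form of `kubectl get nodes` and reports each node's `Ready` condition. The function names `ready_nodes` and `fetch_nodes` are illustrative only.

```python
import json
import subprocess


def ready_nodes(node_list: dict) -> dict[str, bool]:
    """Map node name -> True when its Ready condition has status "True"."""
    result = {}
    for item in node_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        result[item["metadata"]["name"]] = ready
    return result


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json`; needs kubectl and a valid kubeconfig."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

On a machine with cluster access, `ready_nodes(fetch_nodes())` returns one entry per node. Because parsing is kept separate from the `kubectl` call, it can be exercised with a saved JSON document when no cluster is reachable.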

Key Endpoints

Service	URL	Purpose
Kubeflow	`kubeflow.lab.daviestechlabs.io`	ML Pipeline UI
Companions	`companions-chat.lab.daviestechlabs.io`	AI Chat Interface
Voice	`voice.lab.daviestechlabs.io`	Voice Assistant
Gitea	`git.daviestechlabs.io`	Self-hosted Git
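
A quick reachability check for these endpoints might look like the sketch below. It assumes the hosts are served over HTTPS and that the machine running it can resolve them. Neither assumption is stated in this document, and `endpoint_urls` and `probe` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Optional

# Hostnames copied from the Key Endpoints table.
ENDPOINTS = {
    "Kubeflow": "kubeflow.lab.daviestechlabs.io",
    "Companions": "companions-chat.lab.daviestechlabs.io",
    "Voice": "voice.lab.daviestechlabs.io",
    "Gitea": "git.daviestechlabs.io",
}


def endpoint_urls(scheme: str = "https") -> dict[str, str]:
    """Build a URL per service; the https scheme is an assumption."""
    return {name: f"{scheme}://{host}/" for name, host in ENDPOINTS.items()}


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the host cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a 2xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, TLS error
```

Looping over `endpoint_urls().items()` and calling `probe` on each URL gives a rough up/down view. A non-2xx code still shows that the gateway answered.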

📂 Repository Structure

homelab-design/
├── README.md                          # This file
├── AGENT-ONBOARDING.md                # AI agent quick-start
├── ARCHITECTURE.md                    # High-level system overview
├── CONTEXT-DIAGRAM.mmd                # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd              # C4 Level 2
├── TECH-STACK.md                      # Complete tech stack
├── DOMAIN-MODEL.md                    # Core entities
├── CODING-CONVENTIONS.md              # Patterns & practices
├── GLOSSARY.md                        # Terminology
├── decisions/                         # Architecture Decision Records
├── specs/                             # Feature specifications
└── diagrams/                          # Additional diagrams

Architecture Decision Records

#	Decision	Status	Date
0001	Record Architecture Decisions	✅ accepted	2025-11-30
0002	Use Talos Linux for Kubernetes Nodes	✅ accepted	2025-11-30
0003	Use NATS for AI/ML Messaging	✅ accepted	2025-12-01
0004	Use MessagePack for NATS Messages	♻️ superseded by ADR-0061 (Protocol Buffers)	2025-12-01
0005	Multi-GPU Heterogeneous Strategy	✅ accepted	2025-12-01
0006	GitOps with Flux CD	✅ accepted	2025-11-30
0007	Use KServe for ML Model Serving	♻️ superseded by ADR-0011	2025-12-15 (Updated: 2026-02-02)
0008	Use Milvus for Vector Storage	✅ accepted	2025-12-15
0009	Dual Workflow Engine Strategy (Argo + Kubeflow)	✅ accepted	2026-01-15
0010	Use Envoy Gateway for Ingress	✅ accepted	2025-12-01
0011	Use KubeRay as Unified GPU Backend	✅ accepted	2026-02-02
0012	Use uv for Python Development, pip for Docker Builds	✅ accepted	2026-02-02
0013	Use Gitea Actions for CI/CD	✅ accepted	2026-02-02
0014	Docker Build Best Practices	✅ accepted	2026-02-02
0015	CI Notifications and Semantic Versioning	✅ accepted	2026-02-02
0016	Affine Email Verification Strategy for Authentik OIDC	✅ accepted	2026-02-04
0017	Secrets Management Strategy	✅ accepted	2026-02-04
0018	Security Policy Enforcement	✅ accepted	2026-02-04
0019	Python Module Deployment Strategy	✅ accepted	2026-02-02
0020	Internal Registry URLs for CI/CD	✅ accepted	2026-02-02
0021	Notification Architecture	✅ accepted	2026-02-04
0022	ntfy-Discord Bridge Service	✅ accepted	2026-02-04
0023	Valkey for ML Inference Caching	✅ accepted	2026-02-04
0024	Ray Repository Structure	✅ accepted	2026-02-03
0025	Observability Stack Architecture	✅ accepted	2026-02-04
0026	Tiered Storage Strategy: Longhorn + NFS	✅ accepted	2026-02-04
0027	Database Strategy with CloudNativePG	✅ accepted	2026-02-04
0028	Authentik Single Sign-On Strategy	✅ accepted	2026-02-04
0029	Authentik User Registration and Approval Workflow	✅ accepted	2026-02-04
0030	MFA and Yubikey Strategy	✅ accepted	2026-02-04
0031	Gitea CI/CD Pipeline Strategy	✅ accepted	2026-02-04
0032	Velero Backup and Disaster Recovery Strategy	✅ accepted	2026-02-05
0033	Data Analytics Platform Architecture	✅ accepted	2026-02-05
0034	Volcano Batch Scheduling Strategy	✅ accepted	2026-02-05
0035	ARM64 Raspberry Pi Worker Node Strategy	✅ accepted	2026-02-05
0036	Automated Dependency Updates with Renovate	✅ accepted	2026-02-05
0037	Node Naming Conventions	✅ accepted	2026-02-05
0038	Infrastructure Metrics Collection Strategy	✅ accepted	2026-02-09
0039	Alerting and Notification Pipeline	✅ accepted	2026-02-09
0040	OPA Gatekeeper Policy Framework	✅ accepted	2026-02-09
0041	Falco Runtime Threat Detection	✅ accepted	2026-02-09
0042	Trivy Operator Vulnerability Scanning	✅ accepted	2026-02-09
0043	Cilium CNI and Network Fabric	✅ accepted	2026-02-09
0044	DNS and External Access Architecture	✅ accepted	2026-02-09
0045	TLS Certificate Strategy	✅ accepted	2026-02-09
0046	Companions Frontend Architecture	✅ accepted	2026-02-09
0047	MLflow Experiment Tracking and Model Registry	✅ accepted	2026-02-09
0048	Entertainment and Media Stack	✅ accepted	2026-02-09
0049	Self-Hosted Productivity Suite	✅ accepted	2026-02-09
0050	Argo Rollouts Progressive Delivery	✅ accepted	2026-02-09
0051	KEDA Event-Driven Autoscaling	✅ accepted	2026-02-09
0052	Cluster Utilities and Optimization	✅ accepted	2026-02-09
0053	Vaultwarden Password Management	✅ accepted	2026-02-09
0054	Kubeflow Pipeline CI/CD	✅ accepted	2026-02-13
0055	Internal Python Package Publishing	✅ accepted	2026-02-13
0056	Custom Trained Voice Support in TTS Module	✅ accepted	2026-02-13
0057	Per-Repository Renovate Configurations	✅ accepted	2026-02-13
0058	Training Strategy – Distributed CPU Now, DGX Spark Later	✅ accepted	2026-02-14
0059	Mac Mini M4 Pro (waterdeep) as Local AI Agent for 3D Avatar Creation	📝 proposed	2026-02-16
0060	Internal PKI with Vault and cert-manager	✅ accepted	2026-02-16
0061	Refactor NATS Handler Services from Python to Go	✅ accepted	2026-02-19
0062	BlenderMCP for 3D Avatar Creation via Kasm Workstation	📝 proposed	2026-02-21
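
The index above is tab-separated, so it is easy to summarize. The sketch below tallies ADRs by status. It is not the repository's index-update workflow, only an illustration, and `parse_adr_row` and `status_counts` are hypothetical names.

```python
from collections import Counter


def parse_adr_row(row: str) -> dict[str, str]:
    """Split one tab-separated index row into number, title, status, date."""
    number, title, status, date = row.split("\t")
    return {"number": number, "title": title, "status": status, "date": date}


def status_counts(rows: list[str]) -> Counter:
    """Count ADRs by the word following the status emoji."""
    counts = Counter()
    for row in rows:
        status = parse_adr_row(row)["status"]
        # "✅ accepted" -> "accepted"; "♻️ superseded by ADR-0011" -> "superseded"
        counts[status.split()[1]] += 1
    return counts


# Three rows copied from the index above.
SAMPLE_ROWS = [
    "0003\tUse NATS for AI/ML Messaging\t✅ accepted\t2025-12-01",
    "0004\tUse MessagePack for NATS Messages\t♻️ superseded by ADR-0061 (Protocol Buffers)\t2025-12-01",
    "0062\tBlenderMCP for 3D Avatar Creation via Kasm Workstation\t📝 proposed\t2026-02-21",
]
```

Run over all 62 rows of the index as listed here, the tally would be 58 accepted, 2 superseded (0004, 0007), and 2 proposed (0059, 0062).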

🔗 Related Repositories

Repository	Purpose
homelab-k8s2	Kubernetes manifests, Flux GitOps
companions-frontend	Go web server, HTMX frontend

AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

The former monolithic llm-workflows repo has been archived and decomposed into:

Repository	Purpose
`handler-base`	Shared Python library for NATS handlers
`chat-handler`	Text chat with RAG pipeline
`voice-assistant`	Voice pipeline (STT → RAG → LLM → TTS)
`pipeline-bridge`	Bridge between pipelines and services
`stt-module`	Speech-to-text service
`tts-module`	Text-to-speech service
`ray-serve`	Ray Serve inference services
`kuberay-images`	GPU-specific Ray worker Docker images
`argo`	Argo Workflows (training, batch inference)
`kubeflow`	Kubeflow Pipeline definitions
`mlflow`	MLflow integration utilities
`gradio-ui`	Gradio demo apps (embeddings, STT, TTS)
`ntfy-discord`	ntfy → Discord notification bridge

📝 Contributing

1. For architecture changes, create an ADR in decisions/
2. Update relevant documentation
3. Submit a PR with context

Last updated: 2026-02-21
