# 🏠 DaviesTechLabs Homelab Architecture
> **Production-grade AI/ML platform running on bare-metal Kubernetes**

[![Talos](https://img.shields.io/badge/Talos-v1.12.1-blue?logo=linux)](https://talos.dev)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-v1.35.0-326CE5?logo=kubernetes)](https://kubernetes.io)
[![Flux](https://img.shields.io/badge/GitOps-Flux-blue?logo=flux)](https://fluxcd.io)
[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
<!-- ADR-BADGES-START -->
![ADR Count](https://img.shields.io/badge/ADRs-58_total-blue?logo=bookstack) ![Accepted](https://img.shields.io/badge/accepted-57-brightgreen) ![Proposed](https://img.shields.io/badge/proposed-0-yellow)
<!-- ADR-BADGES-END -->
## 📖 Quick Navigation
| Document | Purpose |
|----------|---------|
| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
| [decisions/](decisions/) | Architecture Decision Records (ADRs) |
## 🎯 What This Is
Architecture documentation for the DaviesTechLabs homelab Kubernetes cluster. The cluster provides:
- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
- **GitOps**: Flux CD with SOPS encryption
- **Event-Driven**: NATS JetStream for real-time messaging
- **ML Workflows**: Kubeflow Pipelines + Argo Workflows
## 🖥️ Cluster Overview
| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |
## 🚀 Quick Start
### View Current Cluster State
```bash
# Get node status
kubectl get nodes -o wide
# View AI/ML workloads
kubectl get pods -n ai-ml
# Check KServe inference services
kubectl get inferenceservices -n ai-ml
```
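Because the cluster is managed with Flux CD, reconciliation status is a useful companion to the pod-level view above. The following is a minimal sketch, assuming the `flux` CLI is installed and points at the same kubeconfig as `kubectl`; the function name `gitops_status` is hypothetical and not part of this repository.

```bash
# Hypothetical helper, not part of the repository.
# Assumes the `flux` CLI is installed and uses the same kubeconfig as kubectl.
gitops_status() {
  # Kustomizations in every namespace, with their ready state
  flux get kustomizations --all-namespaces
  # HelmReleases in every namespace
  flux get helmreleases --all-namespaces
}
```

Call `gitops_status` after sourcing the snippet; a resource that is not ready usually points at the manifest that needs attention.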
### Key Endpoints
| Service | Hostname | Purpose |
|---------|-----|---------|
| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
| Gitea | `git.daviestechlabs.io` | Self-hosted Git |
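
A quick reachability check for the hosts in the table can be sketched as below. It assumes each host is served over HTTPS and is reachable from the machine running the check (the `lab` hosts may only resolve on the internal network); the function name `check_endpoints` is hypothetical.

```bash
# Hypothetical helper, not part of the repository.
# Assumes each hostname is served over HTTPS and reachable from your network.
check_endpoints() {
  for host in "$@"; do
    # Print only the HTTP status code; curl reports "000" when no response arrives
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "https://${host}/" || true)"
    printf '%-45s %s\n' "$host" "$code"
  done
}
```

For example: `check_endpoints kubeflow.lab.daviestechlabs.io git.daviestechlabs.io`. A redirect or an authentication status code still shows that the gateway answered.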
## 📂 Repository Structure
```
homelab-design/
├── README.md # This file
├── AGENT-ONBOARDING.md # AI agent quick-start
├── ARCHITECTURE.md # High-level system overview
├── CONTEXT-DIAGRAM.mmd # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd # C4 Level 2
├── TECH-STACK.md # Complete tech stack
├── DOMAIN-MODEL.md # Core entities
├── CODING-CONVENTIONS.md # Patterns & practices
├── GLOSSARY.md # Terminology
├── decisions/ # Architecture Decision Records
├── specs/ # Feature specifications
└── diagrams/ # Additional diagrams
```
### Architecture Decision Records
<!-- ADR-TABLE-START -->
| # | Decision | Status | Date |
|---|----------|--------|------|
| 0001 | [Record Architecture Decisions](decisions/0001-record-architecture-decisions.md) | ✅ accepted | 2025-11-30 |
| 0002 | [Use Talos Linux for Kubernetes Nodes](decisions/0002-use-talos-linux.md) | ✅ accepted | 2025-11-30 |
| 0003 | [Use NATS for AI/ML Messaging](decisions/0003-use-nats-for-messaging.md) | ✅ accepted | 2025-12-01 |
| 0004 | [Use MessagePack for NATS Messages](decisions/0004-use-messagepack-for-nats.md) | ✅ accepted | 2025-12-01 |
| 0005 | [Multi-GPU Heterogeneous Strategy](decisions/0005-multi-gpu-strategy.md) | ✅ accepted | 2025-12-01 |
| 0006 | [GitOps with Flux CD](decisions/0006-gitops-with-flux.md) | ✅ accepted | 2025-11-30 |
| 0007 | [Use KServe for ML Model Serving](decisions/0007-use-kserve-for-inference.md) | ♻️ superseded by [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) | 2025-12-15 (Updated: 2026-02-02) |
| 0008 | [Use Milvus for Vector Storage](decisions/0008-use-milvus-for-vectors.md) | ✅ accepted | 2025-12-15 |
| 0009 | [Dual Workflow Engine Strategy (Argo + Kubeflow)](decisions/0009-dual-workflow-engines.md) | ✅ accepted | 2026-01-15 |
| 0010 | [Use Envoy Gateway for Ingress](decisions/0010-use-envoy-gateway.md) | ✅ accepted | 2025-12-01 |
| 0011 | [Use KubeRay as Unified GPU Backend](decisions/0011-kuberay-unified-gpu-backend.md) | ✅ accepted | 2026-02-02 |
| 0012 | [Use uv for Python Development, pip for Docker Builds](decisions/0012-use-uv-for-python-development.md) | ✅ accepted | 2026-02-02 |
| 0013 | [Use Gitea Actions for CI/CD](decisions/0013-gitea-actions-for-ci.md) | ✅ accepted | 2026-02-02 |
| 0014 | [Docker Build Best Practices](decisions/0014-docker-build-best-practices.md) | ✅ accepted | 2026-02-02 |
| 0015 | [CI Notifications and Semantic Versioning](decisions/0015-ci-notifications-and-semantic-versioning.md) | ✅ accepted | 2026-02-02 |
| 0016 | [Affine Email Verification Strategy for Authentik OIDC](decisions/0016-affine-email-verification-strategy.md) | ✅ accepted | 2026-02-04 |
| 0017 | [Secrets Management Strategy](decisions/0017-secrets-management-strategy.md) | ✅ accepted | 2026-02-04 |
| 0018 | [Security Policy Enforcement](decisions/0018-security-policy-enforcement.md) | ✅ accepted | 2026-02-04 |
| 0019 | [Python Module Deployment Strategy](decisions/0019-handler-deployment-strategy.md) | ✅ accepted | 2026-02-02 |
| 0020 | [Internal Registry URLs for CI/CD](decisions/0020-internal-registry-for-cicd.md) | ✅ accepted | 2026-02-02 |
| 0021 | [Notification Architecture](decisions/0021-notification-architecture.md) | ✅ accepted | 2026-02-04 |
| 0022 | [ntfy-Discord Bridge Service](decisions/0022-ntfy-discord-bridge.md) | ✅ accepted | 2026-02-04 |
| 0023 | [Valkey for ML Inference Caching](decisions/0023-valkey-ml-caching.md) | ✅ accepted | 2026-02-04 |
| 0024 | [Ray Repository Structure](decisions/0024-ray-repository-structure.md) | ✅ accepted | 2026-02-03 |
| 0025 | [Observability Stack Architecture](decisions/0025-observability-stack.md) | ✅ accepted | 2026-02-04 |
| 0026 | [Tiered Storage Strategy: Longhorn + NFS](decisions/0026-storage-strategy.md) | ✅ accepted | 2026-02-04 |
| 0027 | [Database Strategy with CloudNativePG](decisions/0027-database-strategy.md) | ✅ accepted | 2026-02-04 |
| 0028 | [Authentik Single Sign-On Strategy](decisions/0028-authentik-sso-strategy.md) | ✅ accepted | 2026-02-04 |
| 0029 | [Authentik User Registration and Approval Workflow](decisions/0029-authentik-user-registration-workflow.md) | ✅ accepted | 2026-02-04 |
| 0030 | [MFA and Yubikey Strategy](decisions/0030-mfa-yubikey-strategy.md) | ✅ accepted | 2026-02-04 |
| 0031 | [Gitea CI/CD Pipeline Strategy](decisions/0031-gitea-cicd-strategy.md) | ✅ accepted | 2026-02-04 |
| 0032 | [Velero Backup and Disaster Recovery Strategy](decisions/0032-velero-backup-strategy.md) | ✅ accepted | 2026-02-05 |
| 0033 | [Data Analytics Platform Architecture](decisions/0033-data-analytics-platform.md) | ✅ accepted | 2026-02-05 |
| 0034 | [Volcano Batch Scheduling Strategy](decisions/0034-volcano-batch-scheduling.md) | ✅ accepted | 2026-02-05 |
| 0035 | [ARM64 Raspberry Pi Worker Node Strategy](decisions/0035-arm64-worker-strategy.md) | ✅ accepted | 2026-02-05 |
| 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | ✅ accepted | 2026-02-05 |
| 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | ✅ accepted | 2026-02-05 |
| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | ✅ accepted | 2026-02-09 |
| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | ✅ accepted | 2026-02-09 |
| 0040 | [OPA Gatekeeper Policy Framework](decisions/0040-opa-gatekeeper-policy-framework.md) | ✅ accepted | 2026-02-09 |
| 0041 | [Falco Runtime Threat Detection](decisions/0041-falco-runtime-threat-detection.md) | ✅ accepted | 2026-02-09 |
| 0042 | [Trivy Operator Vulnerability Scanning](decisions/0042-trivy-operator-vulnerability-scanning.md) | ✅ accepted | 2026-02-09 |
| 0043 | [Cilium CNI and Network Fabric](decisions/0043-cilium-cni-network-fabric.md) | ✅ accepted | 2026-02-09 |
| 0044 | [DNS and External Access Architecture](decisions/0044-dns-and-external-access.md) | ✅ accepted | 2026-02-09 |
| 0045 | [TLS Certificate Strategy](decisions/0045-tls-certificate-strategy.md) | ✅ accepted | 2026-02-09 |
| 0046 | [Companions Frontend Architecture](decisions/0046-companions-frontend-architecture.md) | ✅ accepted | 2026-02-09 |
| 0047 | [MLflow Experiment Tracking and Model Registry](decisions/0047-mlflow-experiment-tracking.md) | ✅ accepted | 2026-02-09 |
| 0048 | [Entertainment and Media Stack](decisions/0048-entertainment-media-stack.md) | ✅ accepted | 2026-02-09 |
| 0049 | [Self-Hosted Productivity Suite](decisions/0049-self-hosted-productivity-suite.md) | ✅ accepted | 2026-02-09 |
| 0050 | [Argo Rollouts Progressive Delivery](decisions/0050-argo-rollouts-progressive-delivery.md) | ✅ accepted | 2026-02-09 |
| 0051 | [KEDA Event-Driven Autoscaling](decisions/0051-keda-event-driven-autoscaling.md) | ✅ accepted | 2026-02-09 |
| 0052 | [Cluster Utilities and Optimization](decisions/0052-cluster-utilities-optimization.md) | ✅ accepted | 2026-02-09 |
| 0053 | [Vaultwarden Password Management](decisions/0053-vaultwarden-password-management.md) | ✅ accepted | 2026-02-09 |
| 0054 | [Kubeflow Pipeline CI/CD](decisions/0054-kubeflow-pipeline-cicd.md) | ✅ accepted | 2026-02-13 |
| 0055 | [Internal Python Package Publishing](decisions/0055-internal-python-package-publishing.md) | ✅ accepted | 2026-02-13 |
| 0056 | [Custom Trained Voice Support in TTS Module](decisions/0056-custom-voice-support-tts-module.md) | ✅ accepted | 2026-02-13 |
| 0057 | [Per-Repository Renovate Configurations](decisions/0057-renovate-per-repo-configs.md) | ✅ accepted | 2026-02-13 |
| 0058 | [Training Strategy: Distributed CPU Now, DGX Spark Later](decisions/0058-training-strategy-cpu-dgx-spark.md) | ✅ accepted | 2026-02-14 |
<!-- ADR-TABLE-END -->
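
The status column above can be tallied and compared against the ADR badges at the top of this file. The sketch below is one way to do that with `awk`; it is not the tooling that generates the table or badges, and the function name `adr_status_counts` is hypothetical. It assumes the `ADR-TABLE-START` / `ADR-TABLE-END` markers and the four-column row layout shown above.

```bash
# Hypothetical helper, not the generator used by this repository.
# Counts ADR rows per status between the ADR-TABLE markers of a Markdown file.
adr_status_counts() {
  awk -F'|' '
    /ADR-TABLE-START/ { in_table = 1; next }
    /ADR-TABLE-END/   { in_table = 0 }
    # Data rows carry a four-digit ADR number in the first column
    in_table && $2 ~ /^ *[0-9][0-9][0-9][0-9] *$/ {
      split($4, words, " ")   # status cell, e.g. icon then "accepted"
      count[words[2]]++
      total++
    }
    END {
      for (status in count) print status, count[status]
      print "total", total
    }
  ' "$1" | sort
}
```

Run it as `adr_status_counts README.md`; each output line is a status word followed by its count, plus a final `total` line.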
## 🔗 Related Repositories
| Repository | Purpose |
|------------|---------|
| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |
### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)
The former monolithic `llm-workflows` repo has been archived and decomposed into:

| Repository | Purpose |
|------------|--------|
| `handler-base` | Shared Python library for NATS handlers |
| `chat-handler` | Text chat with RAG pipeline |
| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
| `pipeline-bridge` | Bridge between pipelines and services |
| `stt-module` | Speech-to-text service |
| `tts-module` | Text-to-speech service |
| `ray-serve` | Ray Serve inference services |
| `kuberay-images` | GPU-specific Ray worker Docker images |
| `argo` | Argo Workflows (training, batch inference) |
| `kubeflow` | Kubeflow Pipeline definitions |
| `mlflow` | MLflow integration utilities |
| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
| `ntfy-discord` | ntfy → Discord notification bridge |
## 📝 Contributing
1. For architecture changes, create an ADR in `decisions/`
2. Update relevant documentation
3. Submit a PR with context
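
For step 1, ADR files in `decisions/` follow the `NNNN-short-title.md` pattern visible in the table above. A small sketch for finding the next free number is shown below; the function name `next_adr_number` is hypothetical, and the repository may already have its own tooling for this.

```bash
# Hypothetical helper, not part of the repository.
# Prints the next zero-padded ADR number for a directory of NNNN-*.md files.
next_adr_number() {
  dir="${1:-decisions}"
  max=0
  for f in "$dir"/[0-9][0-9][0-9][0-9]-*.md; do
    [ -e "$f" ] || continue            # the glob matched nothing
    # Take the four-digit prefix and drop leading zeros (avoids octal parsing)
    n="$(basename "$f" | cut -c1-4 | sed 's/^0*//')"
    n="${n:-0}"
    if [ "$n" -gt "$max" ]; then
      max="$n"
    fi
  done
  printf '%04d\n' "$((max + 1))"
}
```

Usage, with a placeholder title: `touch "decisions/$(next_adr_number)-my-new-decision.md"`.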
---
*Last updated: 2026-02-15*