# 🏠 DaviesTechLabs Homelab Architecture

> **Production-grade AI/ML platform running on bare-metal Kubernetes**

[Talos](https://talos.dev)
[Kubernetes](https://kubernetes.io)
[Flux](https://fluxcd.io)
[License](LICENSE)

<!-- ADR-BADGES-START -->

<!-- ADR-BADGES-END -->
## 📖 Quick Navigation

| Document | Purpose |
|----------|---------|
| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
| [decisions/](decisions/) | Architecture Decision Records (ADRs) |
## 🎯 What This Is

Architecture documentation for the DaviesTechLabs homelab Kubernetes cluster. The platform it describes includes:

- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
- **GitOps**: Flux CD with SOPS encryption
- **Event-Driven**: NATS JetStream for real-time messaging
- **ML Workflows**: Kubeflow Pipelines + Argo Workflows
## 🖥️ Cluster Overview

| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |
## 🚀 Quick Start

### View Current Cluster State

```bash
# Get node status
kubectl get nodes -o wide

# View AI/ML workloads
kubectl get pods -n ai-ml

# Check KServe inference services
kubectl get inferenceservices -n ai-ml
```
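
The three checks above can be wrapped in a single function. This is only a sketch: the name `cluster_status` and its optional namespace argument are illustrative and not part of this repository, and it assumes `kubectl` is on the `PATH` with a context pointing at the cluster.

```bash
# Hypothetical helper (not part of this repo): runs the three status checks
# from this section against one namespace, defaulting to ai-ml.
cluster_status() {
  local ns="${1:-ai-ml}"

  # Fail early with a readable message if kubectl is unavailable.
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found on PATH" >&2
    return 1
  fi

  echo "== Nodes =="
  kubectl get nodes -o wide

  echo "== Pods in ${ns} =="
  kubectl get pods -n "${ns}"

  echo "== InferenceServices in ${ns} =="
  kubectl get inferenceservices -n "${ns}"
}
```

Calling `cluster_status` with no argument checks the `ai-ml` namespace; passing another namespace name checks that one instead.
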
### Key Endpoints

| Service | URL | Purpose |
|---------|-----|---------|
| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
| Gitea | `git.daviestechlabs.io` | Self-hosted Git |
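
A quick reachability check for the endpoints listed above might look like the sketch below. The hostnames come from the table; everything else is an assumption made for illustration: that the services answer over `https://` on the default port, the function names `probe_endpoint` and `probe_all`, and the 5-second timeout.

```bash
# Hostnames copied from the Key Endpoints table; the https:// scheme is assumed.
ENDPOINTS="kubeflow.lab.daviestechlabs.io
companions-chat.lab.daviestechlabs.io
voice.lab.daviestechlabs.io
git.daviestechlabs.io"

# probe_endpoint HOST: print "HOST <status>", where <status> is the HTTP
# status code, or "unreachable" if curl could not complete the request.
probe_endpoint() {
  local host="$1"
  local code
  if code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "https://${host}/")"; then
    echo "${host} ${code}"
  else
    echo "${host} unreachable"
  fi
}

# probe_all: probe every hostname in ENDPOINTS, one result per line.
probe_all() {
  local host
  for host in $ENDPOINTS; do
    probe_endpoint "$host"
  done
}
```

Run `probe_all` from a machine that can resolve these names. A redirect or an authentication status code still shows that the service is answering.
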
## 📂 Repository Structure

```
homelab-design/
├── README.md                # This file
├── AGENT-ONBOARDING.md      # AI agent quick-start
├── ARCHITECTURE.md          # High-level system overview
├── CONTEXT-DIAGRAM.mmd      # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd    # C4 Level 2
├── TECH-STACK.md            # Complete tech stack
├── DOMAIN-MODEL.md          # Core entities
├── CODING-CONVENTIONS.md    # Patterns & practices
├── GLOSSARY.md              # Terminology
├── decisions/               # Architecture Decision Records
├── specs/                   # Feature specifications
└── diagrams/                # Additional diagrams
```
### Architecture Decision Records

<!-- ADR-TABLE-START -->
| # | Decision | Status | Date |
|---|----------|--------|------|
| 0001 | [Record Architecture Decisions](decisions/0001-record-architecture-decisions.md) | ✅ accepted | 2025-11-30 |
| 0002 | [Use Talos Linux for Kubernetes Nodes](decisions/0002-use-talos-linux.md) | ✅ accepted | 2025-11-30 |
| 0003 | [Use NATS for AI/ML Messaging](decisions/0003-use-nats-for-messaging.md) | ✅ accepted | 2025-12-01 |
| 0004 | [Use MessagePack for NATS Messages](decisions/0004-use-messagepack-for-nats.md) | ✅ accepted | 2025-12-01 |
| 0005 | [Multi-GPU Heterogeneous Strategy](decisions/0005-multi-gpu-strategy.md) | ✅ accepted | 2025-12-01 |
| 0006 | [GitOps with Flux CD](decisions/0006-gitops-with-flux.md) | ✅ accepted | 2025-11-30 |
| 0007 | [Use KServe for ML Model Serving](decisions/0007-use-kserve-for-inference.md) | ♻️ superseded by [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) | 2025-12-15 (Updated: 2026-02-02) |
| 0008 | [Use Milvus for Vector Storage](decisions/0008-use-milvus-for-vectors.md) | ✅ accepted | 2025-12-15 |
| 0009 | [Dual Workflow Engine Strategy (Argo + Kubeflow)](decisions/0009-dual-workflow-engines.md) | ✅ accepted | 2026-01-15 |
| 0010 | [Use Envoy Gateway for Ingress](decisions/0010-use-envoy-gateway.md) | ✅ accepted | 2025-12-01 |
| 0011 | [Use KubeRay as Unified GPU Backend](decisions/0011-kuberay-unified-gpu-backend.md) | ✅ accepted | 2026-02-02 |
| 0012 | [Use uv for Python Development, pip for Docker Builds](decisions/0012-use-uv-for-python-development.md) | ✅ accepted | 2026-02-02 |
| 0013 | [Use Gitea Actions for CI/CD](decisions/0013-gitea-actions-for-ci.md) | ✅ accepted | 2026-02-02 |
| 0014 | [Docker Build Best Practices](decisions/0014-docker-build-best-practices.md) | ✅ accepted | 2026-02-02 |
| 0015 | [CI Notifications and Semantic Versioning](decisions/0015-ci-notifications-and-semantic-versioning.md) | ✅ accepted | 2026-02-02 |
| 0016 | [Affine Email Verification Strategy for Authentik OIDC](decisions/0016-affine-email-verification-strategy.md) | ✅ accepted | 2026-02-04 |
| 0017 | [Secrets Management Strategy](decisions/0017-secrets-management-strategy.md) | ✅ accepted | 2026-02-04 |
| 0018 | [Security Policy Enforcement](decisions/0018-security-policy-enforcement.md) | ✅ accepted | 2026-02-04 |
| 0019 | [Python Module Deployment Strategy](decisions/0019-handler-deployment-strategy.md) | ✅ accepted | 2026-02-02 |
| 0020 | [Internal Registry URLs for CI/CD](decisions/0020-internal-registry-for-cicd.md) | ✅ accepted | 2026-02-02 |
| 0021 | [Notification Architecture](decisions/0021-notification-architecture.md) | ✅ accepted | 2026-02-04 |
| 0022 | [ntfy-Discord Bridge Service](decisions/0022-ntfy-discord-bridge.md) | ✅ accepted | 2026-02-04 |
| 0023 | [Valkey for ML Inference Caching](decisions/0023-valkey-ml-caching.md) | ✅ accepted | 2026-02-04 |
| 0024 | [Ray Repository Structure](decisions/0024-ray-repository-structure.md) | ✅ accepted | 2026-02-03 |
| 0025 | [Observability Stack Architecture](decisions/0025-observability-stack.md) | ✅ accepted | 2026-02-04 |
| 0026 | [Tiered Storage Strategy: Longhorn + NFS](decisions/0026-storage-strategy.md) | ✅ accepted | 2026-02-04 |
| 0027 | [Database Strategy with CloudNativePG](decisions/0027-database-strategy.md) | ✅ accepted | 2026-02-04 |
| 0028 | [Authentik Single Sign-On Strategy](decisions/0028-authentik-sso-strategy.md) | ✅ accepted | 2026-02-04 |
| 0029 | [Authentik User Registration and Approval Workflow](decisions/0029-authentik-user-registration-workflow.md) | ✅ accepted | 2026-02-04 |
| 0030 | [MFA and Yubikey Strategy](decisions/0030-mfa-yubikey-strategy.md) | ✅ accepted | 2026-02-04 |
| 0031 | [Gitea CI/CD Pipeline Strategy](decisions/0031-gitea-cicd-strategy.md) | ✅ accepted | 2026-02-04 |
| 0032 | [Velero Backup and Disaster Recovery Strategy](decisions/0032-velero-backup-strategy.md) | ✅ accepted | 2026-02-05 |
| 0033 | [Data Analytics Platform Architecture](decisions/0033-data-analytics-platform.md) | ✅ accepted | 2026-02-05 |
| 0034 | [Volcano Batch Scheduling Strategy](decisions/0034-volcano-batch-scheduling.md) | ✅ accepted | 2026-02-05 |
| 0035 | [ARM64 Raspberry Pi Worker Node Strategy](decisions/0035-arm64-worker-strategy.md) | ✅ accepted | 2026-02-05 |
| 0036 | [Automated Dependency Updates with Renovate](decisions/0036-renovate-dependency-updates.md) | ✅ accepted | 2026-02-05 |
| 0037 | [Node Naming Conventions](decisions/0037-node-naming-conventions.md) | ✅ accepted | 2026-02-05 |
| 0038 | [Infrastructure Metrics Collection Strategy](decisions/0038-infrastructure-metrics-collection.md) | ✅ accepted | 2026-02-09 |
| 0039 | [Alerting and Notification Pipeline](decisions/0039-alerting-notification-pipeline.md) | ✅ accepted | 2026-02-09 |
| 0040 | [OPA Gatekeeper Policy Framework](decisions/0040-opa-gatekeeper-policy-framework.md) | ✅ accepted | 2026-02-09 |
| 0041 | [Falco Runtime Threat Detection](decisions/0041-falco-runtime-threat-detection.md) | ✅ accepted | 2026-02-09 |
| 0042 | [Trivy Operator Vulnerability Scanning](decisions/0042-trivy-operator-vulnerability-scanning.md) | ✅ accepted | 2026-02-09 |
| 0043 | [Cilium CNI and Network Fabric](decisions/0043-cilium-cni-network-fabric.md) | ✅ accepted | 2026-02-09 |
| 0044 | [DNS and External Access Architecture](decisions/0044-dns-and-external-access.md) | ✅ accepted | 2026-02-09 |
| 0045 | [TLS Certificate Strategy](decisions/0045-tls-certificate-strategy.md) | ✅ accepted | 2026-02-09 |
| 0046 | [Companions Frontend Architecture](decisions/0046-companions-frontend-architecture.md) | ✅ accepted | 2026-02-09 |
| 0047 | [MLflow Experiment Tracking and Model Registry](decisions/0047-mlflow-experiment-tracking.md) | ✅ accepted | 2026-02-09 |
| 0048 | [Entertainment and Media Stack](decisions/0048-entertainment-media-stack.md) | ✅ accepted | 2026-02-09 |
| 0049 | [Self-Hosted Productivity Suite](decisions/0049-self-hosted-productivity-suite.md) | ✅ accepted | 2026-02-09 |
| 0050 | [Argo Rollouts Progressive Delivery](decisions/0050-argo-rollouts-progressive-delivery.md) | ✅ accepted | 2026-02-09 |
| 0051 | [KEDA Event-Driven Autoscaling](decisions/0051-keda-event-driven-autoscaling.md) | ✅ accepted | 2026-02-09 |
| 0052 | [Cluster Utilities and Optimization](decisions/0052-cluster-utilities-optimization.md) | ✅ accepted | 2026-02-09 |
| 0053 | [Vaultwarden Password Management](decisions/0053-vaultwarden-password-management.md) | ✅ accepted | 2026-02-09 |
| 0054 | [Kubeflow Pipeline CI/CD](decisions/0054-kubeflow-pipeline-cicd.md) | ✅ accepted | 2026-02-13 |
| 0055 | [Internal Python Package Publishing](decisions/0055-internal-python-package-publishing.md) | ✅ accepted | 2026-02-13 |
| 0056 | [Custom Trained Voice Support in TTS Module](decisions/0056-custom-voice-support-tts-module.md) | ✅ accepted | 2026-02-13 |
| 0057 | [Per-Repository Renovate Configurations](decisions/0057-renovate-per-repo-configs.md) | ✅ accepted | 2026-02-13 |
| 0058 | [Training Strategy – Distributed CPU Now, DGX Spark Later](decisions/0058-training-strategy-cpu-dgx-spark.md) | ✅ accepted | 2026-02-14 |
| 0059 | [Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker](decisions/0059-mac-mini-ray-worker.md) | 📝 proposed | 2026-02-16 |
| 0060 | [Internal PKI with Vault and cert-manager](decisions/0060-internal-pki-vault.md) | ✅ accepted | 2026-02-16 |
<!-- ADR-TABLE-END -->
## 🔗 Related Repositories

| Repository | Purpose |
|------------|---------|
| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |
### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

The former monolithic `llm-workflows` repo has been archived and decomposed into:

| Repository | Purpose |
|------------|---------|
| `handler-base` | Shared Python library for NATS handlers |
| `chat-handler` | Text chat with RAG pipeline |
| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
| `pipeline-bridge` | Bridge between pipelines and services |
| `stt-module` | Speech-to-text service |
| `tts-module` | Text-to-speech service |
| `ray-serve` | Ray Serve inference services |
| `kuberay-images` | GPU-specific Ray worker Docker images |
| `argo` | Argo Workflows (training, batch inference) |
| `kubeflow` | Kubeflow Pipeline definitions |
| `mlflow` | MLflow integration utilities |
| `gradio-ui` | Gradio demo apps (embeddings, STT, TTS) |
| `ntfy-discord` | ntfy → Discord notification bridge |
## 📝 Contributing

1. For architecture changes, create an ADR in `decisions/`
2. Update relevant documentation
3. Submit a PR with context
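
Step 1 can be partly scripted. The sketch below suggests a path for a new ADR that follows the `decisions/NNNN-kebab-case-title.md` pattern visible in the ADR table. The helpers `adr_slug` and `next_adr_path` are hypothetical and do not exist in this repository. Several existing file names shorten the title by hand (for example `0002-use-talos-linux.md`), so treat the generated slug as a starting point.

```bash
# Hypothetical helpers (not part of this repo) for naming a new ADR file.

# adr_slug TITLE: lower-case the title, turn each run of characters outside
# a-z and 0-9 into a single "-", and trim a leading or trailing "-".
adr_slug() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# next_adr_path DIR TITLE: print DIR/NNNN-slug.md, where NNNN is one more than
# the highest four-digit prefix among DIR/*.md (0001 if there is none).
next_adr_path() {
  local dir="$1" title="$2" last next
  last="$(ls "$dir" 2>/dev/null \
    | sed -n 's/^\([0-9]\{4\}\)-.*\.md$/\1/p' \
    | sort -n | tail -n 1 \
    | sed 's/^0*//')"           # strip leading zeros so the value is not read as octal
  next=$(( ${last:-0} + 1 ))
  printf '%s/%04d-%s.md\n' "$dir" "$next" "$(adr_slug "$title")"
}

# Example call: next_adr_path decisions "Use Foo for Bar"
```
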
---

*Last updated: 2026-02-17*