# 🏗️ System Architecture

Comprehensive technical overview of the DaviesTechLabs homelab infrastructure.

## Overview

The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles, using Flux CD with SOPS-encrypted secrets.

## System Layers
```
┌───────────────────────────────────────────────────────────────────────────┐
│                                USER LAYER                                 │
├───────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐       │
│  │ Companions WebApp│   │   Voice WebApp   │   │   Kubeflow UI    │       │
│  │  HTMX + Alpine   │   │    Gradio UI     │   │  Pipeline Mgmt   │       │
│  └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘       │
│           │ WebSocket            │ HTTP/WS              │ HTTP            │
└───────────┴──────────────────────┴──────────────────────┴─────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               INGRESS LAYER                               │
├───────────────────────────────────────────────────────────────────────────┤
│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                  │
│                                                                           │
│  External: *.daviestechlabs.io        Internal: *.lab.daviestechlabs.io   │
│   • git.daviestechlabs.io             • kubeflow.lab.daviestechlabs.io    │
│   • auth.daviestechlabs.io            • companions-chat.lab...            │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                             MESSAGE BUS LAYER                             │
├───────────────────────────────────────────────────────────────────────────┤
│                              NATS + JetStream                             │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │ Streams:                                                            │  │
│  │  • COMPANIONS_LOGINS (7d retention)  - User analytics               │  │
│  │  • COMPANIONS_CHAT   (30d retention) - Chat history                 │  │
│  │  • AI_CHAT_STREAM    (5min, memory)  - Ephemeral streaming          │  │
│  │  • AI_VOICE_STREAM   (1h, file)      - Voice processing             │  │
│  │  • AI_PIPELINE       (24h, file)     - Workflow triggers            │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  Message Format: MessagePack (binary, not JSON)                           │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
  ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
  │   Chat Handler    │     │  Voice Assistant  │     │  Pipeline Bridge  │
  ├───────────────────┤     ├───────────────────┤     ├───────────────────┤
  │ • RAG retrieval   │     │ • STT (Whisper)   │     │ • KFP triggers    │
  │ • LLM inference   │     │ • RAG retrieval   │     │ • Argo triggers   │
  │ • Streaming resp  │     │ • LLM inference   │     │ • Status updates  │
  │ • Session state   │     │ • TTS (XTTS)      │     │ • Error handling  │
  └───────────────────┘     └───────────────────┘     └───────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                      GPU INFERENCE LAYER (KubeRay)                        │
├───────────────────────────────────────────────────────────────────────────┤
│  RayService: ai-inference-serve-svc:8000                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                    Ray Serve (Unified Endpoint)                     │  │
│  │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────┐ │  │
│  │ │ /whisper  │ │   /tts    │ │   /llm    │ │/embeddings│ │/reranker│ │  │
│  │ │  Whisper  │ │   XTTS    │ │   vLLM    │ │   BGE-L   │ │ BGE-Rnk │ │  │
│  │ │ (0.5 GPU) │ │ (0.5 GPU) │ │(0.95 GPU) │ │ (0.8 GPU) │ │(0.8 GPU)│ │  │
│  │ ├───────────┤ ├───────────┤ ├───────────┤ ├───────────┤ ├─────────┤ │  │
│  │ │ elminster │ │ elminster │ │  khelben  │ │  drizzt   │ │ danilo  │ │  │
│  │ │ RTX 2070  │ │ RTX 2070  │ │Strix Halo │ │Radeon 680 │ │Intel Arc│ │  │
│  │ │   CUDA    │ │   CUDA    │ │   ROCm    │ │   ROCm    │ │  Intel  │ │  │
│  │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ └─────────┘ │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
│  KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml    │
│  Milvus: Vector database for RAG (Helm, MinIO backend)                    │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                           WORKFLOW ENGINE LAYER                           │
├───────────────────────────────────────────────────────────────────────────┤
│     ┌────────────────────────────┐      ┌────────────────────────────┐    │
│     │       Argo Workflows       │◄────►│     Kubeflow Pipelines     │    │
│     ├────────────────────────────┤      ├────────────────────────────┤    │
│     │ • Complex DAG orchestration│      │ • ML pipeline caching      │    │
│     │ • Training workflows       │      │ • Experiment tracking      │    │
│     │ • Document ingestion       │      │ • Model versioning         │    │
│     │ • Batch inference          │      │ • Artifact lineage         │    │
│     └────────────────────────────┘      └────────────────────────────┘    │
│                                                                           │
│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)          │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                           INFRASTRUCTURE LAYER                            │
├───────────────────────────────────────────────────────────────────────────┤
│  Storage:                  Compute:                   Security:           │
│  ├─ Longhorn (block)       ├─ Volcano Scheduler       ├─ Vault (secrets)  │
│  ├─ NFS CSI (shared)       ├─ GPU Device Plugins      ├─ Authentik (SSO)  │
│  └─ MinIO (S3)             │  ├─ AMD ROCm             ├─ Falco (runtime)  │
│                            │  ├─ NVIDIA CUDA          └─ SOPS (GitOps)    │
│  Databases:                │  └─ Intel i915/Arc                           │
│  ├─ CloudNative-PG         └─ Node Feature Discovery                      │
│  ├─ Valkey (cache)                                                        │
│  └─ ClickHouse (analytics)                                                │
└───────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               PLATFORM LAYER                              │
├───────────────────────────────────────────────────────────────────────────┤
│   Talos Linux v1.12.x   │   Kubernetes v1.35.0   │   Cilium CNI           │
│                                                                           │
│   14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers      │
│                             │ 5 Raspberry Pi (arm64) workers              │
└───────────────────────────────────────────────────────────────────────────┘
```
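The fractional allocations in the inference layer (two 0.5-GPU deployments sharing elminster, for example) rely on Ray Serve's logical GPU accounting: `num_gpus` is a bookkeeping share, not a hardware partition. A minimal first-fit sketch of the idea, using the diagram's deployments and nodes as data (an illustration only, not Ray's actual scheduler):

```python
# Illustrative first-fit packing of fractional GPU requests onto devices.
# This mimics the idea behind Ray's logical `num_gpus` accounting; it is
# NOT Ray's real scheduler. Fractions and node names mirror the diagram.
def pack(requests: dict[str, float], gpus: list[str]) -> dict[str, str]:
    free = {g: 1.0 for g in gpus}           # each device exposes 1.0 logical GPU
    placement = {}
    for name, frac in requests.items():
        for gpu, avail in free.items():
            if avail + 1e-9 >= frac:        # first device with enough share left
                free[gpu] = avail - frac
                placement[name] = gpu
                break
        else:
            raise RuntimeError(f"no GPU can fit {name} ({frac})")
    return placement

requests = {"whisper": 0.5, "tts": 0.5, "llm": 0.95, "embeddings": 0.8, "reranker": 0.8}
gpus = ["elminster", "khelben", "drizzt", "danilo"]
print(pack(requests, gpus))
# whisper and tts share elminster (0.5 + 0.5); the other three deployments
# each land on their own device, matching the diagram above.
```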
## Node Topology

### Control Plane (HA)
| Node | IP | CPU | Memory | Storage | Role |
|---|---|---|---|---|---|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
VIP: 192.168.100.20 (shared across control plane)
### Worker Nodes — GPU
| Node | IP | CPU | RAM | GPU | GPU Memory | Workload |
|---|---|---|---|---|---|---|
| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker |
### Worker Nodes — CPU-only (x86_64)
| Node | IP | CPU | RAM | Workload |
|---|---|---|---|---|
| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads |
| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads |
### Worker Nodes — Raspberry Pi (arm64)
| Node | IP | CPU | RAM | Workload |
|---|---|---|---|---|
| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services |
### Cluster Totals
| Resource | Total |
|---|---|
| Nodes | 14 (3 control + 11 worker) |
| CPU cores | ~126 |
| System RAM | ~364 GB |
| Architectures | amd64, arm64 |
| GPUs | 4 (NVIDIA, AMD, Intel) |
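The totals can be cross-checked against the node tables. A quick derivation (per-node `(cores, GB)` figures copied from the tables above; the RAM column sums the usable amounts each node reports, so it lands a little under the nominal installed total):

```python
# Sanity-check the cluster totals against the per-node tables above.
# Each entry is (cpu_cores, ram_gb) as listed in the tables.
control = {"storm": (4, 16), "bruenor": (4, 16), "catti": (4, 16)}
gpu     = {"elminster": (16, 62), "khelben": (32, 94), "drizzt": (16, 27), "danilo": (22, 62)}
cpu     = {"regis": (4, 16), "wulfgar": (4, 31)}
rpi     = {"durnan": (4, 4), "jarlaxle": (4, 4), "mirt": (4, 4), "volo": (4, 4), "elaith": (4, 8)}

nodes = {**control, **gpu, **cpu, **rpi}
cores = sum(c for c, _ in nodes.values())
ram   = sum(r for _, r in nodes.values())
print(len(nodes), cores, ram)   # 14 nodes, 126 cores, 364 GB
```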
## Networking

### External Access

```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```
### DNS Zones

- External: `*.daviestechlabs.io` (Cloudflare DNS)
- Internal: `*.lab.daviestechlabs.io` (internal split-horizon)
### Network CIDRs
| Network | CIDR | Purpose |
|---|---|---|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
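Because the three CIDRs are disjoint, any bare IP can be attributed to exactly one network. A stdlib check of that property (and of the VIP sitting inside the node network):

```python
# Verify the three CIDRs above are disjoint, so a bare IP unambiguously
# identifies which network it belongs to. Stdlib only.
import ipaddress

cidrs = {
    "nodes":    ipaddress.ip_network("192.168.100.0/24"),
    "pods":     ipaddress.ip_network("10.42.0.0/16"),
    "services": ipaddress.ip_network("10.43.0.0/16"),
}
for a in cidrs:
    for b in cidrs:
        if a != b:
            assert not cidrs[a].overlaps(cidrs[b]), (a, b)

print(ipaddress.ip_address("192.168.100.20") in cidrs["nodes"])  # True (the control-plane VIP)
```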
## Data Flow: Chat Request
```mermaid
sequenceDiagram
    participant U as User
    participant W as WebApp
    participant N as NATS
    participant C as Chat Handler
    participant M as Milvus
    participant L as vLLM
    participant V as Valkey
    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver to chat-handler
    C->>V: Get session history
    C->>M: RAG query (if enabled)
    M-->>C: Relevant documents
    C->>L: LLM inference (with context)
    L-->>C: Streaming tokens
    C->>N: Publish ai.chat.response.stream.{id}
    N-->>W: Deliver streaming chunks
    W-->>U: Display tokens
    C->>V: Save to session
```
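The `ai.chat.user.{id}.message` subjects in the sequence above follow NATS's dot-separated subject hierarchy, where subscribers can use `*` (exactly one token) or `>` (one or more trailing tokens) as wildcards. A small matcher showing those semantics (pure illustration of the rules; real services would use a NATS client library):

```python
# NATS-style subject matching: '*' matches exactly one token,
# '>' matches one or more trailing tokens and must come last.
def matches(pattern: str, subject: str) -> bool:
    pat, sub = pattern.split("."), subject.split(".")
    for i, p in enumerate(pat):
        if p == ">":               # '>' consumes the rest of the subject
            return i == len(pat) - 1 and len(sub) > i
        if i >= len(sub) or (p != "*" and p != sub[i]):
            return False
    return len(pat) == len(sub)

# The chat-handler can subscribe to every user's messages with one pattern:
assert matches("ai.chat.user.*.message", "ai.chat.user.42.message")
assert not matches("ai.chat.user.*.message", "ai.chat.user.42.typing")
assert matches("ai.chat.>", "ai.chat.response.stream.42")
```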
## GitOps Flow

```
Developer → Git Push → GitHub/Gitea
                  │
                  ▼
           ┌─────────────┐
           │   Flux CD   │
           │ (reconcile) │
           └──────┬──────┘
                  │
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ homelab- │ │   llm-   │ │   helm   │
│   k8s2   │ │workflows │ │  charts  │
└──────────┘ └──────────┘ └──────────┘
     │            │            │
     └────────────┴────────────┘
                  │
                  ▼
           ┌─────────────┐
           │ Kubernetes  │
           │   Cluster   │
           └─────────────┘
```
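The reconcile step is the heart of GitOps: controllers repeatedly compare the state declared in Git against the live cluster, then apply or prune the difference. A toy model of that loop (conceptual only; Flux's real controllers operate on Kubernetes objects, and the service names here are made up):

```python
# Conceptual GitOps reconciliation: the cluster is repeatedly driven toward
# the state declared in Git. A toy model, not Flux's actual implementation.
def reconcile(desired: dict, actual: dict) -> list[str]:
    actions = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actions.append(f"apply {name}")
            actual[name] = spec               # converge toward Git
    for name in set(actual) - set(desired):
        actions.append(f"prune {name}")       # Git is the source of truth
        del actual[name]
    return actions

desired = {"chat-handler": "v2", "voice-assistant": "v1"}
actual  = {"chat-handler": "v1", "old-service": "v1"}
print(reconcile(desired, actual))
print(actual == desired)   # True: drift corrected, orphan pruned
```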
## Security Architecture

### Secrets Management

```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```

### Authentication

```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   │
                                   └──► OIDC/SAML providers
```

### Network Security
- Cilium: Network policies, eBPF-based security
- Falco: Runtime security monitoring
- RBAC: Fine-grained Kubernetes permissions
## High Availability

### Control Plane
- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos
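Three control-plane nodes is the smallest size at which etcd survives a member failure: writes need a majority (quorum) of members. The arithmetic behind that choice:

```python
# etcd commits a write only when a majority (quorum) of members ack it,
# so an n-member cluster tolerates n - quorum(n) failed members.
def quorum(n: int) -> int:
    return n // 2 + 1

for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {n - quorum(n)} failure(s)")
# A 4th member would raise quorum to 3 without improving fault tolerance,
# which is why etcd clusters are run at odd sizes.
```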
### Workloads
- Pod anti-affinity for critical services
- HPA (Horizontal Pod Autoscaler) for auto-scaling
- PodDisruptionBudgets for controlled updates
### Storage
- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups
## Observability

### Metrics Pipeline

```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```

### Logging Pipeline

```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```

### Tracing Pipeline

```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```
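At the scrape end of the metrics pipeline, Prometheus pulls targets in its text exposition format. A sketch of that wire format (the metric name and labels here are invented for illustration; real exporters use a client library rather than hand-formatting lines):

```python
# Sketch of a single sample in the Prometheus text exposition format:
#   metric_name{label="value",...} value
# Labels are sorted for a deterministic output; names here are made up.
def prom_line(name: str, labels: dict[str, str], value: float) -> str:
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"

line = prom_line("http_requests_total", {"service": "chat-handler", "code": "200"}, 1027)
print(line)  # http_requests_total{code="200",service="chat-handler"} 1027
```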
## Key Design Decisions
| Decision | Rationale | ADR |
|---|---|---|
| Talos Linux | Immutable, API-driven, secure | ADR-0002 |
| NATS over Kafka | Simpler ops, sufficient throughput | ADR-0003 |
| MessagePack over JSON | Binary efficiency for audio | ADR-0004 |
| Multi-GPU heterogeneous | Cost optimization, workload matching | ADR-0005 |
| GitOps with Flux | Declarative, auditable, secure | ADR-0006 |
| KServe for inference | Standardized API, autoscaling | ADR-0007 |
| KubeRay unified backend | Fractional GPU, single endpoint | ADR-0011 |
| Go handler refactor | Slim images for non-ML services | ADR-0061 |
## Related Documents
- `TECH-STACK.md` - Complete technology inventory
- `DOMAIN-MODEL.md` - Core entities and relationships
- `decisions/` - All architecture decisions