homelab-design/ARCHITECTURE.md
Billy D. 200cc5704b
docs: update node inventory and 70B QLoRA feasibility analysis
2026-02-15 11:19:22 -05:00


# 🏗️ System Architecture
> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**
## Overview
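The homelab is a production-grade, 14-node Kubernetes cluster running on bare-metal hardware. It is designed for AI/ML workloads and schedules inference across four GPU workers from three vendors (NVIDIA, AMD, Intel). Cluster state follows GitOps principles: Flux CD reconciles the cluster from Git, and secrets are committed SOPS-encrypted.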
## System Layers
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Companions WebApp│ │ Voice WebApp │ │ Kubeflow UI │ │
│ │ HTMX + Alpine │ │ Gradio UI │ │ Pipeline Mgmt │ │
│ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │
│ │ WebSocket │ HTTP/WS │ HTTP │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INGRESS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs │
│ │
│ External: *.daviestechlabs.io Internal: *.lab.daviestechlabs.io │
│ • git.daviestechlabs.io • kubeflow.lab.daviestechlabs.io │
│ • auth.daviestechlabs.io • companions-chat.lab... │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ MESSAGE BUS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ NATS + JetStream │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Streams: │ │
│ │ • COMPANIONS_LOGINS (7d retention) - User analytics │ │
│ │ • COMPANIONS_CHAT (30d retention) - Chat history │ │
│ │ • AI_CHAT_STREAM (5min, memory) - Ephemeral streaming │ │
│ │ • AI_VOICE_STREAM (1h, file) - Voice processing │ │
│ │ • AI_PIPELINE (24h, file) - Workflow triggers │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Message Format: MessagePack (binary, not JSON) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Chat Handler │ │ Voice Assistant │ │ Pipeline Bridge │
├───────────────────┤ ├───────────────────┤ ├───────────────────┤
│ • RAG retrieval │ │ • STT (Whisper) │ │ • KFP triggers │
│ • LLM inference │ │ • RAG retrieval │ │ • Argo triggers │
│ • Streaming resp │ │ • LLM inference │ │ • Status updates │
│ • Session state │ │ • TTS (XTTS) │ │ • Error handling │
└───────────────────┘ └───────────────────┘ └───────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ GPU INFERENCE LAYER (KubeRay) │
├─────────────────────────────────────────────────────────────────────────────┤
│ RayService: ai-inference-serve-svc:8000 │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Ray Serve (Unified Endpoint) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ /whisper │ │ /tts │ │ /llm │ │/embeddings│ │/reranker │ │ │
│ │ │ Whisper │ │ XTTS │ │ vLLM │ │ BGE-L │ │ BGE-Rnk │ │ │
│ │ │ (0.5 GPU)│ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │ │ │
│ │ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ │ │
│ │ │elminster │ │elminster │ │ khelben │ │ drizzt │ │ danilo │ │ │
│ │ │RTX 2070 │ │RTX 2070 │ │Strix Halo│ │Radeon 680│ │Intel Arc │ │ │
│ │ │ CUDA │ │ CUDA │ │ ROCm │ │ ROCm │ │ Intel │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml │
│ Milvus: Vector database for RAG (Helm, MinIO backend) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW ENGINE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────────────┐ ┌────────────────────────────┐ │
│ │ Argo Workflows │◄──►│ Kubeflow Pipelines │ │
│ ├────────────────────────────┤ ├────────────────────────────┤ │
│ │ • Complex DAG orchestration│ │ • ML pipeline caching │ │
│ │ • Training workflows │ │ • Experiment tracking │ │
│ │ • Document ingestion │ │ • Model versioning │ │
│ │ • Batch inference │ │ • Artifact lineage │ │
│ └────────────────────────────┘ └────────────────────────────┘ │
│ │
│ Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Storage: Compute: Security: │
│ ├─ Longhorn (block) ├─ Volcano Scheduler ├─ Vault (secrets) │
│ ├─ NFS CSI (shared) ├─ GPU Device Plugins ├─ Authentik (SSO) │
│ └─ MinIO (S3) │ ├─ AMD ROCm ├─ Falco (runtime) │
│ │ ├─ NVIDIA CUDA └─ SOPS (GitOps) │
│ Databases: │ └─ Intel i915/Arc │
│ ├─ CloudNative-PG └─ Node Feature Discovery │
│ ├─ Valkey (cache) │
│ └─ ClickHouse (analytics) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLATFORM LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Talos Linux v1.12.x │ Kubernetes v1.35.0 │ Cilium CNI │
│ │
│ 14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers │
│ │ 5 Raspberry Pi (arm64) workers │
└─────────────────────────────────────────────────────────────────────────────┘
```
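
The retention windows listed in the message bus layer can be written down as data, which makes it easy to reason about when a message ages out of a stream. The following is a minimal sketch using only the Python standard library, not the cluster's actual JetStream configuration: `StreamSpec` and `is_expired` are hypothetical names, and the storage type is left as `None` for the two streams where the diagram does not state it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class StreamSpec:
    """One stream from the diagram (hypothetical helper type)."""
    name: str
    max_age: timedelta       # retention window shown in the diagram
    storage: Optional[str]   # "memory" or "file"; None where the diagram does not say
    purpose: str


STREAMS = [
    StreamSpec("COMPANIONS_LOGINS", timedelta(days=7), None, "User analytics"),
    StreamSpec("COMPANIONS_CHAT", timedelta(days=30), None, "Chat history"),
    StreamSpec("AI_CHAT_STREAM", timedelta(minutes=5), "memory", "Ephemeral streaming"),
    StreamSpec("AI_VOICE_STREAM", timedelta(hours=1), "file", "Voice processing"),
    StreamSpec("AI_PIPELINE", timedelta(hours=24), "file", "Workflow triggers"),
]


def is_expired(spec: StreamSpec, published_at: datetime, now: datetime) -> bool:
    """Return True if a message published at `published_at` is older than the window."""
    return now - published_at > spec.max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for spec in STREAMS:
        # Show each retention window in seconds for quick comparison.
        print(f"{spec.name:18} {int(spec.max_age.total_seconds()):>8}s  {spec.storage or 'unspecified'}")
```

For example, a token chunk on `AI_CHAT_STREAM` that is six minutes old falls outside its 5-minute window, while the same age is well within the 1-hour window of `AI_VOICE_STREAM`.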
## Node Topology
### Control Plane (HA)
| Node | IP | CPU | Memory | Storage | Role |
|------|-------|-----|--------|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |

**VIP**: 192.168.100.20 (shared across control plane)

### Worker Nodes — GPU
| Node | IP | CPU | RAM | GPU | GPU Memory | Workload |
|------|-------|-----|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker |
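
Because Ray Serve deployments request fractional GPUs, it is worth checking that the fractions placed on each node do not exceed one physical GPU. The sketch below uses the fractions and node placements from the GPU inference layer diagram; the function names are hypothetical, and the one-GPU-per-node capacity is an assumption based on the table above.

```python
from collections import defaultdict

# (deployment, node, GPU fraction) as shown in the GPU inference layer diagram
DEPLOYMENTS = [
    ("whisper", "elminster", 0.5),
    ("tts", "elminster", 0.5),
    ("llm", "khelben", 0.95),
    ("embeddings", "drizzt", 0.8),
    ("reranker", "danilo", 0.8),
]


def gpu_load_by_node(deployments):
    """Sum the requested GPU fractions per node."""
    load = defaultdict(float)
    for _name, node, fraction in deployments:
        load[node] += fraction
    return dict(load)


def oversubscribed(deployments, capacity=1.0):
    """Return the nodes whose requested fractions exceed `capacity` GPUs (assumed 1 per node)."""
    loads = gpu_load_by_node(deployments)
    # A small tolerance avoids false positives from floating-point rounding.
    return sorted(node for node, used in loads.items() if used > capacity + 1e-9)


if __name__ == "__main__":
    for node, used in sorted(gpu_load_by_node(DEPLOYMENTS).items()):
        print(f"{node:10} {used:.2f} GPU requested")
    print("oversubscribed:", oversubscribed(DEPLOYMENTS))
```

With the current placement, elminster is fully allocated (Whisper and XTTS at 0.5 each), so any further GPU deployment on that node would not fit.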
### Worker Nodes — CPU-only (x86_64)
| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads |
| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads |
### Worker Nodes — Raspberry Pi (arm64)
| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services |
### Cluster Totals
| Resource | Total |
|----------|-------|
| Nodes | 14 (3 control + 11 worker) |
| CPU cores | ~126 |
| System RAM | ~378 GB |
| Architectures | amd64, arm64 |
| GPUs | 4 (NVIDIA, AMD, Intel) |
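
The node and core totals can be re-derived from the per-node tables. This is a small sketch that transcribes the tables by hand; the role labels (`control`, `gpu`, `cpu`, `pi`) are hypothetical tags, the control-plane architecture is assumed to be amd64 because the nodes are Intel, and the core counts are taken as listed.

```python
from collections import Counter

# (node, role, arch, cores) transcribed from the node tables above
NODES = [
    ("storm", "control", "amd64", 4),
    ("bruenor", "control", "amd64", 4),
    ("catti", "control", "amd64", 4),
    ("elminster", "gpu", "amd64", 16),
    ("khelben", "gpu", "amd64", 32),
    ("drizzt", "gpu", "amd64", 16),
    ("danilo", "gpu", "amd64", 22),
    ("regis", "cpu", "amd64", 4),
    ("wulfgar", "cpu", "amd64", 4),
    ("durnan", "pi", "arm64", 4),
    ("jarlaxle", "pi", "arm64", 4),
    ("mirt", "pi", "arm64", 4),
    ("volo", "pi", "arm64", 4),
    ("elaith", "pi", "arm64", 4),
]


def summarize(nodes):
    """Return node count, per-role counts, total cores, and architectures."""
    roles = Counter(role for _name, role, _arch, _cores in nodes)
    return {
        "nodes": len(nodes),
        "roles": dict(roles),
        "workers": len(nodes) - roles["control"],
        "cores": sum(cores for _name, _role, _arch, cores in nodes),
        "architectures": sorted({arch for _name, _role, arch, _cores in nodes}),
    }


if __name__ == "__main__":
    for key, value in summarize(NODES).items():
        print(f"{key}: {value}")
```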
## Networking
### External Access
```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```
### DNS Zones
- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)
### Network CIDRs
| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
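
A quick sanity check on the address plan is that the three CIDRs do not overlap and that every node address, plus the control-plane VIP, falls inside the node network. The sketch below uses Python's standard `ipaddress` module with the addresses from this document; the helper names are hypothetical.

```python
import ipaddress
from itertools import combinations

NETWORKS = {
    "node": ipaddress.ip_network("192.168.100.0/24"),
    "pod": ipaddress.ip_network("10.42.0.0/16"),
    "service": ipaddress.ip_network("10.43.0.0/16"),
}

CONTROL_PLANE_VIP = ipaddress.ip_address("192.168.100.20")

# Node addresses from the Node Topology tables
NODE_IPS = {
    name: ipaddress.ip_address(ip)
    for name, ip in {
        "storm": "192.168.100.25", "bruenor": "192.168.100.26", "catti": "192.168.100.27",
        "elminster": "192.168.100.31", "khelben": "192.168.100.32",
        "drizzt": "192.168.100.40", "danilo": "192.168.100.41",
        "wulfgar": "192.168.100.42", "regis": "192.168.100.43",
        "volo": "192.168.100.51", "mirt": "192.168.100.52", "jarlaxle": "192.168.100.53",
        "durnan": "192.168.100.54", "elaith": "192.168.100.55",
    }.items()
}


def overlapping_networks(networks):
    """Return pairs of network names whose CIDRs overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def addresses_outside(network, addresses):
    """Return the names whose address is not inside `network`."""
    return sorted(name for name, ip in addresses.items() if ip not in network)


if __name__ == "__main__":
    print("overlaps:", overlapping_networks(NETWORKS))
    print("outside node network:", addresses_outside(NETWORKS["node"], NODE_IPS))
    print("VIP in node network:", CONTROL_PLANE_VIP in NETWORKS["node"])
```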
## Data Flow: Chat Request
```mermaid
sequenceDiagram
participant U as User
participant W as WebApp
participant N as NATS
participant C as Chat Handler
participant M as Milvus
participant L as vLLM
participant V as Valkey
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver to chat-handler
C->>V: Get session history
C->>M: RAG query (if enabled)
M-->>C: Relevant documents
C->>L: LLM inference (with context)
L-->>C: Streaming tokens
C->>N: Publish ai.chat.response.stream.{id}
N-->>W: Deliver streaming chunks
W-->>U: Display tokens
C->>V: Save to session
```
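
The sequence above can be sketched as a single handler function with its dependencies injected, which keeps the ordering of the steps explicit. This is an illustrative outline rather than the actual chat-handler code: `handle_chat` and its callable parameters are hypothetical, NATS, Valkey, Milvus, and vLLM are stood in for by plain functions, and `{id}` in the response subject is assumed to be the user ID.

```python
from typing import Callable, Iterable, List


def request_subject(user_id: str) -> str:
    """Subject the web app publishes to (pattern from the diagram)."""
    return f"ai.chat.user.{user_id}.message"


def response_subject(stream_id: str) -> str:
    """Subject the handler streams tokens to (pattern from the diagram)."""
    return f"ai.chat.response.stream.{stream_id}"


def handle_chat(
    user_id: str,
    message: str,
    *,
    get_history: Callable[[str], List[str]],                         # stand-in for Valkey read
    retrieve: Callable[[str], List[str]],                            # stand-in for Milvus RAG query
    generate: Callable[[str, List[str], List[str]], Iterable[str]],  # stand-in for vLLM streaming
    publish: Callable[[str, str], None],                             # stand-in for NATS publish
    save_turn: Callable[[str, str, str], None],                      # stand-in for Valkey write
    rag_enabled: bool = True,
) -> str:
    history = get_history(user_id)
    documents = retrieve(message) if rag_enabled else []  # RAG is optional
    subject = response_subject(user_id)                   # assumption: {id} is the user ID
    tokens = []
    for token in generate(message, history, documents):
        publish(subject, token)                           # forward each chunk as it arrives
        tokens.append(token)
    reply = "".join(tokens)
    save_turn(user_id, message, reply)                    # persist only after streaming ends
    return reply
```

Passing the collaborators in as callables is a design choice for the sketch: it lets the flow be exercised with in-memory stubs before wiring in real clients.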
## GitOps Flow
```
Developer → Git Push → GitHub/Gitea
             ┌─────────────┐
             │   Flux CD   │
             │ (reconcile) │
             └──────┬──────┘
     ┌──────────────┼──────────────┐
     ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│homelab-  │   │  llm-    │   │  helm    │
│  k8s2    │   │workflows │   │  charts  │
└──────────┘   └──────────┘   └──────────┘
     │              │              │
     └──────────────┴──────────────┘
             ┌─────────────┐
             │ Kubernetes  │
             │   Cluster   │
             └─────────────┘
```
## Security Architecture
### Secrets Management
```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```
### Authentication
```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   └──► OIDC/SAML providers
```
### Network Security
- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions
## High Availability
### Control Plane
- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos
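
The fault tolerance of the 3-node etcd cluster follows from Raft's majority rule: a cluster of `n` members needs `n // 2 + 1` of them available to commit writes. A short sketch of that arithmetic (the function names are hypothetical):

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For this cluster, three control-plane nodes give a quorum of two, so one node can fail or be upgraded at a time. A fourth member would not improve on that, which is why odd member counts are the usual choice.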
### Workloads
- Pod anti-affinity for critical services
- HorizontalPodAutoscaler (HPA) for autoscaling
- PodDisruptionBudgets for controlled updates
### Storage
- Longhorn volumes default to 3 replicas
- MinIO erasure coding for S3
- Regular Velero backups
## Observability
### Metrics Pipeline
```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```
### Logging Pipeline
```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```
### Tracing Pipeline
```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```
## Key Design Decisions
| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |
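
The "binary efficiency for audio" rationale behind MessagePack can be illustrated with a size comparison: JSON has no binary type, so audio bytes must be base64-encoded (roughly 4/3 the size), while MessagePack carries them as a raw `bin` field. The sketch below hand-rolls a small subset of the MessagePack format so it runs with the standard library alone; a real service would use a MessagePack library instead. The message fields `session_id` and `audio` are hypothetical.

```python
import base64
import json
import struct


def pack(obj) -> bytes:
    """Encode None, bool, ints in [0, 2**32), str, bytes, and small dicts as MessagePack."""
    if obj is None:
        return b"\xc0"
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj < 0x80:
            return bytes([obj])                         # positive fixint
        if 0 <= obj < 0x100:
            return b"\xcc" + bytes([obj])               # uint 8
        if 0 <= obj < 0x10000:
            return b"\xcd" + struct.pack(">H", obj)     # uint 16
        if 0 <= obj < 0x100000000:
            return b"\xce" + struct.pack(">I", obj)     # uint 32
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) < 32:
            return bytes([0xA0 | len(data)]) + data     # fixstr
        if len(data) < 0x100:
            return b"\xd9" + bytes([len(data)]) + data  # str 8
        if len(data) < 0x10000:
            return b"\xda" + struct.pack(">H", len(data)) + data  # str 16
        raise ValueError("string too long for this sketch")
    if isinstance(obj, bytes):
        if len(obj) < 0x100:
            return b"\xc4" + bytes([len(obj)]) + obj    # bin 8
        if len(obj) < 0x10000:
            return b"\xc5" + struct.pack(">H", len(obj)) + obj    # bin 16
        return b"\xc6" + struct.pack(">I", len(obj)) + obj        # bin 32
    if isinstance(obj, dict):
        if len(obj) >= 16:
            raise ValueError("only fixmap (< 16 entries) is supported here")
        return bytes([0x80 | len(obj)]) + b"".join(pack(k) + pack(v) for k, v in obj.items())
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def compare_sizes(audio: bytes):
    """Return (MessagePack size, JSON size) in bytes for the same audio message."""
    packed = pack({"session_id": "abc", "audio": audio})
    as_json = json.dumps(
        {"session_id": "abc", "audio": base64.b64encode(audio).decode("ascii")}
    ).encode("utf-8")
    return len(packed), len(as_json)


if __name__ == "__main__":
    msgpack_size, json_size = compare_sizes(bytes(4800))  # 4800 bytes of silence as a stand-in
    print(f"MessagePack: {msgpack_size} bytes, JSON + base64: {json_size} bytes")
```

The gap grows linearly with the audio payload, so it matters most on the voice path and much less for short text messages.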
## Related Documents
- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions