# 🏗️ System Architecture

> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**

## Overview

The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.

## System Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 USER LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │ Companions WebApp│  │   Voice WebApp   │  │   Kubeflow UI    │           │
│  │  HTMX + Alpine   │  │    Gradio UI     │  │  Pipeline Mgmt   │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │ WebSocket           │ HTTP/WS             │ HTTP                │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                INGRESS LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                    │
│                                                                             │
│  External: *.daviestechlabs.io       Internal: *.lab.daviestechlabs.io      │
│    • git.daviestechlabs.io             • kubeflow.lab.daviestechlabs.io     │
│    • auth.daviestechlabs.io            • companions-chat.lab...             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MESSAGE BUS LAYER                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                              NATS + JetStream                               │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Streams:                                                           │   │
│   │  • COMPANIONS_LOGINS (7d retention)  - User analytics               │   │
│   │  • COMPANIONS_CHAT   (30d retention) - Chat history                 │   │
│   │  • AI_CHAT_STREAM    (5min, memory)  - Ephemeral streaming          │   │
│   │  • AI_VOICE_STREAM   (1h, file)      - Voice processing             │   │
│   │  • AI_PIPELINE       (24h, file)     - Workflow triggers            │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Message Format: MessagePack (binary, not JSON)                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
             ┌─────────────────────────┼─────────────────────────┐
             ▼                         ▼                         ▼
   ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
   │   Chat Handler    │     │  Voice Assistant  │     │  Pipeline Bridge  │
   ├───────────────────┤     ├───────────────────┤     ├───────────────────┤
   │ • RAG retrieval   │     │ • STT (Whisper)   │     │ • KFP triggers    │
   │ • LLM inference   │     │ • RAG retrieval   │     │ • Argo triggers   │
   │ • Streaming resp  │     │ • LLM inference   │     │ • Status updates  │
   │ • Session state   │     │ • TTS (XTTS)      │     │ • Error handling  │
   └───────────────────┘     └───────────────────┘     └───────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        GPU INFERENCE LAYER (KubeRay)                        │
├─────────────────────────────────────────────────────────────────────────────┤
│  RayService: ai-inference-serve-svc:8000                                    │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    Ray Serve (Unified Endpoint)                     │   │
│   │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐  │   │
│   │  │ /whisper │ │   /tts   │ │   /llm   │ │/embeddings│ │/reranker │  │   │
│   │  │ Whisper  │ │   XTTS   │ │   vLLM   │ │   BGE-L   │ │ BGE-Rnk  │  │   │
│   │  │(0.5 GPU) │ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │  │   │
│   │  ├──────────┤ ├──────────┤ ├──────────┤ ├───────────┤ ├──────────┤  │   │
│   │  │elminster │ │elminster │ │ khelben  │ │  drizzt   │ │  danilo  │  │   │
│   │  │ RTX 2070 │ │ RTX 2070 │ │Strix Halo│ │Radeon 680 │ │Intel Arc │  │   │
│   │  │   CUDA   │ │   CUDA   │ │   ROCm   │ │   ROCm    │ │  Intel   │  │   │
│   │  └──────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml      │
│  Milvus: Vector database for RAG (Helm, MinIO backend)                      │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            WORKFLOW ENGINE LAYER                            │
├─────────────────────────────────────────────────────────────────────────────┤
│      ┌────────────────────────────┐    ┌────────────────────────────┐       │
│      │       Argo Workflows       │◄──►│     Kubeflow Pipelines     │       │
│      ├────────────────────────────┤    ├────────────────────────────┤       │
│      │ • Complex DAG orchestration│    │ • ML pipeline caching      │       │
│      │ • Training workflows       │    │ • Experiment tracking      │       │
│      │ • Document ingestion       │    │ • Model versioning         │       │
│      │ • Batch inference          │    │ • Artifact lineage         │       │
│      └────────────────────────────┘    └────────────────────────────┘       │
│                                                                             │
│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)            │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INFRASTRUCTURE LAYER                             │
├─────────────────────────────────────────────────────────────────────────────┤
│  Storage:              Compute:                   Security:                 │
│  ├─ Longhorn (block)   ├─ Volcano Scheduler       ├─ Vault (secrets)        │
│  ├─ NFS CSI (shared)   ├─ GPU Device Plugins      ├─ Authentik (SSO)        │
│  └─ MinIO (S3)         │  ├─ AMD ROCm             ├─ Falco (runtime)        │
│                        │  ├─ NVIDIA CUDA          └─ SOPS (GitOps)          │
│  Databases:            │  └─ Intel i915/Arc                                 │
│  ├─ CloudNative-PG     └─ Node Feature Discovery                            │
│  ├─ Valkey (cache)                                                          │
│  └─ ClickHouse (analytics)                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               PLATFORM LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Talos Linux v1.12.x    │  Kubernetes v1.35.0     │  Cilium CNI             │
│                                                                             │
│  14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers         │
│                                            │ 5 Raspberry Pi (arm64) workers │
└─────────────────────────────────────────────────────────────────────────────┘
```
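
The stream list in the message bus layer can also be written down as data. The following is a minimal Python sketch, not the cluster's actual configuration: `StreamSpec` and `parse_retention` are hypothetical names introduced here, and the storage type is left as `None` wherever the diagram does not state it. It only converts the retention labels from the diagram into seconds, the kind of value a stream's maximum-age setting would carry.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Seconds per unit suffix used in the diagram's retention labels.
UNIT_SECONDS = {"min": 60, "h": 3600, "d": 86400}


def parse_retention(label: str) -> int:
    """Convert a label such as '7d', '5min' or '24h' into seconds."""
    match = re.fullmatch(r"(\d+)(min|h|d)", label)
    if match is None:
        raise ValueError(f"unrecognised retention label: {label!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


@dataclass(frozen=True)
class StreamSpec:
    """Hypothetical record for one stream; not a NATS client type."""
    name: str
    retention: str
    storage: Optional[str]  # None where the diagram does not state it
    purpose: str

    @property
    def max_age_seconds(self) -> int:
        return parse_retention(self.retention)


# Values copied from the MESSAGE BUS LAYER box.
STREAMS = [
    StreamSpec("COMPANIONS_LOGINS", "7d", None, "User analytics"),
    StreamSpec("COMPANIONS_CHAT", "30d", None, "Chat history"),
    StreamSpec("AI_CHAT_STREAM", "5min", "memory", "Ephemeral streaming"),
    StreamSpec("AI_VOICE_STREAM", "1h", "file", "Voice processing"),
    StreamSpec("AI_PIPELINE", "24h", "file", "Workflow triggers"),
]

if __name__ == "__main__":
    for stream in STREAMS:
        storage = stream.storage or "unstated"
        print(f"{stream.name:<18} {stream.max_age_seconds:>8} s  {storage}")
```

Keeping the labels as strings and deriving seconds on demand means the table stays readable while the numeric value is always consistent with it.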

## Node Topology

### Control Plane (HA)

| Node | IP | CPU | RAM | Storage | Role |
|------|-------|-----|-----|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |

**VIP**: 192.168.100.20 (shared across control plane)

### Worker Nodes — GPU

| Node | IP | CPU | RAM | GPU | GPU Memory | Workload |
|------|-------|-----|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker |
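
Because elminster hosts two deployments on one GPU, it is worth checking that the GPU fractions shown in the System Layers diagram never add up to more than one device per node. The sketch below is illustrative only: the function names are made up for this example, and it assumes each GPU worker exposes exactly one GPU, as the table above suggests.

```python
from collections import defaultdict

# (deployment, node, GPU fraction) as shown in the GPU inference layer.
DEPLOYMENTS = [
    ("whisper", "elminster", 0.5),
    ("tts", "elminster", 0.5),
    ("llm", "khelben", 0.95),
    ("embeddings", "drizzt", 0.8),
    ("reranker", "danilo", 0.8),
]


def gpu_load_per_node(deployments):
    """Sum the requested GPU fraction for every node."""
    load = defaultdict(float)
    for _name, node, fraction in deployments:
        load[node] += fraction
    return dict(load)


def oversubscribed(load, capacity=1.0):
    """Return nodes whose summed fractions exceed the per-node capacity."""
    # Small tolerance so float rounding does not flag an exact fit.
    return sorted(node for node, total in load.items() if total > capacity + 1e-9)


if __name__ == "__main__":
    load = gpu_load_per_node(DEPLOYMENTS)
    for node, total in sorted(load.items()):
        print(f"{node:<10} {total:.2f} GPU")
    print("oversubscribed:", oversubscribed(load))
```

With the figures from the diagram, elminster is the only node that is fully allocated (0.5 + 0.5), so adding a third model there would require lowering one of the existing fractions.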

### Worker Nodes — CPU-only (x86_64)

| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads |
| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads |

### Worker Nodes — Raspberry Pi (arm64)

| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services |

### Cluster Totals

| Resource | Total |
|----------|-------|
| Nodes | 14 (3 control + 11 worker) |
| CPU cores | ~126 |
| System RAM | ~378 GB |
| Architectures | amd64, arm64 |
| GPUs | 4 (NVIDIA, AMD, Intel) |
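
The node, core, and GPU totals can be re-derived from the per-node tables. This is a small sketch for checking the arithmetic, not tooling that exists in the repository: the `Node` record and the role labels (`control`, `gpu`, `cpu`, `pi`) are assumptions made for this example, and the x86 nodes are assumed to report as `amd64`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str   # "control", "gpu", "cpu" or "pi" (labels chosen for this sketch)
    cores: int
    arch: str


# Core counts copied from the node tables above.
NODES = [
    Node("storm", "control", 4, "amd64"),
    Node("bruenor", "control", 4, "amd64"),
    Node("catti", "control", 4, "amd64"),
    Node("elminster", "gpu", 16, "amd64"),
    Node("khelben", "gpu", 32, "amd64"),
    Node("drizzt", "gpu", 16, "amd64"),
    Node("danilo", "gpu", 22, "amd64"),
    Node("regis", "cpu", 4, "amd64"),
    Node("wulfgar", "cpu", 4, "amd64"),
    Node("durnan", "pi", 4, "arm64"),
    Node("jarlaxle", "pi", 4, "arm64"),
    Node("mirt", "pi", 4, "arm64"),
    Node("volo", "pi", 4, "arm64"),
    Node("elaith", "pi", 4, "arm64"),
]


def cluster_totals(nodes):
    """Aggregate node count, worker count, cores, GPU nodes and architectures."""
    return {
        "nodes": len(nodes),
        "workers": sum(1 for n in nodes if n.role != "control"),
        "cores": sum(n.cores for n in nodes),
        "gpu_nodes": sum(1 for n in nodes if n.role == "gpu"),
        "architectures": sorted({n.arch for n in nodes}),
    }


if __name__ == "__main__":
    print(cluster_totals(NODES))
```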

## Networking

### External Access

```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```

### DNS Zones

- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (split-horizon DNS)
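
Since the internal zone is a subdomain of the external one, any code that classifies hostnames has to test the more specific suffix first. A minimal sketch of that ordering follows; `classify_host` is a hypothetical helper, not something defined by the cluster.

```python
INTERNAL_SUFFIX = ".lab.daviestechlabs.io"
EXTERNAL_SUFFIX = ".daviestechlabs.io"


def classify_host(hostname: str) -> str:
    """Return 'internal', 'external' or 'unknown' for a hostname."""
    host = hostname.lower().rstrip(".")  # tolerate case and a trailing dot
    # The internal suffix also ends with the external one, so check it first.
    if host.endswith(INTERNAL_SUFFIX):
        return "internal"
    if host.endswith(EXTERNAL_SUFFIX):
        return "external"
    return "unknown"


if __name__ == "__main__":
    for name in ("git.daviestechlabs.io", "kubeflow.lab.daviestechlabs.io", "example.com"):
        print(name, "->", classify_host(name))
```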

### Network CIDRs

| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
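
The three ranges must not overlap, and every node address (including the control plane VIP) should fall inside the node network. Python's standard `ipaddress` module can verify both; the helper names below are invented for this sketch.

```python
import ipaddress
from itertools import combinations

# CIDRs copied from the table above.
NETWORKS = {
    "node": ipaddress.ip_network("192.168.100.0/24"),
    "pod": ipaddress.ip_network("10.42.0.0/16"),
    "service": ipaddress.ip_network("10.43.0.0/16"),
}


def overlapping_pairs(networks):
    """Return the names of every pair of networks that share addresses."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_node_network(addresses):
    """Return the addresses that are not part of the node network."""
    node_net = NETWORKS["node"]
    return [a for a in addresses if ipaddress.ip_address(a) not in node_net]


if __name__ == "__main__":
    print("overlaps:", overlapping_pairs(NETWORKS))
    # VIP plus two node addresses from the topology tables.
    print("outside:", outside_node_network(["192.168.100.20", "192.168.100.25", "192.168.100.55"]))
```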

## Data Flow: Chat Request

```mermaid
sequenceDiagram
    participant U as User
    participant W as WebApp
    participant N as NATS
    participant C as Chat Handler
    participant M as Milvus
    participant L as vLLM
    participant V as Valkey

    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver to chat-handler
    C->>V: Get session history
    C->>M: RAG query (if enabled)
    M-->>C: Relevant documents
    C->>L: LLM inference (with context)
    L-->>C: Streaming tokens
    C->>N: Publish ai.chat.response.stream.{id}
    N-->>W: Deliver streaming chunks
    W-->>U: Display tokens
    C->>V: Save to session
```
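
The two subject patterns in the sequence diagram can be built and parsed with a few helpers. This is a sketch under assumptions: the function names are hypothetical, and the diagram does not say whether `{id}` in the response subject is a user ID or a per-request ID, so the parameter is simply called `stream_id`. The token check reflects the NATS rule that a subject token cannot contain dots, whitespace, or the `*` and `>` wildcards.

```python
import re

# One subject token: no dots, whitespace, or NATS wildcards (* and >).
_TOKEN = re.compile(r"[^.\s*>]+")


def _check_token(value: str) -> str:
    if not _TOKEN.fullmatch(value):
        raise ValueError(f"not a valid subject token: {value!r}")
    return value


def request_subject(user_id: str) -> str:
    """Subject the WebApp publishes a chat message on."""
    return f"ai.chat.user.{_check_token(user_id)}.message"


def response_subject(stream_id: str) -> str:
    """Subject the chat handler publishes streaming chunks on."""
    return f"ai.chat.response.stream.{_check_token(stream_id)}"


def user_id_from_request(subject: str) -> str:
    """Extract the ID token from an ai.chat.user.{id}.message subject."""
    parts = subject.split(".")
    if len(parts) != 5 or parts[:3] != ["ai", "chat", "user"] or parts[4] != "message":
        raise ValueError(f"unexpected subject: {subject!r}")
    return _check_token(parts[3])


if __name__ == "__main__":
    subject = request_subject("42")
    print(subject)                        # ai.chat.user.42.message
    print(user_id_from_request(subject))  # 42
    print(response_subject("42"))         # ai.chat.response.stream.42
```

Validating the ID before interpolating it keeps a stray dot or wildcard from silently changing which subscribers receive the message.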

## GitOps Flow

```
Developer → Git Push → GitHub/Gitea
                            │
                            ▼
                     ┌─────────────┐
                     │   Flux CD   │
                     │ (reconcile) │
                     └──────┬──────┘
                            │
             ┌──────────────┼──────────────┐
             ▼              ▼              ▼
        ┌──────────┐   ┌──────────┐   ┌──────────┐
        │homelab-  │   │  llm-    │   │  helm    │
        │  k8s2    │   │workflows │   │ charts   │
        └──────────┘   └──────────┘   └──────────┘
             │              │              │
             └──────────────┴──────────────┘
                            │
                            ▼
                     ┌─────────────┐
                     │ Kubernetes  │
                     │  Cluster    │
                     └─────────────┘
```

## Security Architecture

### Secrets Management

```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```

### Authentication

```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   │
                                   └──► OIDC/SAML providers
```

### Network Security

- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions

## High Availability

### Control Plane

- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos
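
A 3-node etcd cluster stays writable as long as a majority of members is reachable, which means it tolerates the loss of one control plane node. The arithmetic behind that, as a short sketch with illustrative function names:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        print(f"{size} members: quorum {quorum(size)}, tolerates {tolerated_failures(size)}")
    # 3 members: quorum 2, tolerates 1
```

A fourth member would raise the quorum to three without tolerating any additional failure, which is why odd member counts are the usual choice.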

### Workloads

- Pod anti-affinity for critical services
- HPA for auto-scaling
- PodDisruptionBudgets for controlled updates

### Storage

- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups

## Observability

### Metrics Pipeline

```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```

### Logging Pipeline

```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```

### Tracing Pipeline

```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```

## Key Design Decisions

| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |
| Go handler refactor | Slim images for non-ML services | [ADR-0061](decisions/0061-go-handler-refactor.md) |

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions