# 🏗️ System Architecture

> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**

## Overview

The homelab is a 14-node Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads across heterogeneous GPUs (NVIDIA, AMD, and Intel). It is managed with GitOps: Flux CD reconciles the cluster from Git, and secrets are stored SOPS-encrypted.

## System Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER LAYER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │ Companions WebApp│  │   Voice WebApp   │  │   Kubeflow UI    │           │
│  │  HTMX + Alpine   │  │    Gradio UI     │  │  Pipeline Mgmt   │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │ WebSocket           │ HTTP/WS             │ HTTP                │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           INGRESS LAYER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                    │
│                                                                              │
│  External: *.daviestechlabs.io          Internal: *.lab.daviestechlabs.io  │
│  • git.daviestechlabs.io                • kubeflow.lab.daviestechlabs.io   │
│  • auth.daviestechlabs.io               • companions-chat.lab...           │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          MESSAGE BUS LAYER                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                           NATS + JetStream                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Streams:                                                            │    │
│  │  • COMPANIONS_LOGINS (7d retention)  - User analytics               │    │
│  │  • COMPANIONS_CHAT (30d retention)   - Chat history                 │    │
│  │  • AI_CHAT_STREAM (5min, memory)     - Ephemeral streaming          │    │
│  │  • AI_VOICE_STREAM (1h, file)        - Voice processing             │    │
│  │  • AI_PIPELINE (24h, file)           - Workflow triggers            │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Message Format: MessagePack (binary, not JSON)                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        ▼                         ▼                         ▼
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│   Chat Handler    │   │  Voice Assistant  │   │  Pipeline Bridge  │
├───────────────────┤   ├───────────────────┤   ├───────────────────┤
│ • RAG retrieval   │   │ • STT (Whisper)   │   │ • KFP triggers    │
│ • LLM inference   │   │ • RAG retrieval   │   │ • Argo triggers   │
│ • Streaming resp  │   │ • LLM inference   │   │ • Status updates  │
│ • Session state   │   │ • TTS (XTTS)      │   │ • Error handling  │
└───────────────────┘   └───────────────────┘   └───────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      GPU INFERENCE LAYER (KubeRay)                           │
├─────────────────────────────────────────────────────────────────────────────┤
│  RayService: ai-inference-serve-svc:8000                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    Ray Serve (Unified Endpoint)                      │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐  │    │
│  │  │ /whisper │ │   /tts   │ │   /llm   │ │/embeddings│ │/reranker │  │    │
│  │  │ Whisper  │ │  XTTS    │ │  vLLM    │ │  BGE-L    │ │ BGE-Rnk  │  │    │
│  │  │ (0.5 GPU)│ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │  │    │
│  │  ├──────────┤ ├──────────┤ ├──────────┤ ├───────────┤ ├──────────┤  │    │
│  │  │elminster │ │elminster │ │ khelben  │ │  drizzt   │ │  danilo  │  │    │
│  │  │RTX 2070  │ │RTX 2070  │ │Strix Halo│ │Radeon 680M│ │Intel Arc │  │    │
│  │  │  CUDA    │ │  CUDA    │ │  ROCm    │ │   ROCm    │ │  Intel   │  │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml     │
│  Milvus: Vector database for RAG (Helm, MinIO backend)                      │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                       WORKFLOW ENGINE LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────┐    ┌────────────────────────────┐          │
│  │     Argo Workflows         │◄──►│    Kubeflow Pipelines      │          │
│  ├────────────────────────────┤    ├────────────────────────────┤          │
│  │ • Complex DAG orchestration│    │ • ML pipeline caching      │          │
│  │ • Training workflows       │    │ • Experiment tracking      │          │
│  │ • Document ingestion       │    │ • Model versioning         │          │
│  │ • Batch inference          │    │ • Artifact lineage         │          │
│  └────────────────────────────┘    └────────────────────────────┘          │
│                                                                              │
│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)           │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        INFRASTRUCTURE LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  Storage:                     Compute:                 Security:            │
│  ├─ Longhorn (block)          ├─ Volcano Scheduler     ├─ Vault (secrets)  │
│  ├─ NFS CSI (shared)          ├─ GPU Device Plugins    ├─ Authentik (SSO)  │
│  └─ MinIO (S3)                │   ├─ AMD ROCm          ├─ Falco (runtime)  │
│                               │   ├─ NVIDIA CUDA       └─ SOPS (GitOps)    │
│  Databases:                   │   └─ Intel i915/Arc                        │
│  ├─ CloudNative-PG            └─ Node Feature Discovery                    │
│  ├─ Valkey (cache)                                                          │
│  └─ ClickHouse (analytics)                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          PLATFORM LAYER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  Talos Linux v1.12.x  │  Kubernetes v1.35.0  │  Cilium CNI                 │
│                                                                              │
│  14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers       │
│            │ 5 Raspberry Pi (arm64) workers                                 │
└─────────────────────────────────────────────────────────────────────────────┘
```
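
The JetStream retention settings shown in the message bus layer can also be written down as data, which makes them easy to compare or validate. The following is a minimal sketch, not the cluster's actual configuration: `StreamSpec`, `STREAMS`, and `max_age_seconds` are hypothetical names introduced for illustration. Only the stream names, retention windows, storage types, and purposes come from the diagram. The diagram does not state a storage type for the two `COMPANIONS_*` streams, so the sketch leaves it as `None`.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class StreamSpec:
    """Hypothetical record describing one JetStream stream from the diagram."""
    name: str
    max_age: timedelta
    storage: Optional[str]  # "memory", "file", or None when the diagram does not say
    purpose: str


STREAMS = (
    StreamSpec("COMPANIONS_LOGINS", timedelta(days=7), None, "User analytics"),
    StreamSpec("COMPANIONS_CHAT", timedelta(days=30), None, "Chat history"),
    StreamSpec("AI_CHAT_STREAM", timedelta(minutes=5), "memory", "Ephemeral streaming"),
    StreamSpec("AI_VOICE_STREAM", timedelta(hours=1), "file", "Voice processing"),
    StreamSpec("AI_PIPELINE", timedelta(hours=24), "file", "Workflow triggers"),
)


def max_age_seconds(spec: StreamSpec) -> int:
    """Retention window in whole seconds."""
    return int(spec.max_age.total_seconds())


if __name__ == "__main__":
    # List streams from shortest to longest retention.
    for spec in sorted(STREAMS, key=max_age_seconds):
        storage = spec.storage or "unspecified"
        print(f"{spec.name:<18} {max_age_seconds(spec):>8}s  {storage}")
```

Keeping the retention windows as `timedelta` values avoids unit mistakes when mixing minutes, hours, and days.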

## Node Topology

### Control Plane (HA)

| Node | IP | CPU | RAM | Storage | Role |
|------|-------|-----|-----|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16 GB | 500 GB NVMe | etcd, API server |

**VIP**: 192.168.100.20 (shared across control plane)

### Worker Nodes — GPU

| Node | IP | CPU | RAM | GPU | GPU Memory | Workload |
|------|-------|-----|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker |
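
Because Whisper and XTTS share the single RTX 2070 on elminster, it is worth checking that the fractional GPU requests on each node never add up to more than one GPU. The sketch below is an illustrative check under stated assumptions: the `DEPLOYMENTS` list, `gpu_load_per_node`, and `oversubscribed` are hypothetical names, and the fractions are copied from the GPU inference layer diagram rather than read from the live Ray Serve configuration.

```python
from collections import defaultdict

# (deployment, node, GPU fraction) as shown in the GPU inference layer diagram.
DEPLOYMENTS = [
    ("whisper", "elminster", 0.5),
    ("tts", "elminster", 0.5),
    ("llm", "khelben", 0.95),
    ("embeddings", "drizzt", 0.8),
    ("reranker", "danilo", 0.8),
]


def gpu_load_per_node(deployments):
    """Sum the requested GPU fractions per node (each GPU node has one GPU)."""
    load = defaultdict(float)
    for _name, node, fraction in deployments:
        load[node] += fraction
    return dict(load)


def oversubscribed(deployments, capacity=1.0):
    """Return the nodes whose summed fractions exceed the per-node GPU capacity."""
    totals = gpu_load_per_node(deployments)
    # A small tolerance guards against floating-point rounding in the sums.
    return sorted(node for node, total in totals.items() if total > capacity + 1e-9)


if __name__ == "__main__":
    print(gpu_load_per_node(DEPLOYMENTS))
    print("oversubscribed:", oversubscribed(DEPLOYMENTS))
```

With the fractions listed above, elminster is fully allocated at exactly one GPU, so any further deployment placed there would be reported as oversubscribed.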

### Worker Nodes — CPU-only (x86_64)

| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads |
| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads |

### Worker Nodes — Raspberry Pi (arm64)

| Node | IP | CPU | RAM | Workload |
|------|-------|-----|-----|----------|
| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services |
| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services |

### Cluster Totals

| Resource | Total |
|----------|-------|
| Nodes | 14 (3 control + 11 worker) |
| CPU cores | ~126 |
| System RAM | ~364 GB |
| Architectures | amd64, arm64 |
| GPUs | 4 (NVIDIA, AMD, Intel) |
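
The totals can be cross-checked against the per-node tables. The sketch below simply sums the figures as they are listed there; the `NODES` list and the `totals` helper are hypothetical names for illustration. The RAM values in the tables are rounded per node, so the summed RAM is approximate and may differ from a figure read directly from the nodes.

```python
# (node, cores, ram_gb, role) copied from the node tables; RAM values are as listed.
NODES = [
    ("storm", 4, 16, "control"),
    ("bruenor", 4, 16, "control"),
    ("catti", 4, 16, "control"),
    ("elminster", 16, 62, "gpu"),
    ("khelben", 32, 94, "gpu"),
    ("drizzt", 16, 27, "gpu"),
    ("danilo", 22, 62, "gpu"),
    ("regis", 4, 16, "cpu"),
    ("wulfgar", 4, 31, "cpu"),
    ("durnan", 4, 4, "pi"),
    ("jarlaxle", 4, 4, "pi"),
    ("mirt", 4, 4, "pi"),
    ("volo", 4, 4, "pi"),
    ("elaith", 4, 8, "pi"),
]


def totals(nodes):
    """Aggregate node count, core count, and RAM from the per-node rows."""
    control = sum(1 for _n, _c, _r, role in nodes if role == "control")
    return {
        "nodes": len(nodes),
        "control": control,
        "workers": len(nodes) - control,
        "cores": sum(cores for _n, cores, _r, _role in nodes),
        "ram_gb": sum(ram for _n, _c, ram, _role in nodes),
    }


if __name__ == "__main__":
    print(totals(NODES))
```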

## Networking

### External Access

```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```

### DNS Zones

- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal, split-horizon DNS)
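
Since the internal zone is a subdomain of the external one, any code that classifies hostnames has to test the more specific suffix first. The following is a small sketch of that ordering; `zone_for` and the two constants are hypothetical names, and the function only illustrates the naming scheme, not how DNS resolution is actually configured.

```python
EXTERNAL_ZONE = "daviestechlabs.io"
INTERNAL_ZONE = "lab.daviestechlabs.io"


def zone_for(hostname: str) -> str:
    """Classify a hostname as 'internal', 'external', or 'unknown'."""
    host = hostname.rstrip(".").lower()
    # Check the internal zone first: it is a subdomain of the external zone,
    # so the external suffix would match internal names as well.
    if host.endswith("." + INTERNAL_ZONE):
        return "internal"
    if host.endswith("." + EXTERNAL_ZONE):
        return "external"
    return "unknown"


if __name__ == "__main__":
    for name in ("git.daviestechlabs.io", "kubeflow.lab.daviestechlabs.io", "example.com"):
        print(name, "->", zone_for(name))
```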

### Network CIDRs

| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
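
The three ranges must not overlap, and every node address, including the control plane VIP, should fall inside the node network. The sketch below checks both with Python's standard `ipaddress` module; the `NETWORKS` dictionary and the helper names are hypothetical, and the CIDRs are copied from the table above.

```python
import ipaddress

NETWORKS = {
    "node": ipaddress.ip_network("192.168.100.0/24"),
    "pod": ipaddress.ip_network("10.42.0.0/16"),
    "service": ipaddress.ip_network("10.43.0.0/16"),
}


def network_of(address: str) -> str:
    """Return the name of the network containing the address, or 'outside'."""
    ip = ipaddress.ip_address(address)
    for name, net in NETWORKS.items():
        if ip in net:
            return name
    return "outside"


def overlapping_pairs(networks):
    """Return every pair of network names whose address ranges overlap."""
    names = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    print("VIP is in:", network_of("192.168.100.20"))
    print("overlaps:", overlapping_pairs(NETWORKS))
```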

## Data Flow: Chat Request

```mermaid
sequenceDiagram
    participant U as User
    participant W as WebApp
    participant N as NATS
    participant C as Chat Handler
    participant M as Milvus
    participant L as vLLM
    participant V as Valkey

    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver to chat-handler
    C->>V: Get session history
    C->>M: RAG query (if enabled)
    M-->>C: Relevant documents
    C->>L: LLM inference (with context)
    L-->>C: Streaming tokens
    C->>N: Publish ai.chat.response.stream.{id}
    N-->>W: Deliver streaming chunks
    W-->>U: Display tokens
    C->>V: Save to session
```
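
The two subjects in the sequence diagram follow a token-per-segment naming scheme, which lets subscribers use NATS wildcards. The sketch below builds the subjects and reimplements NATS-style matching in plain Python so the rules are visible: `*` matches exactly one token and `>` matches one or more trailing tokens. The helper names, the wildcard patterns in the usage lines, and the identifier check are assumptions for illustration. A real handler would subscribe through a NATS client rather than match subjects itself.

```python
def _check_token(value: str) -> str:
    """Reject identifiers that would break the subject's token structure."""
    if not value or any(ch in value for ch in ".*> \t\r\n"):
        raise ValueError(f"invalid subject token: {value!r}")
    return value


def user_message_subject(ident: str) -> str:
    """Subject the WebApp publishes to, as shown in the diagram."""
    return f"ai.chat.user.{_check_token(ident)}.message"


def response_stream_subject(ident: str) -> str:
    """Subject the chat handler streams response chunks to."""
    return f"ai.chat.response.stream.{_check_token(ident)}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final pattern token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    subject = user_message_subject("42")
    print(subject, subject_matches("ai.chat.user.*.message", subject))
    print(response_stream_subject("42"), subject_matches("ai.chat.>", "ai.chat.response.stream.42"))
```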

## GitOps Flow

```
Developer → Git Push → GitHub/Gitea
                           │
                           ▼
                    ┌─────────────┐
                    │   Flux CD   │
                    │ (reconcile) │
                    └──────┬──────┘
                           │
            ┌──────────────┼──────────────┐
            ▼              ▼              ▼
     ┌──────────┐   ┌──────────┐   ┌──────────┐
     │homelab-  │   │  llm-    │   │  helm    │
     │  k8s2    │   │workflows │   │ charts   │
     └──────────┘   └──────────┘   └──────────┘
            │              │              │
            └──────────────┴──────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Kubernetes │
                    │   Cluster   │
                    └─────────────┘
```
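
The "reconcile" step in the diagram is a compare-and-converge loop: the state declared in Git is compared with the state in the cluster, and the difference is applied. The sketch below is a toy model of that idea using plain dictionaries. It is not how Flux is implemented, and `plan` and `reconcile` are hypothetical names introduced for illustration.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute the actions needed to make `actual` match `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the plan and return the resulting state (pure function, no I/O)."""
    actions = plan(desired, actual)
    new_state = {k: v for k, v in actual.items() if k not in actions["delete"]}
    for key in actions["create"] + actions["update"]:
        new_state[key] = desired[key]
    return new_state


if __name__ == "__main__":
    git = {"deploy/chat-handler": "v2", "deploy/voice-assistant": "v1"}
    cluster = {"deploy/chat-handler": "v1", "deploy/old-service": "v1"}
    print(plan(git, cluster))
    print(reconcile(git, cluster))
```

Running the loop a second time on its own output produces an empty plan, which is the property that makes repeated reconciliation safe.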

## Security Architecture

### Secrets Management

```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```

### Authentication

```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   │
                                   └──► OIDC/SAML providers
```

### Network Security

- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions

## High Availability

### Control Plane

- 3-node etcd cluster with automatic leader election (tolerates the loss of one member)
- Virtual IP (192.168.100.20) for API server access
- Automatic VIP failover handled by Talos
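
The reason for running three control plane nodes follows from etcd's majority quorum: a cluster of n members needs n // 2 + 1 of them available to accept writes. The short sketch below works through that arithmetic; the function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

A fourth member would raise the quorum to three without improving fault tolerance, which is why odd member counts are the usual choice.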

### Workloads

- Pod anti-affinity to spread critical services across nodes
- HorizontalPodAutoscaler (HPA) for auto-scaling
- PodDisruptionBudgets for controlled updates

### Storage

- Longhorn volumes default to 3 replicas
- MinIO erasure coding for S3 object storage
- Regular Velero backups

## Observability

### Metrics Pipeline

```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```

### Logging Pipeline

```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```

### Tracing Pipeline

```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```

## Key Design Decisions

| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |
| Go handler refactor | Slim images for non-ML services | [ADR-0061](decisions/0061-go-handler-refactor.md) |

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions