feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
287
ARCHITECTURE.md
Normal file
@@ -0,0 +1,287 @@
# 🏗️ System Architecture

> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**

## Overview

The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.

## System Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 USER LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │ Companions WebApp│  │ Voice WebApp     │  │ Kubeflow UI      │           │
│  │ HTMX + Alpine    │  │ Gradio UI        │  │ Pipeline Mgmt    │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │ WebSocket           │ HTTP/WS             │ HTTP                │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                INGRESS LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                    │
│                                                                             │
│  External: *.daviestechlabs.io         Internal: *.lab.daviestechlabs.io    │
│  • git.daviestechlabs.io               • kubeflow.lab.daviestechlabs.io     │
│  • auth.daviestechlabs.io              • companions-chat.lab...             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MESSAGE BUS LAYER                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                              NATS + JetStream                               │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Streams:                                                            │   │
│   │   • COMPANIONS_LOGINS (7d retention)   - User analytics             │   │
│   │   • COMPANIONS_CHAT   (30d retention)  - Chat history               │   │
│   │   • AI_CHAT_STREAM    (5min, memory)   - Ephemeral streaming        │   │
│   │   • AI_VOICE_STREAM   (1h, file)       - Voice processing           │   │
│   │   • AI_PIPELINE       (24h, file)      - Workflow triggers          │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Message Format: MessagePack (binary, not JSON)                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
             ┌─────────────────────────┼─────────────────────────┐
             ▼                         ▼                         ▼
   ┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
   │ Chat Handler      │     │ Voice Assistant   │     │ Pipeline Bridge   │
   ├───────────────────┤     ├───────────────────┤     ├───────────────────┤
   │ • RAG retrieval   │     │ • STT (Whisper)   │     │ • KFP triggers    │
   │ • LLM inference   │     │ • RAG retrieval   │     │ • Argo triggers   │
   │ • Streaming resp  │     │ • LLM inference   │     │ • Status updates  │
   │ • Session state   │     │ • TTS (XTTS)      │     │ • Error handling  │
   └───────────────────┘     └───────────────────┘     └───────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              AI SERVICES LAYER                              │
├─────────────────────────────────────────────────────────────────────────────┤
│   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │
│   │ Whisper │ │  XTTS   │ │  vLLM   │ │ Milvus  │ │   BGE   │ │Reranker │   │
│   │  (STT)  │ │  (TTS)  │ │  (LLM)  │ │  (RAG)  │ │ (Embed) │ │  (BGE)  │   │
│   ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤   │
│   │ KServe  │ │ KServe  │ │  vLLM   │ │  Helm   │ │ KServe  │ │ KServe  │   │
│   │ nvidia  │ │ nvidia  │ │  ROCm   │ │  Minio  │ │  rdna2  │ │  intel  │   │
│   └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            WORKFLOW ENGINE LAYER                            │
├─────────────────────────────────────────────────────────────────────────────┤
│      ┌────────────────────────────┐    ┌────────────────────────────┐       │
│      │       Argo Workflows       │◄──►│     Kubeflow Pipelines     │       │
│      ├────────────────────────────┤    ├────────────────────────────┤       │
│      │ • Complex DAG orchestration│    │ • ML pipeline caching      │       │
│      │ • Training workflows       │    │ • Experiment tracking      │       │
│      │ • Document ingestion       │    │ • Model versioning         │       │
│      │ • Batch inference          │    │ • Artifact lineage         │       │
│      └────────────────────────────┘    └────────────────────────────┘       │
│                                                                             │
│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)            │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INFRASTRUCTURE LAYER                             │
├─────────────────────────────────────────────────────────────────────────────┤
│  Storage:                 Compute:                   Security:              │
│  ├─ Longhorn (block)      ├─ Volcano Scheduler       ├─ Vault (secrets)     │
│  ├─ NFS CSI (shared)      ├─ GPU Device Plugins      ├─ Authentik (SSO)     │
│  └─ MinIO (S3)            │  ├─ AMD ROCm             ├─ Falco (runtime)     │
│                           │  ├─ NVIDIA CUDA          └─ SOPS (GitOps)       │
│  Databases:               │  └─ Intel i915/Arc                              │
│  ├─ CloudNative-PG        └─ Node Feature Discovery                         │
│  ├─ Valkey (cache)                                                          │
│  └─ ClickHouse (analytics)                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               PLATFORM LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI                      │
│                                                                             │
│  Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt,       │
│                                         │ danilo (workers)                  │
└─────────────────────────────────────────────────────────────────────────────┘
```

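The stream list in the message bus layer can also be written down as data. The following is a minimal sketch, not the project's JetStream configuration: `StreamSpec` and its field names are hypothetical, retention periods are simply converted to seconds, and the storage type is left as `None` where the diagram does not state one.

```python
from dataclasses import dataclass
from typing import Optional

MINUTE, HOUR, DAY = 60, 3600, 86400


@dataclass(frozen=True)
class StreamSpec:
    """Hypothetical description of one stream taken from the diagram."""
    name: str
    max_age_seconds: int
    storage: Optional[str]  # None where the diagram does not say
    purpose: str


STREAMS = [
    StreamSpec("COMPANIONS_LOGINS", 7 * DAY, None, "User analytics"),
    StreamSpec("COMPANIONS_CHAT", 30 * DAY, None, "Chat history"),
    StreamSpec("AI_CHAT_STREAM", 5 * MINUTE, "memory", "Ephemeral streaming"),
    StreamSpec("AI_VOICE_STREAM", 1 * HOUR, "file", "Voice processing"),
    StreamSpec("AI_PIPELINE", 24 * HOUR, "file", "Workflow triggers"),
]


def ephemeral_streams(streams):
    """Return the names of streams that are held in memory only."""
    return [s.name for s in streams if s.storage == "memory"]


if __name__ == "__main__":
    for s in STREAMS:
        storage = s.storage or "unspecified"
        print(f"{s.name:<18} {s.max_age_seconds:>8}s  {storage}")
```

Keeping the retention values in one table like this makes it easy to spot that only the chat token stream is memory-backed, which fits its five-minute lifetime.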
## Node Topology

### Control Plane (HA)

| Node | IP | CPU | Memory | Storage | Role |
|------|-------|-----|--------|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |

**VIP**: 192.168.100.20 (shared across control plane)

### Worker Nodes

| Node | IP | CPU | GPU | GPU Memory | Workload |
|------|-------|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |

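The worker table pins each AI workload to the node whose GPU suits it. As a rough illustration only, the mapping can be expressed as a lookup; `WORKERS` and `node_for` are hypothetical names, and in the cluster itself this placement would be expressed through Kubernetes scheduling rules rather than Python.

```python
# Worker table from this document, as (node, gpu, workloads).
WORKERS = [
    ("elminster", "NVIDIA RTX 2070", ("Whisper", "XTTS")),
    ("khelben", "AMD Strix Halo", ("vLLM",)),
    ("drizzt", "AMD Radeon 680M", ("BGE Embeddings",)),
    ("danilo", "Intel Arc", ("Reranker",)),
]


def node_for(workload: str) -> str:
    """Return the node that the table assigns to a workload."""
    for node, _gpu, workloads in WORKERS:
        if workload in workloads:
            return node
    raise KeyError(f"no node listed for workload {workload!r}")


if __name__ == "__main__":
    for name in ("Whisper", "vLLM", "Reranker"):
        print(name, "->", node_for(name))
```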
## Networking

### External Access

```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```

### DNS Zones

- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)

### Network CIDRs

| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |

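The address plan can be sanity-checked with Python's standard `ipaddress` module. This is a small sketch under the assumption that the node IPs and VIP listed under Node Topology are complete; the helper names are illustrative.

```python
import ipaddress
from itertools import combinations

NETWORKS = {
    "node": ipaddress.ip_network("192.168.100.0/24"),
    "pod": ipaddress.ip_network("10.42.0.0/16"),
    "service": ipaddress.ip_network("10.43.0.0/16"),
}

# Addresses from the node tables plus the control-plane VIP.
NODE_IPS = {
    "vip": "192.168.100.20",
    "storm": "192.168.100.25",
    "bruenor": "192.168.100.26",
    "catti": "192.168.100.27",
    "elminster": "192.168.100.31",
    "khelben": "192.168.100.32",
    "drizzt": "192.168.100.40",
    "danilo": "192.168.100.41",
}


def overlapping_pairs(networks):
    """Return name pairs of networks whose address ranges overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_network(ips, network):
    """Return the names whose address is not inside the given network."""
    return [
        name for name, ip in ips.items()
        if ipaddress.ip_address(ip) not in network
    ]


if __name__ == "__main__":
    print("overlaps:", overlapping_pairs(NETWORKS))
    print("outside node network:", outside_network(NODE_IPS, NETWORKS["node"]))
```

Both lists come back empty for the values in this document: the three CIDRs are disjoint and every listed node address falls inside the node network.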
## Data Flow: Chat Request

```mermaid
sequenceDiagram
    participant U as User
    participant W as WebApp
    participant N as NATS
    participant C as Chat Handler
    participant M as Milvus
    participant L as vLLM
    participant V as Valkey

    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver to chat-handler
    C->>V: Get session history
    C->>M: RAG query (if enabled)
    M-->>C: Relevant documents
    C->>L: LLM inference (with context)
    L-->>C: Streaming tokens
    C->>N: Publish ai.chat.response.stream.{id}
    N-->>W: Deliver streaming chunks
    W-->>U: Display tokens
    C->>V: Save to session
```

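The two subjects in the sequence diagram follow NATS's dot-separated token convention. The sketch below builds them and shows how a NATS-style wildcard subscription would match them, where `*` matches exactly one token and `>` matches one or more trailing tokens. The function names and the wildcard patterns in the tests are assumptions for illustration; the diagram does not say which subscriptions the chat handler actually uses, nor what kind of identifier `{id}` is.

```python
def _token(value: str) -> str:
    """Reject values that would break the subject's token structure."""
    if not value or any(ch in ".*>" or ch.isspace() for ch in value):
        raise ValueError(f"invalid subject token: {value!r}")
    return value


def chat_request_subject(ident: str) -> str:
    """Subject the web app publishes a user message on."""
    return f"ai.chat.user.{_token(ident)}.message"


def chat_stream_subject(ident: str) -> str:
    """Subject the chat handler publishes streamed response chunks on."""
    return f"ai.chat.response.stream.{_token(ident)}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs something to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    subject = chat_request_subject("42")
    print(subject, subject_matches("ai.chat.user.*.message", subject))
```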
## GitOps Flow

```
Developer → Git Push → GitHub/Gitea
                            │
                            ▼
                     ┌─────────────┐
                     │   Flux CD   │
                     │ (reconcile) │
                     └──────┬──────┘
                            │
             ┌──────────────┼──────────────┐
             ▼              ▼              ▼
       ┌──────────┐   ┌──────────┐   ┌──────────┐
       │homelab-  │   │   llm-   │   │   helm   │
       │  k8s2    │   │workflows │   │  charts  │
       └──────────┘   └──────────┘   └──────────┘
             │              │              │
             └──────────────┴──────────────┘
                            │
                            ▼
                     ┌─────────────┐
                     │ Kubernetes  │
                     │   Cluster   │
                     └─────────────┘
```

## Security Architecture

### Secrets Management

```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```

### Authentication

```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   │
                                   └──► OIDC/SAML providers
```

### Network Security

- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions

## High Availability

### Control Plane

- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos

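Why three control-plane nodes: etcd uses majority (Raft) quorum, so the number of members that can fail without losing write availability follows directly from the member count. A small worked example, with illustrative function names:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    if members < 1:
        raise ValueError("cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

With the three nodes listed here the quorum is two, so the control plane survives the loss of any single node. A fourth member would raise the quorum to three without tolerating any additional failure.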
### Workloads

- Pod anti-affinity for critical services
- HPA for auto-scaling
- PodDisruptionBudgets for controlled updates

### Storage

- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups

## Observability

### Metrics Pipeline

```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```

### Logging Pipeline

```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```

### Tracing Pipeline

```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```

## Key Design Decisions

| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |

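The "binary efficiency for audio" rationale can be made concrete. JSON has no byte-string type, so raw audio has to be base64-encoded (roughly a third larger), whereas MessagePack carries bytes as-is behind a short length header. The sketch below hand-encodes a small subset of the MessagePack format (fixstr, bin 8/16/32, fixmap) using only the standard library so that it runs anywhere; a real service would use a MessagePack library instead. The message fields `session` and `audio` are hypothetical and not taken from the project's schemas.

```python
import base64
import json
import struct


def pack(obj) -> bytes:
    """Encode a tiny MessagePack subset: short str, bytes, small dict."""
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) > 31:
            raise ValueError("sketch only handles fixstr (<= 31 bytes)")
        return bytes([0xA0 | len(data)]) + data  # fixstr
    if isinstance(obj, (bytes, bytearray)):
        n = len(obj)
        if n < 2**8:
            header = b"\xc4" + struct.pack(">B", n)  # bin 8
        elif n < 2**16:
            header = b"\xc5" + struct.pack(">H", n)  # bin 16
        elif n < 2**32:
            header = b"\xc6" + struct.pack(">I", n)  # bin 32
        else:
            raise ValueError("payload too large for bin 32")
        return header + bytes(obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("sketch only handles fixmap (<= 15 entries)")
        out = bytes([0x80 | len(obj)])  # fixmap
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def compare_sizes(audio: bytes, session: str = "abc"):
    """Return (messagepack_bytes, json_bytes) for the same logical message."""
    packed = pack({"session": session, "audio": audio})
    as_json = json.dumps(
        {"session": session, "audio": base64.b64encode(audio).decode("ascii")}
    ).encode("utf-8")
    return len(packed), len(as_json)


if __name__ == "__main__":
    mp, js = compare_sizes(bytes(4096))
    print(f"MessagePack: {mp} bytes, JSON+base64: {js} bytes")
```

For a 4 KiB audio chunk the MessagePack framing adds only a few bytes on top of the payload, while the JSON variant pays the base64 expansion on every chunk.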
## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions