feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
Commit: 832cda34bd
Parent: 4d4f6f464c
Date: 2026-02-01 14:30:05 -05:00
26 changed files with 3805 additions and 2 deletions

diagrams/README.md

@@ -0,0 +1,35 @@
# Diagrams
This directory contains additional architecture diagrams beyond the main C4 diagrams.
## Available Diagrams
| File | Description |
|------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |
## Rendering Diagrams
### VS Code
Install the "Markdown Preview Mermaid Support" extension.
### CLI
```bash
# Using mmdc (Mermaid CLI)
npx @mermaid-js/mermaid-cli -i diagram.mmd -o diagram.png
```
### Online
Paste the diagram source into the [Mermaid Live Editor](https://mermaid.live).
## Diagram Conventions
1. Use the `.mmd` extension for Mermaid diagrams
2. Include the title as a comment at the top of the file
3. Use consistent styling classes
4. Keep diagrams focused (one concept per diagram)
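For illustration, a minimal diagram following these conventions might look like the following (a hypothetical `example.mmd`; the node names are placeholders, not part of the real diagrams):

```mermaid
%% Example: minimal service overview
flowchart LR
    app["WebApp"] --> nats["NATS"]
    nats --> handler["Chat Handler"]
    classDef infra fill:#4A90D9,color:white
    class nats infra
```

Note the title comment on the first line and the single `classDef`/`class` pair for styling, matching conventions 2 and 3.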

diagrams/data-flow-chat.mmd

@@ -0,0 +1,51 @@
%% Chat Request Data Flow
%% Sequence diagram showing chat message processing
sequenceDiagram
autonumber
participant U as User
participant W as WebApp<br/>(companions)
participant N as NATS
participant C as Chat Handler
participant V as Valkey<br/>(Cache)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver message
C->>V: Get session history
V-->>C: Previous messages
alt RAG Enabled
    C->>E: Generate query embedding
    E-->>C: Query vector
    C->>M: Search similar chunks
    M-->>C: Top-K chunks
    opt Reranker Enabled
        C->>R: Rerank chunks
        R-->>C: Reordered chunks
    end
end
C->>L: LLM inference (context + query)
alt Streaming Enabled
    loop For each token
        L-->>C: Token
        C->>N: Publish ai.chat.response.stream.{id}
        N-->>W: Deliver chunk
        W-->>U: Display token
    end
else Non-streaming
    L-->>C: Full response
    C->>N: Publish ai.chat.response.{id}
    N-->>W: Deliver response
    W-->>U: Display response
end
C->>V: Save to session history
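The streaming branch above can be sketched in a few lines of Python. Only the subject patterns (`ai.chat.response.stream.{id}`) come from the diagram; the helper names and the token reassembly are hypothetical, with the NATS transport itself stubbed out:

```python
def stream_subject(session_id: str) -> str:
    """Build the per-session streaming subject used in the diagram.

    The handler publishes one message per token on this subject, and the
    web app subscribes to it to display tokens as they arrive.
    """
    return f"ai.chat.response.stream.{session_id}"


def reassemble(chunks: list[str]) -> str:
    """Reassemble the full response from ordered token chunks."""
    return "".join(chunks)


subject = stream_subject("abc123")
text = reassemble(["Hel", "lo", ", wor", "ld"])
```

In the non-streaming branch the same handler would instead publish a single message on `ai.chat.response.{id}`.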

diagrams/data-flow-voice.mmd

@@ -0,0 +1,46 @@
%% Voice Request Data Flow
%% Sequence diagram showing voice assistant processing
sequenceDiagram
autonumber
participant U as User
participant W as Voice WebApp
participant N as NATS
participant VA as Voice Assistant
participant STT as Whisper<br/>(STT)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
participant TTS as XTTS<br/>(TTS)
U->>W: Record audio
W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes)
N->>VA: Deliver voice request
VA->>STT: Transcribe audio
STT-->>VA: Transcription text
alt RAG Enabled
    VA->>E: Generate query embedding
    E-->>VA: Query vector
    VA->>M: Search similar chunks
    M-->>VA: Top-K chunks
    opt Reranker Enabled
        VA->>R: Rerank chunks
        R-->>VA: Reordered chunks
    end
end
VA->>L: LLM inference
L-->>VA: Response text
VA->>TTS: Synthesize speech
TTS-->>VA: Audio bytes
VA->>N: Publish ai.voice.response.{id}<br/>(text + audio)
N-->>W: Deliver response
W-->>U: Play audio + show text
Note over VA,TTS: Total latency target: < 3s
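The diagram notes that the voice request is published as msgpack with raw audio bytes. A stdlib-only sketch of that envelope is below; json + base64 stands in for MessagePack (MessagePack carries raw bytes directly, which is exactly why base64 is needed here and not in the real payload). The function name and field names are hypothetical:

```python
import base64
import json


def encode_voice_request(session_id: str, audio: bytes) -> tuple[str, bytes]:
    """Build (subject, payload) for a voice request.

    Subject follows the diagram: ai.voice.user.{id}.request. The real
    system serializes with MessagePack; json + base64 is a stdlib
    stand-in, so the binary audio must be base64-encoded into a string.
    """
    subject = f"ai.voice.user.{session_id}.request"
    payload = json.dumps({
        "session_id": session_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }).encode("utf-8")
    return subject, payload


subject, payload = encode_voice_request("abc123", b"\x00\x01RIFF")
```

The response on `ai.voice.response.{id}` would carry both the response text and the synthesized audio in the same envelope style.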

diagrams/gpu-allocation.mmd

@@ -0,0 +1,47 @@
%% GPU Allocation Diagram
%% Shows how AI workloads are distributed across GPU nodes
flowchart TB
subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
    direction TB
    vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
end
subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
    direction TB
    whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
    xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
end
subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
    direction TB
    embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
end
subgraph danilo["🖥️ danilo (Intel Arc)"]
    direction TB
    reranker["📋 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
end
subgraph workloads["Workload Routing"]
    chat["💬 Chat Request"]
    voice["🎤 Voice Request"]
end
chat --> embeddings
chat --> reranker
chat --> vllm
voice --> whisper
voice --> embeddings
voice --> reranker
voice --> vllm
voice --> xtts
classDef nvidia fill:#76B900,color:white
classDef amd fill:#ED1C24,color:white
classDef intel fill:#0071C5,color:white
class whisper,xtts nvidia
class vllm,embeddings amd
class reranker intel
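As a hypothetical Kubernetes fragment, pinning one of these workloads to its node might look like the manifest below. The labels and resource request are assumptions, not taken from the cluster manifests; note that `nvidia.com/gpu` is requested in whole units, so the ~50% shares shown for Whisper and XTTS on elminster would in practice rely on GPU time-slicing or on co-scheduling both pods without a device request:

```yaml
# Hypothetical placement for Whisper on elminster (NVIDIA RTX 2070)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whisper
  template:
    metadata:
      labels:
        app: whisper
    spec:
      nodeSelector:
        kubernetes.io/hostname: elminster
      containers:
        - name: whisper
          image: whisper-stt:latest  # placeholder image name
          resources:
            limits:
              nvidia.com/gpu: 1
```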