feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
Commit: 832cda34bd
Parent: 4d4f6f464c
Date: 2026-02-01 14:30:05 -05:00
26 changed files with 3805 additions and 2 deletions

diagrams/README.md

@@ -0,0 +1,35 @@
# Diagrams
This directory contains additional architecture diagrams beyond the main C4 diagrams.
## Available Diagrams
| File | Description |
|------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |
## Rendering Diagrams
### VS Code
Install the "Markdown Preview Mermaid Support" extension.
### CLI
```bash
# Using mmdc (Mermaid CLI)
npx @mermaid-js/mermaid-cli -i diagram.mmd -o diagram.png
```
### Online
Paste the diagram source into the [Mermaid Live Editor](https://mermaid.live).
## Diagram Conventions
1. Use the `.mmd` extension for Mermaid diagrams
2. Include the title as a comment at the top of the file
3. Use consistent styling classes
4. Keep diagrams focused (one concept per diagram)
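For illustration, a minimal diagram following these conventions might look like the following (a hypothetical `example.mmd`; the node names are placeholders, not part of the real diagrams):

```mermaid
%% Example: minimal service overview
flowchart LR
    app["WebApp"] --> nats["NATS"]
    nats --> handler["Chat Handler"]
    classDef infra fill:#4A90D9,color:white
    class nats infra
```

Note the title comment on the first line and the single `classDef`/`class` pair for styling, matching conventions 2 and 3.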

diagrams/data-flow-chat.mmd

@@ -0,0 +1,51 @@
%% Chat Request Data Flow
%% Sequence diagram showing chat message processing
sequenceDiagram
autonumber
participant U as User
participant W as WebApp<br/>(companions)
participant N as NATS
participant C as Chat Handler
participant V as Valkey<br/>(Cache)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver message
C->>V: Get session history
V-->>C: Previous messages
alt RAG Enabled
    C->>E: Generate query embedding
    E-->>C: Query vector
    C->>M: Search similar chunks
    M-->>C: Top-K chunks
    opt Reranker Enabled
        C->>R: Rerank chunks
        R-->>C: Reordered chunks
    end
end
C->>L: LLM inference (context + query)
alt Streaming Enabled
    loop For each token
        L-->>C: Token
        C->>N: Publish ai.chat.response.stream.{id}
        N-->>W: Deliver chunk
        W-->>U: Display token
    end
else Non-streaming
    L-->>C: Full response
    C->>N: Publish ai.chat.response.{id}
    N-->>W: Deliver response
    W-->>U: Display response
end
C->>V: Save to session history
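The streaming branch above can be sketched in a few lines of Python. Only the subject patterns (`ai.chat.response.stream.{id}`) come from the diagram; the helper names and the token reassembly are hypothetical, with the NATS transport itself stubbed out:

```python
def stream_subject(session_id: str) -> str:
    """Build the per-session streaming subject used in the diagram.

    The handler publishes one message per token on this subject, and the
    web app subscribes to it to display tokens as they arrive.
    """
    return f"ai.chat.response.stream.{session_id}"


def reassemble(chunks: list[str]) -> str:
    """Reassemble the full response from ordered token chunks."""
    return "".join(chunks)


subject = stream_subject("abc123")
text = reassemble(["Hel", "lo", ", wor", "ld"])
```

In the non-streaming branch the same handler would instead publish a single message on `ai.chat.response.{id}`.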

diagrams/data-flow-voice.mmd

@@ -0,0 +1,46 @@
%% Voice Request Data Flow
%% Sequence diagram showing voice assistant processing
sequenceDiagram
autonumber
participant U as User
participant W as Voice WebApp
participant N as NATS
participant VA as Voice Assistant
participant STT as Whisper<br/>(STT)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
participant TTS as XTTS<br/>(TTS)
U->>W: Record audio
W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes)
N->>VA: Deliver voice request
VA->>STT: Transcribe audio
STT-->>VA: Transcription text
alt RAG Enabled
    VA->>E: Generate query embedding
    E-->>VA: Query vector
    VA->>M: Search similar chunks
    M-->>VA: Top-K chunks
    opt Reranker Enabled
        VA->>R: Rerank chunks
        R-->>VA: Reordered chunks
    end
end
VA->>L: LLM inference
L-->>VA: Response text
VA->>TTS: Synthesize speech
TTS-->>VA: Audio bytes
VA->>N: Publish ai.voice.response.{id}<br/>(text + audio)
N-->>W: Deliver response
W-->>U: Play audio + show text
Note over VA,TTS: Total latency target: < 3s
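The diagram notes that the voice request is published as msgpack with raw audio bytes. A stdlib-only sketch of that envelope is below; json + base64 stands in for MessagePack (MessagePack carries raw bytes directly, which is exactly why base64 is needed here and not in the real payload). The function name and field names are hypothetical:

```python
import base64
import json


def encode_voice_request(session_id: str, audio: bytes) -> tuple[str, bytes]:
    """Build (subject, payload) for a voice request.

    Subject follows the diagram: ai.voice.user.{id}.request. The real
    system serializes with MessagePack; json + base64 is a stdlib
    stand-in, so the binary audio must be base64-encoded into a string.
    """
    subject = f"ai.voice.user.{session_id}.request"
    payload = json.dumps({
        "session_id": session_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }).encode("utf-8")
    return subject, payload


subject, payload = encode_voice_request("abc123", b"\x00\x01RIFF")
```

The response on `ai.voice.response.{id}` would carry both the response text and the synthesized audio in the same envelope style.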

diagrams/gpu-allocation.mmd

@@ -0,0 +1,47 @@
%% GPU Allocation Diagram
%% Shows how AI workloads are distributed across GPU nodes
flowchart TB
subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
    direction TB
    vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
end
subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
    direction TB
    whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
    xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
end
subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
    direction TB
    embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
end
subgraph danilo["🖥️ danilo (Intel Arc)"]
    direction TB
    reranker["📋 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
end
subgraph workloads["Workload Routing"]
    chat["💬 Chat Request"]
    voice["🎤 Voice Request"]
end
chat --> embeddings
chat --> reranker
chat --> vllm
voice --> whisper
voice --> embeddings
voice --> reranker
voice --> vllm
voice --> xtts
classDef nvidia fill:#76B900,color:white
classDef amd fill:#ED1C24,color:white
classDef intel fill:#0071C5,color:white
class whisper,xtts nvidia
class vllm,embeddings amd
class reranker intel
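As a hypothetical Kubernetes fragment, pinning one of these workloads to its node might look like the manifest below. The labels and resource request are assumptions, not taken from the cluster manifests; note that `nvidia.com/gpu` is requested in whole units, so the ~50% shares shown for Whisper and XTTS on elminster would in practice rely on GPU time-slicing or on co-scheduling both pods without a device request:

```yaml
# Hypothetical placement for Whisper on elminster (NVIDIA RTX 2070)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whisper
  template:
    metadata:
      labels:
        app: whisper
    spec:
      nodeSelector:
        kubernetes.io/hostname: elminster
      containers:
        - name: whisper
          image: whisper-stt:latest  # placeholder image name
          resources:
            limits:
              nvidia.com/gpu: 1
```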