feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
Commit 832cda34bd (parent 4d4f6f464c), 2026-02-01 14:30:05 -05:00
26 changed files with 3805 additions and 2 deletions

---

**AGENT-ONBOARDING.md** (new file, 191 lines)
# 🤖 Agent Onboarding
> **This is the most important file for AI agents working on this codebase.**
## TL;DR
You are working on a **homelab Kubernetes cluster** running:
- **Talos Linux v1.12.1** on bare-metal nodes
- **Kubernetes v1.35.0** with Flux CD GitOps
- **AI/ML platform** with KServe, Kubeflow, Milvus, NATS
- **Multi-GPU** (AMD ROCm, NVIDIA CUDA, Intel Arc)
## 🗺️ Repository Map
| Repo | What It Contains | When to Edit |
|------|------------------|--------------|
| `homelab-k8s2` | Kubernetes manifests, Talos config, Flux | Infrastructure changes |
| `llm-workflows` | NATS handlers, Argo/KFP workflows | Workflow/handler changes |
| `companions-frontend` | Go server, HTMX UI, VRM avatars | Frontend changes |
| `homelab-design` (this) | Architecture docs, ADRs | Design decisions |
## 🏗️ System Architecture (30-Second Version)
```
┌─────────────────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ Companions WebApp │ Voice WebApp │ Kubeflow UI │ CLI │
└───────────────────────────┬─────────────────────────────────────┘
│ WebSocket/HTTP
┌─────────────────────────────────────────────────────────────────┐
│ NATS MESSAGE BUS │
│ Subjects: ai.chat.*, ai.voice.*, ai.pipeline.* │
│ Format: MessagePack (binary) │
└───────────────────────────┬─────────────────────────────────────┘
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Chat Handler │ │Voice Assistant│ │Pipeline Bridge│
│ (RAG+LLM) │ │ (STT→LLM→TTS) │ │ (KFP/Argo) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AI SERVICES │
│ Whisper │ XTTS │ vLLM │ Milvus │ BGE Embed │ Reranker │
│ STT │ TTS │ LLM │ RAG │ Embed │ Rank │
└─────────────────────────────────────────────────────────────────┘
```
## 📁 Key File Locations
### Infrastructure (`homelab-k8s2`)
```
kubernetes/apps/
├── ai-ml/ # 🧠 AI/ML services
│ ├── kserve/ # InferenceServices
│ ├── kubeflow/ # Pipelines, Training Operator
│ ├── milvus/ # Vector database
│ ├── nats/ # Message bus
│ ├── vllm/ # LLM inference
│ └── llm-workflows/ # GitRepo sync to llm-workflows
├── analytics/ # 📊 Spark, Flink, ClickHouse
├── observability/ # 📈 Grafana, Alloy, OpenTelemetry
└── security/ # 🔒 Vault, Authentik, Falco
talos/
├── talconfig.yaml # Node definitions
├── patches/ # GPU-specific patches
│ ├── amd/amdgpu.yaml
│ └── nvidia/nvidia-runtime.yaml
```
### Workflows (`llm-workflows`)
```
workflows/ # NATS handler deployments
├── chat-handler.yaml
├── voice-assistant.yaml
└── pipeline-bridge.yaml
argo/ # Argo WorkflowTemplates
├── document-ingestion.yaml
├── batch-inference.yaml
└── qlora-training.yaml
pipelines/ # Kubeflow Pipeline Python
├── voice_pipeline.py
└── document_ingestion_pipeline.py
```
## 🔌 Service Endpoints (Internal)
```python
# Copy-paste ready for Python code
NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222"
VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local"
TTS_URL = "http://tts-predictor.ai-ml.svc.cluster.local"
EMBEDDINGS_URL = "http://embeddings-predictor.ai-ml.svc.cluster.local"
RERANKER_URL = "http://reranker-predictor.ai-ml.svc.cluster.local"
MILVUS_HOST = "milvus.ai-ml.svc.cluster.local"
MILVUS_PORT = 19530
VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"
```
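vLLM serves an OpenAI-compatible API at the `/v1` prefix above, so requests follow that spec. A hedged sketch of building a chat-completion request (the model name is a placeholder, not the actual served model):

```python
import json

VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"

def build_chat_request(prompt: str, stream: bool = True) -> tuple[str, bytes]:
    """Return (url, body) for a vLLM chat-completion call.

    The model name below is a placeholder; query GET {VLLM_URL}/models
    to see what the server actually serves.
    """
    payload = {
        "model": "served-model-name",  # placeholder, not the real model
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return f"{VLLM_URL}/chat/completions", json.dumps(payload).encode()
```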
## 📨 NATS Subject Patterns
```python
# Chat
f"ai.chat.user.{user_id}.message" # User sends message
f"ai.chat.response.{request_id}" # Response back
f"ai.chat.response.stream.{request_id}" # Streaming tokens
# Voice
f"ai.voice.user.{user_id}.request" # Voice input
f"ai.voice.response.{request_id}" # Voice output
# Pipelines
"ai.pipeline.trigger" # Trigger any pipeline
f"ai.pipeline.status.{request_id}" # Status updates
```
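Publishers and handlers should build these subjects through shared helpers rather than ad-hoc f-strings, so both sides always agree; a minimal sketch (function names are illustrative, not from the repos):

```python
def chat_message_subject(user_id: str) -> str:
    """Subject a user publishes chat messages to."""
    return f"ai.chat.user.{user_id}.message"

def chat_stream_subject(request_id: str) -> str:
    """Subject streaming response tokens come back on."""
    return f"ai.chat.response.stream.{request_id}"

def pipeline_status_subject(request_id: str) -> str:
    """Subject pipeline status updates are published to."""
    return f"ai.pipeline.status.{request_id}"
```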
## 🎮 GPU Allocation
| Node | GPU | Workload | Memory |
|------|-----|----------|--------|
| khelben | AMD Strix Halo | vLLM (dedicated) | 64GB unified |
| elminster | NVIDIA RTX 2070 | Whisper + XTTS | 8GB VRAM |
| drizzt | AMD Radeon 680M | BGE Embeddings | 12GB VRAM |
| danilo | Intel Arc | Reranker | 16GB shared |
## ⚡ Common Tasks
### Deploy a New AI Service
1. Create InferenceService in `homelab-k8s2/kubernetes/apps/ai-ml/kserve/`
2. Add endpoint to `llm-workflows/config/ai-services-config.yaml`
3. Push to main → Flux deploys automatically
### Add a New Workflow
1. Create handler in `llm-workflows/chat-handler/` or `llm-workflows/voice-assistant/`
2. Add Kubernetes Deployment in `llm-workflows/workflows/`
3. Push to main → Flux deploys automatically
### Create Architecture Decision
1. Copy `decisions/0000-template.md` to `decisions/NNNN-title.md`
2. Fill in context, decision, consequences
3. Submit PR
## ❌ Antipatterns to Avoid
1. **Don't hardcode secrets** - Use External Secrets Operator
2. **Don't use `latest` tags** - Pin versions for reproducibility
3. **Don't skip ADRs** - Document significant decisions
4. **Don't bypass Flux** - All changes via Git, never `kubectl apply` directly
## 📚 Where to Learn More
- [ARCHITECTURE.md](ARCHITECTURE.md) - Full system design
- [TECH-STACK.md](TECH-STACK.md) - All technologies used
- [decisions/](decisions/) - Why we made certain choices
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities
## 🆘 Quick Debugging
```bash
# Check Flux sync status
flux get all -A
# View NATS JetStream streams
kubectl exec -n ai-ml deploy/nats-box -- nats stream ls
# Check GPU allocation
kubectl describe node khelben | grep -A10 "Allocated"
# View KServe inference services
kubectl get inferenceservices -n ai-ml
# Tail AI service logs
kubectl logs -n ai-ml -l app=chat-handler -f
```
---
*This document is the canonical starting point for AI agents. When in doubt, check the ADRs.*

---

**ARCHITECTURE.md** (new file, 287 lines)
# 🏗️ System Architecture
> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**
## Overview
The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.
## System Layers
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Companions WebApp│ │ Voice WebApp │ │ Kubeflow UI │ │
│ │ HTMX + Alpine │ │ Gradio UI │ │ Pipeline Mgmt │ │
│ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │
│ │ WebSocket │ HTTP/WS │ HTTP │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INGRESS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs │
│ │
│ External: *.daviestechlabs.io Internal: *.lab.daviestechlabs.io │
│ • git.daviestechlabs.io • kubeflow.lab.daviestechlabs.io │
│ • auth.daviestechlabs.io • companions-chat.lab... │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ MESSAGE BUS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ NATS + JetStream │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Streams: │ │
│ │ • COMPANIONS_LOGINS (7d retention) - User analytics │ │
│ │ • COMPANIONS_CHAT (30d retention) - Chat history │ │
│ │ • AI_CHAT_STREAM (5min, memory) - Ephemeral streaming │ │
│ │ • AI_VOICE_STREAM (1h, file) - Voice processing │ │
│ │ • AI_PIPELINE (24h, file) - Workflow triggers │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Message Format: MessagePack (binary, not JSON) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Chat Handler │ │ Voice Assistant │ │ Pipeline Bridge │
├───────────────────┤ ├───────────────────┤ ├───────────────────┤
│ • RAG retrieval │ │ • STT (Whisper) │ │ • KFP triggers │
│ • LLM inference │ │ • RAG retrieval │ │ • Argo triggers │
│ • Streaming resp │ │ • LLM inference │ │ • Status updates │
│ • Session state │ │ • TTS (XTTS) │ │ • Error handling │
└───────────────────┘ └───────────────────┘ └───────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ AI SERVICES LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Whisper │ │ XTTS │ │ vLLM │ │ Milvus │ │ BGE │ │Reranker │ │
│ │ (STT) │ │ (TTS) │ │ (LLM) │ │ (RAG) │ │(Embed) │ │ (BGE) │ │
│ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ │
│ │ KServe │ │ KServe │ │ vLLM │ │ Helm │ │ KServe │ │ KServe │ │
│ │ nvidia │ │ nvidia │ │ ROCm │ │ Minio │ │ rdna2 │ │ intel │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW ENGINE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────────────┐ ┌────────────────────────────┐ │
│ │ Argo Workflows │◄──►│ Kubeflow Pipelines │ │
│ ├────────────────────────────┤ ├────────────────────────────┤ │
│ │ • Complex DAG orchestration│ │ • ML pipeline caching │ │
│ │ • Training workflows │ │ • Experiment tracking │ │
│ │ • Document ingestion │ │ • Model versioning │ │
│ │ • Batch inference │ │ • Artifact lineage │ │
│ └────────────────────────────┘ └────────────────────────────┘ │
│ │
│ Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Storage: Compute: Security: │
│ ├─ Longhorn (block) ├─ Volcano Scheduler ├─ Vault (secrets) │
│ ├─ NFS CSI (shared) ├─ GPU Device Plugins ├─ Authentik (SSO) │
│ └─ MinIO (S3) │ ├─ AMD ROCm ├─ Falco (runtime) │
│ │ ├─ NVIDIA CUDA └─ SOPS (GitOps) │
│ Databases: │ └─ Intel i915/Arc │
│ ├─ CloudNative-PG └─ Node Feature Discovery │
│ ├─ Valkey (cache) │
│ └─ ClickHouse (analytics) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLATFORM LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI │
│ │
│ Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt, │
│ │ danilo (workers) │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Node Topology
### Control Plane (HA)
| Node | IP | CPU | Memory | Storage | Role |
|------|-------|-----|--------|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
**VIP**: 192.168.100.20 (shared across control plane)
### Worker Nodes
| Node | IP | CPU | GPU | GPU Memory | Workload |
|------|-------|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |
## Networking
### External Access
```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```
### DNS Zones
- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)
### Network CIDRs
| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
## Data Flow: Chat Request
```mermaid
sequenceDiagram
participant U as User
participant W as WebApp
participant N as NATS
participant C as Chat Handler
participant M as Milvus
participant L as vLLM
participant V as Valkey
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver to chat-handler
C->>V: Get session history
C->>M: RAG query (if enabled)
M-->>C: Relevant documents
C->>L: LLM inference (with context)
L-->>C: Streaming tokens
C->>N: Publish ai.chat.response.stream.{id}
N-->>W: Deliver streaming chunks
W-->>U: Display tokens
C->>V: Save to session
```
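The sequence above can be sketched end to end with the session store, vector search, and LLM stubbed as in-memory fakes (all names here are illustrative; the real handler lives in `llm-workflows/chat-handler/`):

```python
import asyncio

class FakeSession:  # stands in for Valkey
    def __init__(self):
        self.store = {}
    async def get(self, uid):
        return self.store.get(uid, [])
    async def append(self, uid, user_msg, reply):
        self.store.setdefault(uid, []).extend([user_msg, reply])

class FakeRAG:  # stands in for Milvus + embeddings
    async def search(self, query):
        return ["doc about " + query]

class FakeLLM:  # stands in for vLLM streaming
    async def stream(self, history, docs, prompt):
        for tok in ["hello", " ", "world"]:
            yield tok

async def handle_chat(msg, session, rag, llm):
    history = await session.get(msg["user_id"])                 # session history
    docs = await rag.search(msg["message"]) if msg.get("enable_rag") else []
    tokens = [t async for t in llm.stream(history, docs, msg["message"])]
    await session.append(msg["user_id"], msg["message"], "".join(tokens))
    return tokens

session = FakeSession()
tokens = asyncio.run(handle_chat(
    {"user_id": "u1", "message": "hi", "enable_rag": True},
    session, FakeRAG(), FakeLLM(),
))
```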
## GitOps Flow
```
Developer → Git Push → GitHub/Gitea
┌─────────────┐
│ Flux CD │
│ (reconcile) │
└──────┬──────┘
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│homelab- │ │ llm- │ │ helm │
│ k8s2 │ │workflows │ │ charts │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└──────────────┴──────────────┘
┌─────────────┐
│ Kubernetes │
│ Cluster │
└─────────────┘
```
## Security Architecture
### Secrets Management
```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```
### Authentication
```
User ──► Cloudflare Access ──► Authentik ──► Application
└──► OIDC/SAML providers
```
### Network Security
- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions
## High Availability
### Control Plane
- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos
### Workloads
- Pod anti-affinity for critical services
- HPA for auto-scaling
- PodDisruptionBudgets for controlled updates
### Storage
- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups
## Observability
### Metrics Pipeline
```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```
### Logging Pipeline
```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```
### Tracing Pipeline
```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```
## Key Design Decisions
| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
## Related Documents
- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions

---

**CODING-CONVENTIONS.md** (new file, 424 lines)
# 📐 Coding Conventions
> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**
## Repository Conventions
### homelab-k8s2 (Infrastructure)
```
kubernetes/
├── apps/ # Application deployments
│ └── {namespace}/ # One folder per namespace
│ └── {app}/ # One folder per application
│ ├── app/ # Kubernetes manifests
│ │ ├── kustomization.yaml
│ │ ├── helmrelease.yaml # OR individual manifests
│ │ └── ...
│ └── ks.yaml # Flux Kustomization
├── components/ # Reusable Kustomize components
└── flux/ # Flux system configuration
```
**Naming Conventions:**
- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
### llm-workflows (Orchestration)
```
workflows/ # Kubernetes Deployments for NATS handlers
├── {handler}.yaml # One file per handler
argo/ # Argo WorkflowTemplates
├── {workflow-name}.yaml # One file per workflow
pipelines/ # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py # Pipeline definition
└── kfp-sync-job.yaml # Upload job
{handler}/ # Python source code
├── __init__.py
├── {handler}.py # Main entry point
├── requirements.txt
└── Dockerfile
```
---
## Python Conventions
### Project Structure
```python
from dataclasses import dataclass
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
import msgpack

data = msgpack.packb({"key": "value"})
```
### Naming
| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |
### Type Hints
```python
# Always use type hints
from typing import Any, Dict, List

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...
```
### Error Handling
```python
# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when RAG query fails."""

# Log errors with context
import logging

logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error(f"RAG query failed: {e}", extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
```
### NATS Message Handling
```python
import logging

import msgpack
from nats.aio.msg import Msg

logger = logging.getLogger(__name__)

async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)
        # Process
        result = await process(data)
        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))
        # Acknowledge for JetStream
        await msg.ack()
    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
```
---
## Kubernetes Manifest Conventions
### Labels
```yaml
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform
    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
```
### Annotations
```yaml
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"
    # Documentation
    description: "Handles chat messages via NATS"
```
### Resource Requests
```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads
resources:
  limits:
    amd.com/gpu: 1     # AMD
    nvidia.com/gpu: 1  # NVIDIA
```
### Health Checks
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```
---
## Flux/GitOps Conventions
### Kustomization Structure
```yaml
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
```
### HelmRelease Structure
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
```
### Secret References
```yaml
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
```
---
## NATS Subject Conventions
### Hierarchy
```
ai.{domain}.{scope}.{action}
Examples:
ai.chat.user.{userId}.message # User chat message
ai.chat.response.{requestId} # Chat response
ai.voice.user.{userId}.request # Voice request
ai.pipeline.trigger # Pipeline trigger
```
### Wildcards
```
ai.chat.> # All chat events
ai.chat.user.*.message # All user messages
ai.*.response.{id} # Any response type
```
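NATS matches `*` against exactly one token and `>` against one or more trailing tokens. A small matcher makes the semantics concrete (for illustration only; the NATS server does this for you):

```python
def nats_match(pattern: str, subject: str) -> bool:
    """Token-wise NATS matching: '*' = exactly one token, '>' = rest."""
    pt, st = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pt):
        if tok == ">":
            return len(st) > i  # '>' must cover at least one token
        if i >= len(st) or (tok != "*" and tok != st[i]):
            return False
    return len(pt) == len(st)
```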
---
## Git Conventions
### Commit Messages
```
type(scope): subject
body (optional)
footer (optional)
```
**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `style`: Formatting
- `refactor`: Code restructuring
- `test`: Tests
- `chore`: Maintenance
**Examples:**
```
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
```
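A commit subject following this convention can be checked mechanically, e.g. in a pre-commit hook; a sketch (type list taken from above, the regex itself is illustrative):

```python
import re

# type, optional (scope), then ": subject"
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)"
    r"(\([a-z0-9-]+\))?"
    r": .+"
)

def valid_subject(line: str) -> bool:
    return COMMIT_RE.match(line) is not None
```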
### Branch Naming
```
feature/short-description
fix/issue-number-description
docs/what-changed
```
---
## Configuration Conventions
### Environment Variables
```python
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="")  # no prefix

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"
```
### ConfigMaps
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
```
---
## Documentation Conventions
### ADR Format
See [decisions/0000-template.md](decisions/0000-template.md)
### Code Comments
```python
# Use docstrings for public functions
async def query_rag(query: str) -> List[Dict]:
    """Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
```
### README Files
Each application should have a README with:
1. Purpose
2. Configuration
3. Deployment
4. Local development
5. API documentation (if applicable)
---
## Anti-Patterns to Avoid
| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
---
## Related Documents
- [TECH-STACK.md](TECH-STACK.md) - Technologies used
- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
- [decisions/](decisions/) - Why we made certain choices

---

**CONTAINER-DIAGRAM.mmd** (new file, 123 lines)
%% C4 Container Diagram - Level 2
%% DaviesTechLabs Homelab AI/ML Platform
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
graph TB
subgraph users["Users"]
user["👤 User"]
end
subgraph ingress["Ingress Layer"]
cloudflared["cloudflared<br/>(Tunnel)"]
envoy["Envoy Gateway<br/>(HTTPRoute)"]
end
subgraph frontends["Frontend Applications"]
companions["Companions WebApp<br/>[Go + HTMX]<br/>AI Chat Interface"]
voice["Voice WebApp<br/>[Gradio]<br/>Voice Assistant UI"]
kubeflow_ui["Kubeflow UI<br/>[React]<br/>Pipeline Management"]
end
subgraph messaging["Message Bus"]
nats["NATS<br/>[JetStream]<br/>Event Streaming"]
end
subgraph handlers["NATS Handlers"]
chat_handler["Chat Handler<br/>[Python]<br/>RAG + LLM Orchestration"]
voice_handler["Voice Assistant<br/>[Python]<br/>STT → LLM → TTS"]
pipeline_bridge["Pipeline Bridge<br/>[Python]<br/>Workflow Triggers"]
end
subgraph ai_services["AI Services (KServe)"]
whisper["Whisper<br/>[faster-whisper]<br/>Speech-to-Text"]
xtts["XTTS<br/>[Coqui]<br/>Text-to-Speech"]
vllm["vLLM<br/>[ROCm]<br/>LLM Inference"]
embeddings["BGE Embeddings<br/>[sentence-transformers]<br/>Vector Encoding"]
reranker["BGE Reranker<br/>[sentence-transformers]<br/>Document Ranking"]
end
subgraph storage["Data Stores"]
milvus["Milvus<br/>[Vector DB]<br/>RAG Storage"]
valkey["Valkey<br/>[Redis API]<br/>Session Cache"]
postgres["CloudNative-PG<br/>[PostgreSQL]<br/>Metadata"]
minio["MinIO<br/>[S3 API]<br/>Object Storage"]
end
subgraph workflows["Workflow Engines"]
argo["Argo Workflows<br/>[DAG Engine]<br/>Complex Pipelines"]
kfp["Kubeflow Pipelines<br/>[ML Platform]<br/>Training + Inference"]
argo_events["Argo Events<br/>[Event Source]<br/>NATS → Workflow"]
end
subgraph mlops["MLOps"]
mlflow["MLflow<br/>[Tracking Server]<br/>Experiment Tracking"]
volcano["Volcano<br/>[Scheduler]<br/>GPU Scheduling"]
end
%% User flow
user --> cloudflared
cloudflared --> envoy
envoy --> companions
envoy --> voice
envoy --> kubeflow_ui
%% Frontend to NATS
companions --> |WebSocket| nats
voice --> |HTTP/WS| nats
%% NATS to handlers
nats --> chat_handler
nats --> voice_handler
nats --> pipeline_bridge
%% Handlers to AI services
chat_handler --> embeddings
chat_handler --> reranker
chat_handler --> vllm
chat_handler --> milvus
chat_handler --> valkey
voice_handler --> whisper
voice_handler --> embeddings
voice_handler --> reranker
voice_handler --> vllm
voice_handler --> xtts
%% Pipeline flow
pipeline_bridge --> argo_events
argo_events --> argo
argo_events --> kfp
kubeflow_ui --> kfp
%% Workflow to AI
argo --> ai_services
kfp --> ai_services
kfp --> mlflow
%% Storage connections
ai_services --> minio
milvus --> minio
kfp --> postgres
mlflow --> postgres
mlflow --> minio
%% GPU scheduling
volcano -.-> vllm
volcano -.-> whisper
volcano -.-> xtts
%% Styling
classDef frontend fill:#90EE90,stroke:#333
classDef handler fill:#87CEEB,stroke:#333
classDef ai fill:#FFB6C1,stroke:#333
classDef storage fill:#DDA0DD,stroke:#333
classDef workflow fill:#F0E68C,stroke:#333
classDef messaging fill:#FFA500,stroke:#333
class companions,voice,kubeflow_ui frontend
class chat_handler,voice_handler,pipeline_bridge handler
class whisper,xtts,vllm,embeddings,reranker ai
class milvus,valkey,postgres,minio storage
class argo,kfp,argo_events,mlflow,volcano workflow
class nats messaging

---

**CONTEXT-DIAGRAM.mmd** (new file, 69 lines)
%% C4 Context Diagram - Level 1
%% DaviesTechLabs Homelab System Context
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
graph TB
subgraph users["External Users"]
dev["👤 Developer<br/>(Billy)"]
family["👥 Family Members"]
agents["🤖 AI Agents"]
end
subgraph external["External Systems"]
cf["☁️ Cloudflare<br/>DNS + Tunnel"]
gh["🐙 GitHub<br/>Source Code"]
ghcr["📦 GHCR<br/>Container Registry"]
hf["🤗 Hugging Face<br/>Model Registry"]
end
subgraph homelab["🏠 DaviesTechLabs Homelab"]
direction TB
subgraph apps["Application Layer"]
companions["💬 Companions<br/>AI Chat"]
voice["🎤 Voice Assistant"]
media["🎬 Media Services<br/>(Jellyfin, *arr)"]
productivity["📝 Productivity<br/>(Nextcloud, Gitea)"]
end
subgraph platform["Platform Layer"]
k8s["☸️ Kubernetes Cluster<br/>Talos Linux"]
end
subgraph ai["AI/ML Layer"]
inference["🧠 Inference Services<br/>(vLLM, Whisper, XTTS)"]
workflows["⚙️ Workflow Engines<br/>(Kubeflow, Argo)"]
vectordb["📚 Vector Store<br/>(Milvus)"]
end
end
%% User interactions
dev --> |manages| productivity
dev --> |develops| k8s
family --> |uses| media
family --> |chats| companions
agents --> |queries| inference
%% External integrations
cf --> |routes traffic| apps
gh --> |GitOps sync| k8s
ghcr --> |pulls images| k8s
hf --> |downloads models| inference
%% Internal relationships
apps --> platform
ai --> platform
companions --> inference
voice --> inference
workflows --> inference
inference --> vectordb
%% Styling
classDef external fill:#f9f,stroke:#333,stroke-width:2px
classDef homelab fill:#bbf,stroke:#333,stroke-width:2px
classDef user fill:#bfb,stroke:#333,stroke-width:2px
class cf,gh,ghcr,hf external
class companions,voice,media,productivity,k8s,inference,workflows,vectordb homelab
class dev,family,agents user

---

**DOMAIN-MODEL.md** (new file, 345 lines)
# 📊 Domain Model
> **Core entities, bounded contexts, and relationships in the DaviesTechLabs homelab**
## Bounded Contexts
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ BOUNDED CONTEXTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ CHAT CONTEXT │ │ VOICE CONTEXT │ │ WORKFLOW CONTEXT │ │
│ ├───────────────────┤ ├───────────────────┤ ├───────────────────┤ │
│ │ • ChatSession │ │ • VoiceSession │ │ • Pipeline │ │
│ │ • ChatMessage │ │ • AudioChunk │ │ • PipelineRun │ │
│ │ • Conversation │ │ • Transcription │ │ • Artifact │ │
│ │ • User │ │ • SynthesizedAudio│ │ • Experiment │ │
│ └─────────┬─────────┘ └─────────┬─────────┘ └─────────┬─────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ INFERENCE CONTEXT │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ • InferenceRequest • Model • Embedding • Document • Chunk │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Core Entities
### User Context
```yaml
User:
  id: string (UUID)
  username: string
  premium: boolean
  preferences:
    voice_id: string
    model_preference: string
    enable_rag: boolean
  created_at: timestamp

Session:
  id: string (UUID)
  user_id: string
  type: "chat" | "voice"
  started_at: timestamp
  last_activity: timestamp
  metadata: object
```
### Chat Context
```yaml
ChatMessage:
  id: string (UUID)
  session_id: string
  user_id: string
  role: "user" | "assistant" | "system"
  content: string
  created_at: timestamp
  metadata:
    tokens_used: integer
    latency_ms: float
    rag_sources: string[]
    model_used: string

Conversation:
  id: string (UUID)
  user_id: string
  messages: ChatMessage[]
  title: string (auto-generated)
  created_at: timestamp
  updated_at: timestamp
```
### Voice Context
```yaml
VoiceRequest:
  id: string (UUID)
  user_id: string
  audio_b64: string (base64)
  format: "wav" | "webm" | "mp3"
  language: string
  premium: boolean
  enable_rag: boolean

VoiceResponse:
  id: string (UUID)
  request_id: string
  transcription: string
  response_text: string
  audio_b64: string (base64)
  audio_format: string
  latency_ms: float
  rag_docs_used: integer
```
### Inference Context
```yaml
InferenceRequest:
  id: string (UUID)
  service: "llm" | "stt" | "tts" | "embeddings" | "reranker"
  input: string | bytes
  parameters: object
  priority: "standard" | "premium"

InferenceResponse:
  id: string (UUID)
  request_id: string
  output: string | bytes | float[]
  metadata:
    model: string
    latency_ms: float
    tokens: integer (if applicable)
```
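A handler consuming these requests typically branches on the `service` discriminator. A minimal dispatch sketch follows; the registry and handler bodies are illustrative stand-ins, not the actual llm-workflows code:

```python
# Minimal routing sketch for InferenceRequest: dispatch on the `service`
# discriminator. The HANDLERS registry and its lambdas are hypothetical
# stand-ins for the real inference backends.
from dataclasses import dataclass, field

@dataclass
class InferenceRequest:
    id: str
    service: str                   # "llm" | "stt" | "tts" | "embeddings" | "reranker"
    input: object                  # str or bytes
    parameters: dict = field(default_factory=dict)
    priority: str = "standard"     # "standard" | "premium"

HANDLERS = {
    "llm": lambda req: f"completion for: {req.input}",
    "embeddings": lambda req: [0.0] * 1024,   # BGE-large returns 1024-dim vectors
}

def dispatch(req: InferenceRequest):
    try:
        return HANDLERS[req.service](req)
    except KeyError:
        raise ValueError(f"unknown service: {req.service}") from None

print(dispatch(InferenceRequest(id="r1", service="llm", input="hello")))
```

The same shape works whether the handlers call vLLM, Whisper, or the embedding service; only the registry entries change.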
### RAG Context
```yaml
Document:
id: string (UUID)
collection: string
title: string
content: string
source_url: string
ingested_at: timestamp
Chunk:
id: string (UUID)
document_id: string
content: string
embedding: float[1024] # BGE-large dimensions
metadata:
position: integer
page: integer
RAGQuery:
query: string
collection: string
top_k: integer (default: 5)
rerank: boolean (default: true)
rerank_top_k: integer (default: 3)
RAGResult:
chunks: Chunk[]
scores: float[]
reranked: boolean
```
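The query defaults above imply a two-stage flow: retrieve `top_k` candidates from the vector store, then rerank them and keep `rerank_top_k`. A stand-in sketch, where simple keyword overlap substitutes for both Milvus vector search and the BGE reranker:

```python
# Two-stage RAG query sketch: retrieve top_k, then rerank down to
# rerank_top_k. The scoring functions are stand-ins, not real
# Milvus/BGE calls.
def rag_query(query, chunks, top_k=5, rerank=True, rerank_top_k=3,
              retrieve_score=None, rerank_score=None):
    overlap = lambda q, c: len(set(q.split()) & set(c.split()))
    retrieve_score = retrieve_score or overlap
    rerank_score = rerank_score or overlap
    # Stage 1: vector retrieval (stand-in: keyword overlap)
    candidates = sorted(chunks, key=lambda c: retrieve_score(query, c),
                        reverse=True)[:top_k]
    if not rerank:
        return candidates
    # Stage 2: reranking narrows top_k down to rerank_top_k
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:rerank_top_k]

docs = ["nats jetstream streams", "gpu scheduling with volcano",
        "rag with milvus vector search", "talos linux nodes",
        "vector database milvus", "flux gitops"]
print(rag_query("milvus vector search", docs))
```

Swapping the stand-in scorers for a Milvus search and a cross-encoder rerank keeps the control flow identical.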
### Workflow Context
```yaml
Pipeline:
id: string
name: string
version: string
engine: "kubeflow" | "argo"
definition: object (YAML)
PipelineRun:
id: string (UUID)
pipeline_id: string
status: "pending" | "running" | "succeeded" | "failed"
started_at: timestamp
completed_at: timestamp
parameters: object
artifacts: Artifact[]
Artifact:
id: string (UUID)
run_id: string
name: string
type: "model" | "dataset" | "metrics" | "logs"
uri: string (s3://)
metadata: object
Experiment:
id: string (UUID)
name: string
runs: PipelineRun[]
metrics: object
created_at: timestamp
```
---
## Entity Relationships
```mermaid
erDiagram
USER ||--o{ SESSION : has
USER ||--o{ CONVERSATION : owns
SESSION ||--o{ CHAT_MESSAGE : contains
CONVERSATION ||--o{ CHAT_MESSAGE : contains
USER ||--o{ VOICE_REQUEST : makes
VOICE_REQUEST ||--|| VOICE_RESPONSE : produces
DOCUMENT ||--o{ CHUNK : contains
CHUNK ||--|| EMBEDDING : has
PIPELINE ||--o{ PIPELINE_RUN : executed_as
PIPELINE_RUN ||--o{ ARTIFACT : produces
EXPERIMENT ||--o{ PIPELINE_RUN : tracks
INFERENCE_REQUEST ||--|| INFERENCE_RESPONSE : produces
```
---
## Aggregate Roots
| Aggregate | Root Entity | Child Entities |
|-----------|-------------|----------------|
| Chat | Conversation | ChatMessage |
| Voice | VoiceRequest | VoiceResponse |
| RAG | Document | Chunk, Embedding |
| Workflow | PipelineRun | Artifact |
| User | User | Session, Preferences |
---
## Event Flow
### Chat Event Stream
```
UserLogin
└─► SessionCreated
└─► MessageReceived
├─► RAGQueryExecuted (optional)
├─► InferenceRequested
└─► ResponseGenerated
└─► MessageStored
```
### Voice Event Stream
```
VoiceRequestReceived
└─► TranscriptionStarted
└─► TranscriptionCompleted
└─► RAGQueryExecuted (optional)
└─► LLMInferenceStarted
└─► LLMResponseGenerated
└─► TTSSynthesisStarted
└─► AudioResponseReady
```
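The chain above, collapsed into one synchronous function for illustration. Every stage is a hypothetical stand-in for the real service (Whisper STT, Milvus retrieval, vLLM, XTTS); only the event names are taken from the stream:

```python
# Voice pipeline sketch emitting the events from the stream above.
# All stage bodies are stand-ins for the real inference services.
def handle_voice_request(audio_b64: str, enable_rag: bool = False) -> dict:
    events = []
    emit = events.append

    emit("VoiceRequestReceived")
    emit("TranscriptionStarted")
    transcription = f"transcript ({len(audio_b64)} b64 chars)"   # Whisper stand-in
    emit("TranscriptionCompleted")
    context = []
    if enable_rag:
        context = ["retrieved chunk"]                            # Milvus + reranker stand-in
        emit("RAGQueryExecuted")
    emit("LLMInferenceStarted")
    response_text = f"reply to: {transcription}"                 # vLLM stand-in
    emit("LLMResponseGenerated")
    emit("TTSSynthesisStarted")
    audio_out = "c3R1Yg=="                                       # XTTS stand-in (base64)
    emit("AudioResponseReady")
    return {"transcription": transcription, "response_text": response_text,
            "audio_b64": audio_out, "rag_docs_used": len(context),
            "events": events}

result = handle_voice_request("ZmFrZQ==", enable_rag=True)
print(result["events"])
```

In the real system each stage is a separate NATS handler, so the "events" here correspond to subjects rather than list appends.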
### Workflow Event Stream
```
PipelineTriggerReceived
└─► PipelineRunCreated
└─► StepStarted (repeated)
└─► StepCompleted (repeated)
└─► ArtifactProduced (repeated)
└─► PipelineRunCompleted
```
---
## Data Retention
| Entity | Retention | Storage |
|--------|-----------|---------|
| ChatMessage | 30 days | JetStream → PostgreSQL |
| VoiceRequest/Response | 1 hour (audio), 30 days (text) | JetStream → PostgreSQL |
| Chunk/Embedding | Permanent | Milvus |
| PipelineRun | Permanent | PostgreSQL |
| Artifact | Permanent | MinIO |
| Session | 7 days | Valkey |
---
## Invariants
### Chat Context
- A ChatMessage must belong to exactly one Conversation
- A Conversation must have at least one ChatMessage
- Messages are immutable once created
### Voice Context
- VoiceResponse must have corresponding VoiceRequest
- Audio format must be one of: wav, webm, mp3
- Transcription cannot be empty for valid audio
### RAG Context
- Chunk must belong to exactly one Document
- Embedding dimensions must match model (1024 for BGE-large)
- Document must have at least one Chunk
### Workflow Context
- PipelineRun must reference valid Pipeline
- Artifacts must have valid S3 URIs
- Run status transitions: pending → running → (succeeded|failed)
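The status-transition invariant can be enforced with a small table-driven check; a minimal sketch (not taken from the repositories):

```python
# Table-driven enforcement of the PipelineRun status invariant:
# pending -> running -> (succeeded | failed), with terminal states.
VALID_TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),   # terminal
    "failed": set(),      # terminal
}

def transition(current: str, new: str) -> str:
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new

status = transition("pending", "running")
status = transition(status, "succeeded")
print(status)  # succeeded
```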
---
## Value Objects
```python
# Immutable value objects
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageContent:
text: str
tokens: int
@dataclass(frozen=True)
class AudioData:
data: bytes
format: str
duration_ms: int
sample_rate: int
@dataclass(frozen=True)
class EmbeddingVector:
values: tuple[float, ...]
model: str
dimensions: int
@dataclass(frozen=True)
class RAGContext:
chunks: tuple[str, ...]
scores: tuple[float, ...]
query: str
```
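A quick usage note on why `frozen=True` matters: mutation attempts raise `dataclasses.FrozenInstanceError`, so a value object cannot be altered after it crosses a handler boundary. A self-contained sketch (`MessageContent` is redefined here so the snippet runs on its own):

```python
# Frozen dataclasses reject mutation, which is what makes these value
# objects safe to pass between handlers.
import dataclasses

@dataclasses.dataclass(frozen=True)
class MessageContent:
    text: str
    tokens: int

msg = MessageContent(text="hello", tokens=2)
try:
    msg.text = "tampered"
except dataclasses.FrozenInstanceError:
    print("immutable")  # mutation is rejected, msg is unchanged
```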
---
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
- [GLOSSARY.md](GLOSSARY.md) - Term definitions
- [decisions/0004-use-messagepack-for-nats.md](decisions/0004-use-messagepack-for-nats.md) - Message format decision

GLOSSARY.md Normal file
@@ -0,0 +1,242 @@
# 📖 Glossary
> **Terminology and abbreviations used in the DaviesTechLabs homelab**
## A
**ADR (Architecture Decision Record)**
: A document that captures an important architectural decision, including context, decision, and consequences.
**Argo Events**
: Event-driven automation for Kubernetes that triggers workflows based on events from various sources.
**Argo Workflows**
: A container-native workflow engine for orchestrating parallel jobs on Kubernetes.
**Authentik**
: Self-hosted identity provider supporting SAML, OIDC, and other protocols.
## B
**BGE (BAAI General Embedding)**
: A family of embedding models from BAAI used for semantic search and RAG.
**Bounded Context**
: A DDD concept defining a boundary within which a particular domain model applies.
## C
**C4 Model**
: A hierarchical approach to software architecture diagrams: Context, Container, Component, Code.
**Cilium**
: eBPF-based networking, security, and observability for Kubernetes.
**CloudNative-PG**
: Kubernetes operator for PostgreSQL databases.
**CNI (Container Network Interface)**
: Standard for configuring network interfaces in Linux containers.
## D
**DDD (Domain-Driven Design)**
: Software design approach focusing on the core domain and domain logic.
## E
**Embedding**
: A vector representation of text, used for semantic similarity and search.
**Envoy Gateway**
: Kubernetes Gateway API implementation using Envoy proxy.
**External Secrets Operator (ESO)**
: Kubernetes operator that syncs secrets from external stores (Vault, etc.).
## F
**Falco**
: Runtime security tool that detects anomalous activity in containers.
**Flux CD**
: GitOps toolkit for Kubernetes, continuously reconciling cluster state with Git.
## G
**GitOps**
: Operational practice using Git as the single source of truth for declarative infrastructure.
**GPU Device Plugin**
: Kubernetes plugin that exposes GPU resources to containers.
## H
**HelmRelease**
: Flux CRD for managing Helm chart releases declaratively.
**HTTPRoute**
: Kubernetes Gateway API resource for HTTP routing rules.
## I
**InferenceService**
: KServe CRD for deploying ML models with autoscaling and traffic management.
## J
**JetStream**
: NATS persistence layer providing streaming, key-value, and object stores.
## K
**KServe**
: Kubernetes-native platform for deploying and serving ML models.
**Kubeflow**
: ML toolkit for Kubernetes, including pipelines, training operators, and more.
**Kustomization**
: Flux CRD for applying Kustomize overlays from Git sources.
## L
**LLM (Large Language Model)**
: AI model trained on vast text data, capable of generating human-like text.
**Longhorn**
: Cloud-native distributed storage for Kubernetes.
## M
**MessagePack (msgpack)**
: Binary serialization format, more compact than JSON.
**Milvus**
: Open-source vector database for similarity search and AI applications.
**MLflow**
: Platform for managing the ML lifecycle: experiments, models, deployment.
**MinIO**
: S3-compatible object storage.
## N
**NATS**
: Cloud-native messaging system for microservices, IoT, and serverless.
**Node Feature Discovery (NFD)**
: Kubernetes add-on for detecting hardware features on nodes.
## P
**Pipeline**
: In ML context, a DAG of components that process data and train/serve models.
**Premium User**
: User tier with enhanced features (more RAG docs, priority routing).
## R
**RAG (Retrieval-Augmented Generation)**
: AI technique combining document retrieval with LLM generation for grounded responses.
**Reranker**
: Model that rescores retrieved documents based on relevance to a query.
**ROCm**
: AMD's open-source GPU computing platform (alternative to CUDA).
## S
**Schematic**
: Talos Linux concept for defining system extensions and configurations.
**SOPS (Secrets OPerationS)**
: Tool for encrypting secrets in Git repositories.
**STT (Speech-to-Text)**
: Converting spoken audio to text (e.g., Whisper).
**Strix Halo**
: AMD APU platform (Ryzen AI Max series) whose unified memory lets the GPU address a large share of system RAM.
## T
**Talos Linux**
: Minimal, immutable Linux distribution designed specifically for Kubernetes.
**TTS (Text-to-Speech)**
: Converting text to spoken audio (e.g., XTTS/Coqui).
## V
**Valkey**
: Redis-compatible in-memory data store (Redis fork).
**vLLM**
: High-throughput LLM serving engine with PagedAttention.
**VIP (Virtual IP)**
: IP address shared among multiple hosts for high availability.
**Volcano**
: Kubernetes batch scheduler for high-performance workloads (ML, HPC).
**VRM**
: File format for 3D humanoid avatars.
## W
**Whisper**
: OpenAI's speech recognition model.
## X
**XTTS**
: Coqui's multi-language text-to-speech model with voice cloning.
---
## Acronyms Quick Reference
| Acronym | Full Form |
|---------|-----------|
| ADR | Architecture Decision Record |
| API | Application Programming Interface |
| BGE | BAAI General Embedding |
| CI/CD | Continuous Integration/Continuous Deployment |
| CRD | Custom Resource Definition |
| DAG | Directed Acyclic Graph |
| DDD | Domain-Driven Design |
| ESO | External Secrets Operator |
| GPU | Graphics Processing Unit |
| HA | High Availability |
| HPA | Horizontal Pod Autoscaler |
| LLM | Large Language Model |
| ML | Machine Learning |
| NATS | Neural Autonomic Transport System (historical; now used as a name) |
| NFD | Node Feature Discovery |
| OIDC | OpenID Connect |
| RAG | Retrieval-Augmented Generation |
| RBAC | Role-Based Access Control |
| ROCm | Radeon Open Compute |
| S3 | Simple Storage Service |
| SAML | Security Assertion Markup Language |
| SOPS | Secrets OPerationS |
| SSO | Single Sign-On |
| STT | Speech-to-Text |
| TLS | Transport Layer Security |
| TTS | Text-to-Speech |
| UUID | Universally Unique Identifier |
| VIP | Virtual IP |
| VRAM | Video Random Access Memory |
---
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - System overview
- [TECH-STACK.md](TECH-STACK.md) - Technology details
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Entity definitions

README.md
@@ -1,3 +1,105 @@
# 🏠 DaviesTechLabs Homelab Architecture
> **Production-grade AI/ML platform running on bare-metal Kubernetes**
[![Talos](https://img.shields.io/badge/Talos-v1.12.1-blue?logo=linux)](https://talos.dev)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-v1.35.0-326CE5?logo=kubernetes)](https://kubernetes.io)
[![Flux](https://img.shields.io/badge/GitOps-Flux-blue?logo=flux)](https://fluxcd.io)
[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
## 📖 Quick Navigation
| Document | Purpose |
|----------|---------|
| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
| [decisions/](decisions/) | Architecture Decision Records (ADRs) |
## 🎯 What This Is
A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:
- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
- **GitOps**: Flux CD with SOPS encryption
- **Event-Driven**: NATS JetStream for real-time messaging
- **ML Workflows**: Kubeflow Pipelines + Argo Workflows
## 🖥️ Cluster Overview
| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |
## 🚀 Quick Start
### View Current Cluster State
```bash
# Get node status
kubectl get nodes -o wide
# View AI/ML workloads
kubectl get pods -n ai-ml
# Check KServe inference services
kubectl get inferenceservices -n ai-ml
```
### Key Endpoints
| Service | URL | Purpose |
|---------|-----|---------|
| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
| Gitea | `git.daviestechlabs.io` | Self-hosted Git |
## 📂 Repository Structure
```
homelab-design/
├── README.md # This file
├── AGENT-ONBOARDING.md # AI agent quick-start
├── ARCHITECTURE.md # High-level system overview
├── CONTEXT-DIAGRAM.mmd # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd # C4 Level 2
├── TECH-STACK.md # Complete tech stack
├── DOMAIN-MODEL.md # Core entities
├── CODING-CONVENTIONS.md # Patterns & practices
├── GLOSSARY.md # Terminology
├── decisions/ # ADRs
│ ├── 0000-template.md
│ ├── 0001-record-architecture-decisions.md
│ ├── 0002-use-talos-linux.md
│ └── ...
├── specs/ # Feature specifications
└── diagrams/ # Additional diagrams
```
## 🔗 Related Repositories
| Repository | Purpose |
|------------|---------|
| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |
## 📝 Contributing
1. For architecture changes, create an ADR in `decisions/`
2. Update relevant documentation
3. Submit a PR with context
---
*Last updated: 2026-02-01*

TECH-STACK.md Normal file
@@ -0,0 +1,271 @@
# 🛠️ Technology Stack
> **Complete inventory of technologies used in the DaviesTechLabs homelab**
## Platform Layer
### Operating System
| Component | Version | Purpose |
|-----------|---------|---------|
| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |
### Container Orchestration
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |
### GitOps
| Component | Version | Purpose |
|-----------|---------|---------|
| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |
---
## AI/ML Layer
### Inference Engines
| Service | Framework | GPU | Model Type |
|---------|-----------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |
### ML Serving
| Component | Version | Purpose |
|-----------|---------|---------|
| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
### ML Workflows
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |
### GPU Scheduling
| Component | Version | Purpose |
|-----------|---------|---------|
| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |
---
## Data Layer
### Databases
| Component | Version | Purpose |
|-----------|---------|---------|
| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata |
| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |
### Object Storage
| Component | Version | Purpose |
|-----------|---------|---------|
| [MinIO](https://min.io) | Latest | S3-compatible storage |
| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
| NFS CSI Driver | Latest | Shared filesystem |
### Messaging
| Component | Version | Purpose |
|-----------|---------|---------|
| [NATS](https://nats.io) | Latest | Message bus |
| NATS JetStream | Built-in | Persistent streaming |
### Data Processing
| Component | Version | Purpose |
|-----------|---------|---------|
| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
| [Nessie](https://projectnessie.org) | Latest | Data catalog |
| [Trino](https://trino.io) | 479 | SQL query engine |
---
## Application Layer
### Web Frameworks
| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
| Companions | Go | net/http + HTMX | AI chat interface |
| Voice WebApp | Python | Gradio | Voice assistant UI |
| Various handlers | Python | asyncio + nats.py | NATS event handlers |
### Frontend
| Technology | Purpose |
|------------|---------|
| [HTMX](https://htmx.org) | Dynamic HTML updates |
| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
| [VRM](https://vrm.dev) | 3D avatar rendering |
---
## Networking Layer
### Ingress
| Component | Version | Purpose |
|-----------|---------|---------|
| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |
### DNS & Certificates
| Component | Version | Purpose |
|-----------|---------|---------|
| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |
### Image Distribution
| Component | Purpose |
|-----------|---------|
| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |
---
## Security Layer
### Identity & Access
| Component | Version | Purpose |
|-----------|---------|---------|
| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |
### Runtime Security
| Component | Version | Purpose |
|-----------|---------|---------|
| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
| Cilium Network Policies | Built-in | Network segmentation |
### Backup
| Component | Version | Purpose |
|-----------|---------|---------|
| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |
---
## Observability Layer
### Metrics
| Component | Purpose |
|-----------|---------|
| [Prometheus](https://prometheus.io) | Metrics collection |
| [Grafana](https://grafana.com) | Dashboards & visualization |
### Logging
| Component | Version | Purpose |
|-----------|---------|---------|
| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |
### Tracing
| Component | Purpose |
|-----------|---------|
| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
| Tempo/Jaeger | Trace storage & query |
---
## Development Tools
### Local Development
| Tool | Purpose |
|------|---------|
| [mise](https://mise.jdx.dev) | Tool version management |
| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |
### CI/CD
| Tool | Purpose |
|------|---------|
| GitHub Actions | CI/CD pipelines |
| [Renovate](https://renovatebot.com) | Dependency updates |
### Image Building
| Tool | Purpose |
|------|---------|
| Docker | Container builds |
| GHCR | Container registry |
---
## Media & Entertainment
| Component | Version | Purpose |
|-----------|---------|---------|
| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
| Prowlarr, Bazarr, etc. | Various | *arr stack |
| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |
---
## Python Dependencies (llm-workflows)
```toml
[project]
dependencies = [
    # Core
    "nats-py>=2.7.0",          # NATS client
    "msgpack>=1.0.0",          # Binary serialization
    "aiohttp>=3.9.0",          # HTTP client
    # ML/AI
    "pymilvus>=2.4.0",         # Milvus client
    "sentence-transformers",   # Embeddings
    "openai>=1.0.0",           # vLLM OpenAI-compatible API
    # Kubeflow
    "kfp>=2.12.1",             # Pipeline SDK
]
```
---
## Version Pinning Strategy
| Component Type | Strategy |
|----------------|----------|
| Base images | Pin major.minor |
| Helm charts | Pin exact version |
| Python packages | Pin minimum version |
| System extensions | Pin via Talos schematic |
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
- [decisions/](decisions/) - Why we chose specific technologies


@@ -0,0 +1,71 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
## Decision Drivers
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].
### Positive Consequences
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
*
### Negative Consequences
* [e.g., compromising quality attribute, follow-up decisions required, …]
*
## Pros and Cons of the Options
### [option 1]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->


@@ -0,0 +1,79 @@
# Record Architecture Decisions
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation
## Context and Problem Statement
As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.
## Decision Drivers
* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements
## Considered Options
* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)
## Decision Outcome
Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.
### Positive Consequences
* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions
### Negative Consequences
* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions
## Pros and Cons of the Options
### Informal README documentation
* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose
### Wiki pages
* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure
### ADRs
* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions
### No documentation
* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes
## Links
* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)


@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster
## Context and Problem Statement
We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).
## Decision Drivers
* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades
## Considered Options
* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2
## Decision Outcome
Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
### Positive Consequences
* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption
### Negative Consequences
* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
## Pros and Cons of the Options
### Ubuntu Server with kubeadm
* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management
### Flatcar Container Linux
* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex
### Talos Linux
* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging
### k3OS
* Good, because simple
* Bad, because discontinued
### Rocky Linux with RKE2
* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface
## Links
* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics


@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration
## Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
## Decision Drivers
* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)
## Considered Options
* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar
## Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
### Positive Consequences
* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint
### Negative Consequences
* Less ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ
## Pros and Cons of the Options
### Apache Kafka
* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages
### RabbitMQ
* Good, because mature and stable
* Good, because flexible routing
* Good, because good management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering
### Redis Pub/Sub + Streams
* Good, because simple
* Good, because a Redis-compatible store (Valkey) is already deployed
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not primary purpose of Redis
### NATS with JetStream
* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream newer than Kafka
### Apache Pulsar
* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
## Links
* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format

# Use MessagePack for NATS Messages
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages
## Context and Problem Statement
NATS messages in the AI platform carry various payloads:
- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters
We need a serialization format that handles both text and binary efficiently.
## Decision Drivers
* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Debugging ability
* Schema flexibility
## Considered Options
* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro
## Decision Outcome
Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.
### Positive Consequences
* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries
### Negative Consequences
* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON
## Pros and Cons of the Options
### JSON
* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
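The 33% base64 overhead claimed above is easy to confirm with the standard library: base64 maps every 3 raw bytes to 4 ASCII characters.

```python
import base64
import os

audio = os.urandom(48_000)          # stand-in for a short audio clip
encoded = base64.b64encode(audio)   # what a JSON payload would have to carry

overhead = len(encoded) / len(audio) - 1
print(f"{overhead:.0%}")  # → 33%
```

MessagePack avoids this entirely by carrying the bytes as a raw `bin` field.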
### Protocol Buffers
* Good, because very compact
* Good, because fast
* Good, because schema validation
* Good, because cross-language
* Bad, because requires schema definition
* Bad, because compilation step
* Bad, because less flexible for evolving schemas
* Bad, because overkill for simple messages
### MessagePack
* Good, because binary-efficient
* Good, because JSON-like simplicity
* Good, because no schema required
* Good, because excellent library support
* Good, because can include raw bytes
* Bad, because not human-readable raw
* Bad, because no schema validation
### CBOR
* Good, because binary-efficient
* Good, because IETF standard
* Good, because schema-less
* Bad, because less common libraries
* Bad, because smaller community
* Bad, because similar to msgpack with less adoption
### Avro
* Good, because schema evolution
* Good, because compact
* Good, because schema registry integration
* Bad, because requires schema
* Bad, because more complex setup
* Bad, because Java-centric ecosystem
## Implementation Notes
```python
# Python usage
import msgpack

# Serialize
data = {
    "user_id": "user-123",
    "audio": audio_bytes,  # Raw bytes, no base64
    "premium": True,
}
payload = msgpack.packb(data)

# Deserialize
data = msgpack.unpackb(payload, raw=False)
```
```go
// Go usage
import "github.com/vmihailenco/msgpack/v5"

type Message struct {
    UserID string `msgpack:"user_id"`
    Audio  []byte `msgpack:"audio"`
}
```
## Links
* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)

# Multi-GPU Heterogeneous Strategy
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity
## Considered Options
* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
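For deployment scripting, the allocation table above can be captured as a plain lookup. A sketch (node names are from this ADR; the resource identifiers are the usual device-plugin names, and the Intel one in particular is an assumption, not taken from the cluster):

```python
# Workload -> node/GPU mapping, mirroring the allocation table above.
GPU_ALLOCATION = {
    "vllm":       {"node": "khelben",   "resource": "amd.com/gpu"},
    "whisper":    {"node": "elminster", "resource": "nvidia.com/gpu"},
    "xtts":       {"node": "elminster", "resource": "nvidia.com/gpu"},
    "embeddings": {"node": "drizzt",    "resource": "amd.com/gpu"},
    "reranker":   {"node": "danilo",    "resource": "gpu.intel.com/i915"},  # assumed name
}

def node_for(workload: str) -> str:
    """Return the node a workload is pinned to."""
    return GPU_ALLOCATION[workload]["node"]
```

A helper like this keeps manifests and automation consistent with the ADR when nodes change.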
### Positive Consequences
* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging
### Negative Consequences
* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
## Implementation
### Node Taints
```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity
```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits
```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
## Pros and Cons of the Options
### Single GPU vendor only
* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware
### All workloads on largest GPU
* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)
* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks
### Dynamic GPU scheduling
* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature
## Links
* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics

# GitOps with Flux CD
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management
## Context and Problem Statement
Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.
## Decision Drivers
* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance
## Considered Options
* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes
## Decision Outcome
Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.
### Positive Consequences
* Git is single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Webhook-free sync (pull-based)
### Negative Consequences
* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers
## Configuration
### Repository Structure
```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                    # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml     # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/            # HelmRepositories
│   │       └── git/             # GitRepositories
│   └── apps/                    # Application Kustomizations
```
### Multi-Repository Sync
```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```
### SOPS Integration
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    # Age public key
    age: >-
      age1...
```
## Pros and Cons of the Options
### Manual kubectl apply
* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible
### ArgoCD
* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins
### Flux CD
* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve
### Rancher Fleet
* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community
### Pulumi/Terraform
* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because not continuous reconciliation
## Links
* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing

# Use KServe for ML Model Serving
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
## Considered Options
* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
### Positive Consequences
* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
### Negative Consequences
* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration
### KServe InferenceService
* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because Knative optional dependency
### Seldon Core
* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage
### BentoML
* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community
### Ray Serve
* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead
## Current Configuration
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
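Against a service like the one above, a V2-protocol call is a POST to `/v2/models/{name}/infer` with a standard request body. A payload-construction sketch (the host, tensor name, and shape are illustrative assumptions, not taken from the actual Whisper deployment):

```python
import json

def v2_infer_payload(input_name: str, data: list) -> str:
    """Build a KServe V2 (Open Inference Protocol) request body."""
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(data)],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }
    return json.dumps(body)

# POST to http://whisper.ai-ml/v2/models/whisper/infer (hypothetical URL)
payload = v2_infer_payload("audio", [0.0, 0.1, 0.2])
```

Because every InferenceService speaks the same protocol, one client helper covers all of the deployed models.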
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation

# Use Milvus for Vector Storage
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting vector database for RAG system
## Context and Problem Statement
The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
## Decision Drivers
* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities
## Considered Options
* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma
## Decision Outcome
Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.
### Positive Consequences
* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Active CNCF sandbox project
* GPU acceleration available
### Negative Consequences
* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization
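Conceptually, what Milvus provides is top-k nearest-neighbour search over embeddings. A brute-force sketch in plain Python shows the operation; Milvus replaces the linear scan with ANN indexes (HNSW, IVF) to hit the < 100ms target at millions of vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=5):
    """Brute-force top-k similarity search (O(n) per query)."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return order[:k]
```

The returned indices map back to chunk IDs, which is exactly the shape of the RAG retrieval step in the chat and voice flows.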
## Pros and Cons of the Options
### Milvus
* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because CNCF project
* Bad, because complex architecture
* Bad, because higher resource usage
### Pinecone
* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns
### Qdrant
* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because good filtering
* Bad, because smaller community
* Bad, because fewer enterprise features
### Weaviate
* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements
### pgvector
* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scale
* Bad, because slower for large datasets
* Bad, because no specialized optimizations
### Chroma
* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features
## Links
* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities

# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we use one engine or leverage strengths of multiple?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
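In automation scripts the matrix above reduces to a trivial dispatch helper. A sketch (the use-case keys are mine, derived from this table; they are not identifiers used by either engine):

```python
# Engine choice per use case, mirroring the decision matrix above.
ENGINE_BY_USE_CASE = {
    "ml_training": "kubeflow",
    "model_evaluation": "kubeflow",
    "document_ingestion": "argo",
    "batch_inference": "argo",
    "complex_dag": "argo",
}

def pick_engine(use_case: str) -> str:
    # Hybrid ML training uses both: Argo orchestrates, KFP runs the ML steps.
    return ENGINE_BY_USE_CASE.get(use_case, "argo")
```

Defaulting to Argo keeps new workflow types on the general-purpose engine until they prove they need ML-specific features.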
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

        OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)

# Use Envoy Gateway for Ingress
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting ingress controller for cluster
## Context and Problem Statement
We need an ingress solution that supports:
- Gateway API (modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination
## Decision Drivers
* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at edge
* Active development
* Envoy ecosystem familiarity
## Considered Options
* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour
## Decision Outcome
Chosen option: "Envoy Gateway", because it's the reference implementation of Gateway API with full Envoy feature set.
### Positive Consequences
* Native Gateway API support
* Full Envoy feature set
* WebSocket and gRPC native
* No Istio complexity
* CNCF graduated project (Envoy)
* Easy integration with observability
### Negative Consequences
* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve
## Pros and Cons of the Options
### NGINX Ingress
* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API support
* Bad, because commercial features gated
### Traefik
* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt
* Bad, because Gateway API support is experimental
* Bad, because less gRPC focus
### Envoy Gateway
* Good, because Gateway API native
* Good, because full Envoy features
* Good, because extensible
* Good, because gRPC/WebSocket native
* Bad, because newer project
* Bad, because less community content
### Istio Gateway
* Good, because full mesh features
* Good, because Gateway API
* Bad, because overkill without mesh
* Bad, because resource heavy
### Contour
* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API support still evolving
* Bad, because smaller community
## Configuration Example
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```
## Links
* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)

diagrams/README.md
# Diagrams
This directory contains additional architecture diagrams beyond the main C4 diagrams.
## Available Diagrams
| File | Description |
|------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |
## Rendering Diagrams
### VS Code
Install the "Markdown Preview Mermaid Support" extension.
### CLI
```bash
# Using mmdc (Mermaid CLI)
npx -p @mermaid-js/mermaid-cli mmdc -i diagram.mmd -o diagram.png
```
### Online
Use [Mermaid Live Editor](https://mermaid.live)
## Diagram Conventions
1. Use `.mmd` extension for Mermaid diagrams
2. Include title as comment at top of file
3. Use consistent styling classes
4. Keep diagrams focused (one concept per diagram)

%% Chat Request Data Flow
%% Sequence diagram showing chat message processing
sequenceDiagram
autonumber
participant U as User
participant W as WebApp<br/>(companions)
participant N as NATS
participant C as Chat Handler
participant V as Valkey<br/>(Cache)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver message
C->>V: Get session history
V-->>C: Previous messages
alt RAG Enabled
C->>E: Generate query embedding
E-->>C: Query vector
C->>M: Search similar chunks
M-->>C: Top-K chunks
opt Reranker Enabled
C->>R: Rerank chunks
R-->>C: Reordered chunks
end
end
C->>L: LLM inference (context + query)
alt Streaming Enabled
loop For each token
L-->>C: Token
C->>N: Publish ai.chat.response.stream.{id}
N-->>W: Deliver chunk
W-->>U: Display token
end
else Non-streaming
L-->>C: Full response
C->>N: Publish ai.chat.response.{id}
N-->>W: Deliver response
W-->>U: Display response
end
C->>V: Save to session history

%% Voice Request Data Flow
%% Sequence diagram showing voice assistant processing
sequenceDiagram
autonumber
participant U as User
participant W as Voice WebApp
participant N as NATS
participant VA as Voice Assistant
participant STT as Whisper<br/>(STT)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
participant TTS as XTTS<br/>(TTS)
U->>W: Record audio
W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes)
N->>VA: Deliver voice request
VA->>STT: Transcribe audio
STT-->>VA: Transcription text
alt RAG Enabled
VA->>E: Generate query embedding
E-->>VA: Query vector
VA->>M: Search similar chunks
M-->>VA: Top-K chunks
opt Reranker Enabled
VA->>R: Rerank chunks
R-->>VA: Reordered chunks
end
end
VA->>L: LLM inference
L-->>VA: Response text
VA->>TTS: Synthesize speech
TTS-->>VA: Audio bytes
VA->>N: Publish ai.voice.response.{id}<br/>(text + audio)
N-->>W: Deliver response
W-->>U: Play audio + show text
Note over VA,TTS: Total latency target: < 3s

%% GPU Allocation Diagram
%% Shows how AI workloads are distributed across GPU nodes
flowchart TB
subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
direction TB
vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
end
subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
direction TB
whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
end
subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
direction TB
embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
end
subgraph danilo["🖥️ danilo (Intel Arc)"]
direction TB
reranker["📋 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
end
subgraph workloads["Workload Routing"]
chat["💬 Chat Request"]
voice["🎤 Voice Request"]
end
chat --> embeddings
chat --> reranker
chat --> vllm
voice --> whisper
voice --> embeddings
voice --> reranker
voice --> vllm
voice --> xtts
classDef nvidia fill:#76B900,color:white
classDef amd fill:#ED1C24,color:white
classDef intel fill:#0071C5,color:white
class whisper,xtts nvidia
class vllm,embeddings amd
class reranker intel

# Binary Messages and JetStream Configuration
> Technical specification for NATS message handling in the AI platform
## Overview
The AI platform uses NATS with JetStream for message persistence. All messages use MessagePack (msgpack) binary format for efficiency, especially when handling audio data.
## Message Format
### Why MessagePack?
1. **Binary efficiency**: Audio data embedded directly without base64 overhead
2. **Compact**: 20-50% smaller than equivalent JSON
3. **Fast**: Lower serialization/deserialization overhead
4. **Compatible**: JSON-like structure, easy debugging
### Schema
All messages follow this general structure:
```python
{
    "request_id": str,   # UUID for correlation
    "user_id": str,      # User identifier
    "timestamp": float,  # Unix timestamp
    "payload": Any,      # Type-specific data
    "metadata": dict,    # Optional metadata
}
```
### Chat Message
```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "username": "john_doe",
    "message": "Hello, how are you?",
    "premium": False,
    "enable_streaming": True,
    "enable_rag": True,
    "enable_reranker": True,
    "top_k": 5,
    "session_id": "session-abc",
}
```
### Voice Message
```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "audio": b"...",  # Raw bytes, not base64!
    "format": "wav",
    "sample_rate": 16000,
    "premium": False,
    "enable_rag": True,
    "language": "en",
}
```
### Streaming Response Chunk
```python
{
    "request_id": "uuid-here",
    "type": "chunk",  # "chunk", "done", "error"
    "content": "token",
    "done": False,
    "timestamp": 1706000000.0,
}
```
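On the consumer side, a chunk stream in this shape is typically folded back into a single response. A minimal reassembly sketch (transport and msgpack decoding omitted):

```python
def assemble(chunks):
    """Collect streaming chunks until a terminal marker arrives."""
    parts = []
    for chunk in chunks:
        if chunk["type"] == "error":
            raise RuntimeError(chunk.get("content", "stream error"))
        if chunk["type"] == "done" or chunk["done"]:
            break
        parts.append(chunk["content"])
    return "".join(parts)
```

Checking both `type` and `done` keeps the consumer robust if a producer sets only one of the two terminal signals.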
## JetStream Configuration
### Streams
| Stream | Subjects | Retention | Max Age | Storage | Replicas |
|--------|----------|-----------|---------|---------|----------|
| `COMPANIONS_LOGINS` | `ai.chat.user.*.login` | Limits | 7 days | File | 1 |
| `COMPANIONS_CHAT` | `ai.chat.user.*.message`, `ai.chat.user.*.greeting.*` | Limits | 30 days | File | 1 |
| `AI_CHAT_STREAM` | `ai.chat.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_VOICE_STREAM` | `ai.voice.>` | Limits | 1 hour | File | 1 |
| `AI_VOICE_RESPONSE_STREAM` | `ai.voice.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_PIPELINE` | `ai.pipeline.>` | Limits | 24 hours | File | 1 |
### Consumer Configuration
```yaml
# Durable consumer for chat handler
consumer:
  name: chat-handler
  durable_name: chat-handler
  filter_subjects:
    - "ai.chat.user.*.message"
  ack_policy: explicit
  ack_wait: 30s
  max_deliver: 3
  deliver_policy: new
```
### Stream Creation (CLI)
```bash
# Create chat stream
nats stream add COMPANIONS_CHAT \
  --subjects "ai.chat.user.*.message,ai.chat.user.*.greeting.*" \
  --retention limits \
  --max-age 30d \
  --storage file \
  --replicas 1

# Create ephemeral stream
nats stream add AI_CHAT_STREAM \
  --subjects "ai.chat.response.stream.>" \
  --retention limits \
  --max-age 5m \
  --storage memory \
  --replicas 1
```
## Python Implementation
### Publisher
```python
import uuid
from datetime import datetime, timezone

import msgpack
import nats

async def publish_chat_message(nc: nats.NATS, user_id: str, message: str):
    data = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "message": message,
        "timestamp": datetime.now(timezone.utc).timestamp(),
        "enable_streaming": True,
        "enable_rag": True,
    }
    subject = f"ai.chat.user.{user_id}.message"
    await nc.publish(subject, msgpack.packb(data))
```
### Subscriber (JetStream)
```python
# Assumes an established connection `nc` plus `process_chat` and
# `logger` defined elsewhere in the handler module.
async def message_handler(msg):
    try:
        data = msgpack.unpackb(msg.data, raw=False)
        # Process message
        result = await process_chat(data)
        # Publish response
        response_subject = f"ai.chat.response.{data['request_id']}"
        await nc.publish(response_subject, msgpack.packb(result))
        # Acknowledge
        await msg.ack()
    except Exception as e:
        logger.error(f"Handler error: {e}")
        await msg.nak(delay=5)  # Retry after 5s

# Subscribe with JetStream
js = nc.jetstream()
sub = await js.subscribe(
    "ai.chat.user.*.message",
    cb=message_handler,
    durable="chat-handler",
    manual_ack=True,
)
```
### Streaming Response
```python
async def stream_response(nc, request_id: str, response_generator):
    subject = f"ai.chat.response.stream.{request_id}"
    async for token in response_generator:
        chunk = {
            "request_id": request_id,
            "type": "chunk",
            "content": token,
            "done": False,
        }
        await nc.publish(subject, msgpack.packb(chunk))
    # Send done marker
    done = {
        "request_id": request_id,
        "type": "done",
        "content": "",
        "done": True,
    }
    await nc.publish(subject, msgpack.packb(done))
```
## Go Implementation
### Publisher
```go
import (
    "fmt"

    "github.com/google/uuid"
    "github.com/nats-io/nats.go"
    "github.com/vmihailenco/msgpack/v5"
)

type ChatMessage struct {
    RequestID string `msgpack:"request_id"`
    UserID    string `msgpack:"user_id"`
    Message   string `msgpack:"message"`
}

func PublishChat(nc *nats.Conn, userID, message string) error {
    msg := ChatMessage{
        RequestID: uuid.New().String(),
        UserID:    userID,
        Message:   message,
    }
    data, err := msgpack.Marshal(msg)
    if err != nil {
        return err
    }
    subject := fmt.Sprintf("ai.chat.user.%s.message", userID)
    return nc.Publish(subject, data)
}
```
## Error Handling
### NAK with Delay
```python
# Temporary failure - retry later
await msg.nak(delay=5)  # 5 second delay

# Permanent failure - move to dead letter
if attempt >= max_retries:
    await nc.publish("ai.dlq.chat", msg.data)
    await msg.term()  # Terminate delivery
```
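A fixed 5-second NAK delay can cause redeliveries to arrive in lockstep under load; a common refinement is exponential backoff with a cap. A sketch (not taken from the handlers; base and cap values are illustrative):

```python
def nak_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff for NAK redelivery: 5s, 10s, 20s, ... capped at 60s."""
    return min(cap, base * 2 ** (attempt - 1))
```

With `max_deliver: 3` on the consumer, only the first two retries ever fire, so the cap mainly matters if that limit is raised.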
### Dead Letter Queue
```yaml
stream:
  name: AI_DLQ
  subjects:
    - "ai.dlq.>"
  retention: limits
  max_age: 7d
  storage: file
```
## Monitoring
### Key Metrics
```bash
# Stream info
nats stream info COMPANIONS_CHAT
# Consumer info
nats consumer info COMPANIONS_CHAT chat-handler
# Message rate
nats stream report
```
### Prometheus Metrics
- `nats_stream_messages_total`
- `nats_consumer_pending_messages`
- `nats_consumer_ack_pending`
## Related
- [ADR-0003: Use NATS for Messaging](../decisions/0003-use-nats-for-messaging.md)
- [ADR-0004: Use MessagePack](../decisions/0004-use-messagepack-for-nats.md)
- [DOMAIN-MODEL.md](../DOMAIN-MODEL.md)

specs/README.md
# Specifications
This directory contains feature-level specifications and technical designs.
## Contents
- [BINARY_MESSAGES_AND_JETSTREAM.md](BINARY_MESSAGES_AND_JETSTREAM.md) - MessagePack format and JetStream configuration
- Future specs will be added here
## Spec Template
```markdown
# Feature Name
## Overview
Brief description of the feature
## Requirements
- Requirement 1
- Requirement 2
## Design
Technical design details
## API
Interface definitions
## Implementation Notes
Key implementation considerations
## Testing
Test strategy
## Open Questions
Unresolved items
```