diff --git a/AGENT-ONBOARDING.md b/AGENT-ONBOARDING.md
new file mode 100644
index 0000000..ee43710
--- /dev/null
+++ b/AGENT-ONBOARDING.md
@@ -0,0 +1,191 @@
+# 🤖 Agent Onboarding
+
+> **This is the most important file for AI agents working on this codebase.**
+
+## TL;DR
+
+You are working on a **homelab Kubernetes cluster** running:
+- **Talos Linux v1.12.1** on bare-metal nodes
+- **Kubernetes v1.35.0** with Flux CD GitOps
+- **AI/ML platform** with KServe, Kubeflow, Milvus, NATS
+- **Multi-GPU** (AMD ROCm, NVIDIA CUDA, Intel Arc)
+
+## 🗺️ Repository Map
+
+| Repo | What It Contains | When to Edit |
+|------|------------------|--------------|
+| `homelab-k8s2` | Kubernetes manifests, Talos config, Flux | Infrastructure changes |
+| `llm-workflows` | NATS handlers, Argo/KFP workflows | Workflow/handler changes |
+| `companions-frontend` | Go server, HTMX UI, VRM avatars | Frontend changes |
+| `homelab-design` (this) | Architecture docs, ADRs | Design decisions |
+
+## 🏗️ System Architecture (30-Second Version)
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                        USER INTERFACES                          │
+│    Companions WebApp │ Voice WebApp │ Kubeflow UI │ CLI         │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │ WebSocket/HTTP
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                       NATS MESSAGE BUS                          │
+│    Subjects: ai.chat.*, ai.voice.*, ai.pipeline.*               │
+│    Format: MessagePack (binary)                                 │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+        ┌───────────────────┼───────────────────┐
+        ▼                   ▼                   ▼
+┌───────────────┐   ┌───────────────┐   ┌───────────────┐
+│ Chat Handler  │   │Voice Assistant│   │Pipeline Bridge│
+│   (RAG+LLM)   │   │ (STT→LLM→TTS) │   │  (KFP/Argo)   │
+└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
+        │                   │                   │
+        └───────────────────┼───────────────────┘
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                         AI SERVICES                             │
+│   Whisper │ XTTS │ vLLM │ Milvus │ BGE Embed │ Reranker         │
+│    STT    │ TTS  │ LLM  │  RAG   │   Embed   │   Rank           │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## 📁 Key File Locations
+
+### Infrastructure (`homelab-k8s2`)
+
+```
+kubernetes/apps/
+├── ai-ml/                # 🧠 AI/ML services
+│   ├── kserve/           # InferenceServices
+│   ├── kubeflow/         # Pipelines, Training Operator
+│   ├── milvus/           # Vector database
+│   ├── nats/             # Message bus
+│   ├── vllm/             # LLM inference
+│   └── llm-workflows/    # GitRepo sync to llm-workflows
+├── analytics/            # 📊 Spark, Flink, ClickHouse
+├── observability/        # 📈 Grafana, Alloy, OpenTelemetry
+└── security/             # 🔒 Vault, Authentik, Falco
+
+talos/
+├── talconfig.yaml        # Node definitions
+└── patches/              # GPU-specific patches
+    ├── amd/amdgpu.yaml
+    └── nvidia/nvidia-runtime.yaml
+```
+
+### Workflows (`llm-workflows`)
+
+```
+workflows/                # NATS handler deployments
+├── chat-handler.yaml
+├── voice-assistant.yaml
+└── pipeline-bridge.yaml
+
+argo/                     # Argo WorkflowTemplates
+├── document-ingestion.yaml
+├── batch-inference.yaml
+└── qlora-training.yaml
+
+pipelines/                # Kubeflow Pipeline Python
+├── voice_pipeline.py
+└── document_ingestion_pipeline.py
+```
+
+## 🔌 Service Endpoints (Internal)
+
+```python
+# Copy-paste ready for Python code
+NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222"
+VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
+WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local"
+TTS_URL = "http://tts-predictor.ai-ml.svc.cluster.local"
+EMBEDDINGS_URL = "http://embeddings-predictor.ai-ml.svc.cluster.local"
+RERANKER_URL = "http://reranker-predictor.ai-ml.svc.cluster.local"
+MILVUS_HOST = "milvus.ai-ml.svc.cluster.local"
+MILVUS_PORT = 19530
+VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"
+```
+
+## 📨 NATS Subject Patterns
+
+```python
+# Chat
+f"ai.chat.user.{user_id}.message" # User sends message
+f"ai.chat.response.{request_id}" # Response back
+f"ai.chat.response.stream.{request_id}" # Streaming tokens
+
+# Voice
+f"ai.voice.user.{user_id}.request" # Voice input
+f"ai.voice.response.{request_id}" # Voice output
+
+# Pipelines
+"ai.pipeline.trigger" # Trigger any pipeline
+f"ai.pipeline.status.{request_id}" # Status updates
+```
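
These patterns can be wrapped in tiny helpers so handlers never hand-assemble subject strings. A minimal sketch (the helper names are illustrative, not part of the codebase; only the subject layout comes from this document):

```python
def chat_message_subject(user_id: str) -> str:
    """Subject a user's chat message is published on."""
    return f"ai.chat.user.{user_id}.message"

def chat_stream_subject(request_id: str) -> str:
    """Subject streaming response tokens are published on."""
    return f"ai.chat.response.stream.{request_id}"

def parse_chat_user(subject: str) -> str:
    """Extract the user_id from an ai.chat.user.{user_id}.message subject."""
    parts = subject.split(".")
    if len(parts) != 5 or parts[:3] != ["ai", "chat", "user"] or parts[4] != "message":
        raise ValueError(f"not a chat user subject: {subject}")
    return parts[3]

print(parse_chat_user(chat_message_subject("u42")))  # u42
```

Centralizing subject construction keeps publishers and subscribers from drifting apart when the hierarchy changes.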
+
+## 🎮 GPU Allocation
+
+| Node | GPU | Workload | Memory |
+|------|-----|----------|--------|
+| khelben | AMD Strix Halo | vLLM (dedicated) | 64GB unified |
+| elminster | NVIDIA RTX 2070 | Whisper + XTTS | 8GB VRAM |
+| drizzt | AMD Radeon 680M | BGE Embeddings | 12GB VRAM |
+| danilo | Intel Arc | Reranker | 16GB shared |
+
+## ⚡ Common Tasks
+
+### Deploy a New AI Service
+
+1. Create InferenceService in `homelab-k8s2/kubernetes/apps/ai-ml/kserve/`
+2. Add endpoint to `llm-workflows/config/ai-services-config.yaml`
+3. Push to main → Flux deploys automatically
+
+### Add a New Workflow
+
+1. Create handler in `llm-workflows/chat-handler/` or `llm-workflows/voice-assistant/`
+2. Add Kubernetes Deployment in `llm-workflows/workflows/`
+3. Push to main → Flux deploys automatically
+
+### Create Architecture Decision
+
+1. Copy `decisions/0000-template.md` to `decisions/NNNN-title.md`
+2. Fill in context, decision, consequences
+3. Submit PR
+
+## ❌ Antipatterns to Avoid
+
+1. **Don't hardcode secrets** - Use External Secrets Operator
+2. **Don't use `latest` tags** - Pin versions for reproducibility
+3. **Don't skip ADRs** - Document significant decisions
+4. **Don't bypass Flux** - All changes via Git, never `kubectl apply` directly
+
+## 📚 Where to Learn More
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) - Full system design
+- [TECH-STACK.md](TECH-STACK.md) - All technologies used
+- [decisions/](decisions/) - Why we made certain choices
+- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities
+
+## 🐛 Quick Debugging
+
+```bash
+# Check Flux sync status
+flux get all -A
+
+# View NATS JetStream streams
+kubectl exec -n ai-ml deploy/nats-box -- nats stream ls
+
+# Check GPU allocation
+kubectl describe node khelben | grep -A10 "Allocated"
+
+# View KServe inference services
+kubectl get inferenceservices -n ai-ml
+
+# Tail AI service logs
+kubectl logs -n ai-ml -l app=chat-handler -f
+```
+
+---
+
+*This document is the canonical starting point for AI agents. When in doubt, check the ADRs.*
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..d554afa
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,287 @@
+# 🏗️ System Architecture
+
+> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**
+
+## Overview
+
+The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.
+
+## System Layers
+
+```
+┌───────────────────────────────────────────────────────────────────────────┐
+│                                USER LAYER                                 │
+├───────────────────────────────────────────────────────────────────────────┤
+│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐       │
+│  │ Companions WebApp│   │   Voice WebApp   │   │   Kubeflow UI    │       │
+│  │   HTMX + Alpine  │   │    Gradio UI     │   │  Pipeline Mgmt   │       │
+│  └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘       │
+│           │ WebSocket            │ HTTP/WS              │ HTTP            │
+└───────────┴──────────────────────┴──────────────────────┴─────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                               INGRESS LAYER                               │
+├───────────────────────────────────────────────────────────────────────────┤
+│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                  │
+│                                                                           │
+│  External: *.daviestechlabs.io      Internal: *.lab.daviestechlabs.io     │
+│   • git.daviestechlabs.io            • kubeflow.lab.daviestechlabs.io     │
+│   • auth.daviestechlabs.io           • companions-chat.lab...             │
+└───────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                             MESSAGE BUS LAYER                             │
+├───────────────────────────────────────────────────────────────────────────┤
+│  NATS + JetStream                                                         │
+│  ┌─────────────────────────────────────────────────────────────────────┐  │
+│  │ Streams:                                                            │  │
+│  │  • COMPANIONS_LOGINS (7d retention)  - User analytics               │  │
+│  │  • COMPANIONS_CHAT   (30d retention) - Chat history                 │  │
+│  │  • AI_CHAT_STREAM    (5min, memory)  - Ephemeral streaming          │  │
+│  │  • AI_VOICE_STREAM   (1h, file)      - Voice processing             │  │
+│  │  • AI_PIPELINE       (24h, file)     - Workflow triggers            │  │
+│  └─────────────────────────────────────────────────────────────────────┘  │
+│                                                                           │
+│  Message Format: MessagePack (binary, not JSON)                           │
+└───────────────────────────────────────────────────────────────────────────┘
+                                   │
+          ┌────────────────────────┼────────────────────────┐
+          ▼                        ▼                        ▼
+┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
+│   Chat Handler    │   │  Voice Assistant  │   │  Pipeline Bridge  │
+├───────────────────┤   ├───────────────────┤   ├───────────────────┤
+│ • RAG retrieval   │   │ • STT (Whisper)   │   │ • KFP triggers    │
+│ • LLM inference   │   │ • RAG retrieval   │   │ • Argo triggers   │
+│ • Streaming resp  │   │ • LLM inference   │   │ • Status updates  │
+│ • Session state   │   │ • TTS (XTTS)      │   │ • Error handling  │
+└───────────────────┘   └───────────────────┘   └───────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                             AI SERVICES LAYER                             │
+├───────────────────────────────────────────────────────────────────────────┤
+│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐  │
+│  │ Whisper │ │  XTTS   │ │  vLLM   │ │ Milvus  │ │   BGE   │ │Reranker │  │
+│  │  (STT)  │ │  (TTS)  │ │  (LLM)  │ │  (RAG)  │ │ (Embed) │ │  (BGE)  │  │
+│  ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤  │
+│  │ KServe  │ │ KServe  │ │  vLLM   │ │  Helm   │ │ KServe  │ │ KServe  │  │
+│  │ nvidia  │ │ nvidia  │ │  ROCm   │ │  Minio  │ │  rdna2  │ │  intel  │  │
+│  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘  │
+└───────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                           WORKFLOW ENGINE LAYER                           │
+├───────────────────────────────────────────────────────────────────────────┤
+│  ┌────────────────────────────┐      ┌────────────────────────────┐       │
+│  │       Argo Workflows       │◄────►│     Kubeflow Pipelines     │       │
+│  ├────────────────────────────┤      ├────────────────────────────┤       │
+│  │ • Complex DAG orchestration│      │ • ML pipeline caching      │       │
+│  │ • Training workflows       │      │ • Experiment tracking      │       │
+│  │ • Document ingestion       │      │ • Model versioning         │       │
+│  │ • Batch inference          │      │ • Artifact lineage         │       │
+│  └────────────────────────────┘      └────────────────────────────┘       │
+│                                                                           │
+│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)          │
+└───────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                           INFRASTRUCTURE LAYER                            │
+├───────────────────────────────────────────────────────────────────────────┤
+│  Storage:                  Compute:                   Security:           │
+│  ├─ Longhorn (block)       ├─ Volcano Scheduler       ├─ Vault (secrets)  │
+│  ├─ NFS CSI (shared)       ├─ GPU Device Plugins      ├─ Authentik (SSO)  │
+│  └─ MinIO (S3)             │  ├─ AMD ROCm             ├─ Falco (runtime)  │
+│                            │  ├─ NVIDIA CUDA          └─ SOPS (GitOps)    │
+│  Databases:                │  └─ Intel i915/Arc                           │
+│  ├─ CloudNative-PG         └─ Node Feature Discovery                      │
+│  ├─ Valkey (cache)                                                        │
+│  └─ ClickHouse (analytics)                                                │
+└───────────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌───────────────────────────────────────────────────────────────────────────┐
+│                              PLATFORM LAYER                               │
+├───────────────────────────────────────────────────────────────────────────┤
+│  Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI                    │
+│                                                                           │
+│  Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt,     │
+│                                         │ danilo (workers)                │
+└───────────────────────────────────────────────────────────────────────────┘
+```
+
+## Node Topology
+
+### Control Plane (HA)
+
+| Node | IP | CPU | Memory | Storage | Role |
+|------|-------|-----|--------|---------|------|
+| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
+| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
+| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
+
+**VIP**: 192.168.100.20 (shared across control plane)
+
+### Worker Nodes
+
+| Node | IP | CPU | GPU | GPU Memory | Workload |
+|------|-------|-----|-----|------------|----------|
+| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
+| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
+| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
+| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |
+
+## Networking
+
+### External Access
+
+```
+Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
+```
+
+### DNS Zones
+
+- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
+- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)
+
+### Network CIDRs
+
+| Network | CIDR | Purpose |
+|---------|------|---------|
+| Node Network | 192.168.100.0/24 | Physical nodes |
+| Pod Network | 10.42.0.0/16 | Kubernetes pods |
+| Service Network | 10.43.0.0/16 | Kubernetes services |
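
The three ranges above must never overlap (a common source of CNI misconfiguration). A quick sanity-check sketch using Python's standard `ipaddress` module; this just encodes the table and is not cluster code:

```python
import ipaddress

# The three networks from the table above.
node_net = ipaddress.ip_network("192.168.100.0/24")  # physical nodes
pod_net = ipaddress.ip_network("10.42.0.0/16")       # Kubernetes pods
svc_net = ipaddress.ip_network("10.43.0.0/16")       # Kubernetes services

nets = [node_net, pod_net, svc_net]

# Confirm no two ranges overlap.
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"

# Check which network a given address belongs to, e.g. the control-plane VIP.
vip = ipaddress.ip_address("192.168.100.20")
print(vip in node_net)  # True
```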
+
+## Data Flow: Chat Request
+
+```mermaid
+sequenceDiagram
+ participant U as User
+ participant W as WebApp
+ participant N as NATS
+ participant C as Chat Handler
+ participant M as Milvus
+ participant L as vLLM
+ participant V as Valkey
+
+ U->>W: Send message
+ W->>N: Publish ai.chat.user.{id}.message
+ N->>C: Deliver to chat-handler
+ C->>V: Get session history
+ C->>M: RAG query (if enabled)
+ M-->>C: Relevant documents
+ C->>L: LLM inference (with context)
+ L-->>C: Streaming tokens
+ C->>N: Publish ai.chat.response.stream.{id}
+ N-->>W: Deliver streaming chunks
+ W-->>U: Display tokens
+ C->>V: Save to session
+```
+
+## GitOps Flow
+
+```
+Developer → Git Push → GitHub/Gitea
+                           │
+                           ▼
+                    ┌─────────────┐
+                    │   Flux CD   │
+                    │ (reconcile) │
+                    └──────┬──────┘
+                           │
+           ┌───────────────┼───────────────┐
+           ▼               ▼               ▼
+     ┌──────────┐    ┌──────────┐    ┌──────────┐
+     │ homelab- │    │   llm-   │    │   helm   │
+     │   k8s2   │    │workflows │    │  charts  │
+     └──────────┘    └──────────┘    └──────────┘
+           │               │               │
+           └───────────────┴───────────────┘
+                           │
+                           ▼
+                    ┌─────────────┐
+                    │ Kubernetes  │
+                    │   Cluster   │
+                    └─────────────┘
+```
+
+## Security Architecture
+
+### Secrets Management
+
+```
+External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
+```
+
+### Authentication
+
+```
+User ──► Cloudflare Access ──► Authentik ──► Application
+                                   │
+                                   └──► OIDC/SAML providers
+```
+
+### Network Security
+
+- **Cilium**: Network policies, eBPF-based security
+- **Falco**: Runtime security monitoring
+- **RBAC**: Fine-grained Kubernetes permissions
+
+## High Availability
+
+### Control Plane
+
+- 3-node etcd cluster with automatic leader election
+- Virtual IP (192.168.100.20) for API server access
+- Automatic failover via Talos
+
+### Workloads
+
+- Pod anti-affinity for critical services
+- HPA for auto-scaling
+- PodDisruptionBudgets for controlled updates
+
+### Storage
+
+- Longhorn 3-replica default
+- MinIO erasure coding for S3
+- Regular Velero backups
+
+## Observability
+
+### Metrics Pipeline
+
+```
+Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
+```
+
+### Logging Pipeline
+
+```
+Applications ──► Grafana Alloy ──► Loki ──► Grafana
+```
+
+### Tracing Pipeline
+
+```
+Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
+```
+
+## Key Design Decisions
+
+| Decision | Rationale | ADR |
+|----------|-----------|-----|
+| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
+| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
+| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
+| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
+| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
+
+## Related Documents
+
+- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
+- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
+- [decisions/](decisions/) - All architecture decisions
diff --git a/CODING-CONVENTIONS.md b/CODING-CONVENTIONS.md
new file mode 100644
index 0000000..9929c93
--- /dev/null
+++ b/CODING-CONVENTIONS.md
@@ -0,0 +1,424 @@
+# 📐 Coding Conventions
+
+> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**
+
+## Repository Conventions
+
+### homelab-k8s2 (Infrastructure)
+
+```
+kubernetes/
+├── apps/                       # Application deployments
+│   └── {namespace}/            # One folder per namespace
+│       └── {app}/              # One folder per application
+│           ├── app/            # Kubernetes manifests
+│           │   ├── kustomization.yaml
+│           │   ├── helmrelease.yaml  # OR individual manifests
+│           │   └── ...
+│           └── ks.yaml         # Flux Kustomization
+├── components/                 # Reusable Kustomize components
+└── flux/                       # Flux system configuration
+```
+
+**Naming Conventions:**
+- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
+- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
+- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
+
+### llm-workflows (Orchestration)
+
+```
+workflows/                  # Kubernetes Deployments for NATS handlers
+└── {handler}.yaml          # One file per handler
+
+argo/                       # Argo WorkflowTemplates
+└── {workflow-name}.yaml    # One file per workflow
+
+pipelines/                  # Kubeflow Pipeline Python files
+├── {pipeline}_pipeline.py  # Pipeline definition
+└── kfp-sync-job.yaml       # Upload job
+
+{handler}/                  # Python source code
+├── __init__.py
+├── {handler}.py            # Main entry point
+├── requirements.txt
+└── Dockerfile
+```
+
+---
+
+## Python Conventions
+
+### Project Structure
+
+```python
+from dataclasses import dataclass
+
+from nats.aio.msg import Msg  # nats-py
+
+# Use async/await for I/O
+async def handle_message(msg: Msg) -> None:
+    ...
+
+# Use dataclasses for structured data
+@dataclass
+class ChatRequest:
+ user_id: str
+ message: str
+ enable_rag: bool = True
+
+# Use msgpack for NATS messages
+import msgpack
+data = msgpack.packb({"key": "value"})
+```
+
+### Naming
+
+| Element | Convention | Example |
+|---------|------------|---------|
+| Files | snake_case | `chat_handler.py` |
+| Classes | PascalCase | `ChatHandler` |
+| Functions | snake_case | `process_message` |
+| Constants | UPPER_SNAKE | `NATS_URL` |
+| Private | Leading underscore | `_internal_method` |
+
+### Type Hints
+
+```python
+# Always use type hints
+from typing import Optional, List, Dict, Any
+
+async def query_rag(
+ query: str,
+ collection: str = "knowledge_base",
+ top_k: int = 5,
+) -> List[Dict[str, Any]]:
+ ...
+```
+
+### Error Handling
+
+```python
+# Use specific exceptions
+class RAGQueryError(Exception):
+ """Raised when RAG query fails."""
+ pass
+
+# Log errors with context
+import logging
+logger = logging.getLogger(__name__)
+
+try:
+ result = await milvus.search(...)
+except Exception as e:
+ logger.error(f"RAG query failed: {e}", extra={"query": query})
+ raise RAGQueryError(f"Failed to query collection {collection}") from e
+```
+
+### NATS Message Handling
+
+```python
+import logging
+
+import msgpack
+from nats.aio.msg import Msg
+
+logger = logging.getLogger(__name__)
+
+async def message_handler(msg: Msg) -> None:
+    try:
+        # Decode MessagePack
+        data = msgpack.unpackb(msg.data, raw=False)
+
+        # Process (application-specific logic)
+        result = await process(data)
+
+        # Reply if request-reply pattern
+        if msg.reply:
+            await msg.respond(msgpack.packb(result))
+
+        # Acknowledge for JetStream
+        await msg.ack()
+
+    except Exception as e:
+        logger.error(f"Handler error: {e}")
+        # NAK for redelivery (JetStream)
+        await msg.nak()
+
+---
+
+## Kubernetes Manifest Conventions
+
+### Labels
+
+```yaml
+metadata:
+ labels:
+ # Required
+ app.kubernetes.io/name: chat-handler
+ app.kubernetes.io/instance: chat-handler
+ app.kubernetes.io/component: handler
+ app.kubernetes.io/part-of: ai-platform
+
+ # Optional
+ app.kubernetes.io/version: "1.0.0"
+ app.kubernetes.io/managed-by: flux
+```
+
+### Annotations
+
+```yaml
+metadata:
+ annotations:
+ # Reloader for config changes
+ reloader.stakater.com/auto: "true"
+
+ # Documentation
+ description: "Handles chat messages via NATS"
+```
+
+### Resource Requests
+
+```yaml
+resources:
+ requests:
+ cpu: 100m
+ memory: 256Mi
+ limits:
+ cpu: 500m
+ memory: 512Mi
+
+# GPU workloads
+resources:
+ limits:
+ amd.com/gpu: 1 # AMD
+ nvidia.com/gpu: 1 # NVIDIA
+```
+
+### Health Checks
+
+```yaml
+livenessProbe:
+ httpGet:
+ path: /health
+ port: 8080
+ initialDelaySeconds: 10
+ periodSeconds: 30
+
+readinessProbe:
+ httpGet:
+ path: /ready
+ port: 8080
+ initialDelaySeconds: 5
+ periodSeconds: 10
+```
+
+---
+
+## Flux/GitOps Conventions
+
+### Kustomization Structure
+
+```yaml
+# ks.yaml - Flux Kustomization
+apiVersion: kustomize.toolkit.fluxcd.io/v1
+kind: Kustomization
+metadata:
+ name: &app chat-handler
+ namespace: flux-system
+spec:
+ targetNamespace: ai-ml
+ commonMetadata:
+ labels:
+ app.kubernetes.io/name: *app
+ path: ./kubernetes/apps/ai-ml/chat-handler/app
+ prune: true
+ sourceRef:
+ kind: GitRepository
+ name: flux-system
+ wait: true
+ interval: 30m
+ retryInterval: 1m
+ timeout: 5m
+```
+
+### HelmRelease Structure
+
+```yaml
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+ name: milvus
+spec:
+ interval: 30m
+ chart:
+ spec:
+ chart: milvus
+ version: 4.x.x
+ sourceRef:
+ kind: HelmRepository
+ name: milvus
+ namespace: flux-system
+ values:
+ # Values here
+```
+
+### Secret References
+
+```yaml
+# Never hardcode secrets
+env:
+ - name: DATABASE_PASSWORD
+ valueFrom:
+ secretKeyRef:
+ name: postgres-credentials
+ key: password
+```
+
+---
+
+## NATS Subject Conventions
+
+### Hierarchy
+
+```
+ai.{domain}.{scope}.{action}
+
+Examples:
+ai.chat.user.{userId}.message # User chat message
+ai.chat.response.{requestId} # Chat response
+ai.voice.user.{userId}.request # Voice request
+ai.pipeline.trigger # Pipeline trigger
+```
+
+### Wildcards
+
+```
+ai.chat.> # All chat events
+ai.chat.user.*.message # All user messages
+ai.*.response.{id} # Any response type
+```
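
NATS resolves these wildcards server-side, but a client-side matcher is handy in unit tests for routing logic. A minimal sketch under the standard NATS semantics (`*` matches exactly one token, `>` matches one or more trailing tokens); this helper is an illustration, not part of the codebase:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Match a subject against a NATS-style pattern.

    '*' matches exactly one token; '>' must be the final token and
    matches one or more remaining tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be last and requires at least one remaining token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(s_tokens) == len(p_tokens)

print(subject_matches("ai.chat.user.*.message", "ai.chat.user.u1.message"))  # True
print(subject_matches("ai.chat.>", "ai.chat.user.u1.message"))               # True
```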
+
+---
+
+## Git Conventions
+
+### Commit Messages
+
+```
+type(scope): subject
+
+body (optional)
+
+footer (optional)
+```
+
+**Types:**
+- `feat`: New feature
+- `fix`: Bug fix
+- `docs`: Documentation
+- `style`: Formatting
+- `refactor`: Code restructuring
+- `test`: Tests
+- `chore`: Maintenance
+
+**Examples:**
+```
+feat(chat-handler): add streaming response support
+fix(voice): handle empty audio gracefully
+docs(adr): add decision for MessagePack format
+```
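
A sketch of a pre-commit check for this format; the regex and type list mirror the table above and are an illustration, not an existing repo hook:

```python
import re

# type, optional (scope), then ': ' and a non-empty subject.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)"  # type
    r"(\([a-z0-9-]+\))?"                           # optional (scope)
    r": \S.*$"                                     # ': ' then subject
)

def is_valid_subject_line(line: str) -> bool:
    """Check the first line of a commit message against the convention."""
    return COMMIT_RE.match(line) is not None

print(is_valid_subject_line("feat(chat-handler): add streaming response support"))  # True
print(is_valid_subject_line("added some stuff"))                                    # False
```

Wired into a Git `commit-msg` hook, this rejects non-conforming messages before they reach the repository.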
+
+### Branch Naming
+
+```
+feature/short-description
+fix/issue-number-description
+docs/what-changed
+```
+
+---
+
+## Configuration Conventions
+
+### Environment Variables
+
+```python
+# Use pydantic-settings (v2) or similar
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+class Settings(BaseSettings):
+    model_config = SettingsConfigDict(env_prefix="")  # no prefix
+
+    nats_url: str = "nats://localhost:4222"
+    vllm_url: str = "http://localhost:8000"
+    milvus_host: str = "localhost"
+    milvus_port: int = 19530
+    log_level: str = "INFO"
+```
+
+### ConfigMaps
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: ai-services-config
+data:
+ NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
+ VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
+ # ... other non-sensitive config
+```
+
+---
+
+## Documentation Conventions
+
+### ADR Format
+
+See [decisions/0000-template.md](decisions/0000-template.md)
+
+### Code Comments
+
+```python
+# Use docstrings for public functions
+async def query_rag(query: str) -> List[Dict]:
+ """
+ Query the RAG system for relevant documents.
+
+ Args:
+ query: The search query string
+
+ Returns:
+ List of document chunks with scores
+
+ Raises:
+ RAGQueryError: If the query fails
+ """
+ ...
+```
+
+### README Files
+
+Each application should have a README with:
+1. Purpose
+2. Configuration
+3. Deployment
+4. Local development
+5. API documentation (if applicable)
+
+---
+
+## Anti-Patterns to Avoid
+
+| Don't | Do Instead |
+|-------|------------|
+| `kubectl apply` directly | Commit to Git, let Flux deploy |
+| Hardcode secrets | Use External Secrets Operator |
+| Use `latest` image tags | Pin to specific versions |
+| Skip health checks | Always define liveness/readiness |
+| Ignore resource limits | Set appropriate requests/limits |
+| Use JSON for NATS messages | Use MessagePack (binary) |
+| Synchronous I/O in handlers | Use async/await |
+
+---
+
+## Related Documents
+
+- [TECH-STACK.md](TECH-STACK.md) - Technologies used
+- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
+- [decisions/](decisions/) - Why we made certain choices
diff --git a/CONTAINER-DIAGRAM.mmd b/CONTAINER-DIAGRAM.mmd
new file mode 100644
index 0000000..7a5bfea
--- /dev/null
+++ b/CONTAINER-DIAGRAM.mmd
@@ -0,0 +1,123 @@
+%% C4 Container Diagram - Level 2
+%% DaviesTechLabs Homelab AI/ML Platform
+%%
+%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
+
+graph TB
+    subgraph users["Users"]
+        user["👤 User"]
+    end
+
+    subgraph ingress["Ingress Layer"]
+        cloudflared["cloudflared<br/>(Tunnel)"]
+        envoy["Envoy Gateway<br/>(HTTPRoute)"]
+    end
+
+    subgraph frontends["Frontend Applications"]
+        companions["Companions WebApp<br/>[Go + HTMX]<br/>AI Chat Interface"]
+        voice["Voice WebApp<br/>[Gradio]<br/>Voice Assistant UI"]
+        kubeflow_ui["Kubeflow UI<br/>[React]<br/>Pipeline Management"]
+    end
+
+    subgraph messaging["Message Bus"]
+        nats["NATS<br/>[JetStream]<br/>Event Streaming"]
+    end
+
+    subgraph handlers["NATS Handlers"]
+        chat_handler["Chat Handler<br/>[Python]<br/>RAG + LLM Orchestration"]
+        voice_handler["Voice Assistant<br/>[Python]<br/>STT → LLM → TTS"]
+        pipeline_bridge["Pipeline Bridge<br/>[Python]<br/>Workflow Triggers"]
+    end
+
+    subgraph ai_services["AI Services (KServe)"]
+        whisper["Whisper<br/>[faster-whisper]<br/>Speech-to-Text"]
+        xtts["XTTS<br/>[Coqui]<br/>Text-to-Speech"]
+        vllm["vLLM<br/>[ROCm]<br/>LLM Inference"]
+        embeddings["BGE Embeddings<br/>[sentence-transformers]<br/>Vector Encoding"]
+        reranker["BGE Reranker<br/>[sentence-transformers]<br/>Document Ranking"]
+    end
+
+    subgraph storage["Data Stores"]
+        milvus["Milvus<br/>[Vector DB]<br/>RAG Storage"]
+        valkey["Valkey<br/>[Redis API]<br/>Session Cache"]
+        postgres["CloudNative-PG<br/>[PostgreSQL]<br/>Metadata"]
+        minio["MinIO<br/>[S3 API]<br/>Object Storage"]
+    end
+
+    subgraph workflows["Workflow Engines"]
+        argo["Argo Workflows<br/>[DAG Engine]<br/>Complex Pipelines"]
+        kfp["Kubeflow Pipelines<br/>[ML Platform]<br/>Training + Inference"]
+        argo_events["Argo Events<br/>[Event Source]<br/>NATS → Workflow"]
+    end
+
+    subgraph mlops["MLOps"]
+        mlflow["MLflow<br/>[Tracking Server]<br/>Experiment Tracking"]
+        volcano["Volcano<br/>[Scheduler]<br/>GPU Scheduling"]
+    end
+
+ %% User flow
+ user --> cloudflared
+ cloudflared --> envoy
+ envoy --> companions
+ envoy --> voice
+ envoy --> kubeflow_ui
+
+ %% Frontend to NATS
+ companions --> |WebSocket| nats
+ voice --> |HTTP/WS| nats
+
+ %% NATS to handlers
+ nats --> chat_handler
+ nats --> voice_handler
+ nats --> pipeline_bridge
+
+ %% Handlers to AI services
+ chat_handler --> embeddings
+ chat_handler --> reranker
+ chat_handler --> vllm
+ chat_handler --> milvus
+ chat_handler --> valkey
+
+ voice_handler --> whisper
+ voice_handler --> embeddings
+ voice_handler --> reranker
+ voice_handler --> vllm
+ voice_handler --> xtts
+
+ %% Pipeline flow
+ pipeline_bridge --> argo_events
+ argo_events --> argo
+ argo_events --> kfp
+ kubeflow_ui --> kfp
+
+ %% Workflow to AI
+ argo --> ai_services
+ kfp --> ai_services
+ kfp --> mlflow
+
+ %% Storage connections
+ ai_services --> minio
+ milvus --> minio
+ kfp --> postgres
+ mlflow --> postgres
+ mlflow --> minio
+
+ %% GPU scheduling
+ volcano -.-> vllm
+ volcano -.-> whisper
+ volcano -.-> xtts
+
+ %% Styling
+ classDef frontend fill:#90EE90,stroke:#333
+ classDef handler fill:#87CEEB,stroke:#333
+ classDef ai fill:#FFB6C1,stroke:#333
+ classDef storage fill:#DDA0DD,stroke:#333
+ classDef workflow fill:#F0E68C,stroke:#333
+ classDef messaging fill:#FFA500,stroke:#333
+
+ class companions,voice,kubeflow_ui frontend
+ class chat_handler,voice_handler,pipeline_bridge handler
+ class whisper,xtts,vllm,embeddings,reranker ai
+ class milvus,valkey,postgres,minio storage
+ class argo,kfp,argo_events,mlflow,volcano workflow
+ class nats messaging
diff --git a/CONTEXT-DIAGRAM.mmd b/CONTEXT-DIAGRAM.mmd
new file mode 100644
index 0000000..78c075e
--- /dev/null
+++ b/CONTEXT-DIAGRAM.mmd
@@ -0,0 +1,69 @@
+%% C4 Context Diagram - Level 1
+%% DaviesTechLabs Homelab System Context
+%%
+%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
+
+graph TB
+    subgraph users["External Users"]
+        dev["👤 Developer<br/>(Billy)"]
+        family["👥 Family Members"]
+        agents["🤖 AI Agents"]
+    end
+
+    subgraph external["External Systems"]
+        cf["☁️ Cloudflare<br/>DNS + Tunnel"]
+        gh["🐙 GitHub<br/>Source Code"]
+        ghcr["📦 GHCR<br/>Container Registry"]
+        hf["🤗 Hugging Face<br/>Model Registry"]
+    end
+
+    subgraph homelab["🏠 DaviesTechLabs Homelab"]
+        direction TB
+
+        subgraph apps["Application Layer"]
+            companions["💬 Companions<br/>AI Chat"]
+            voice["🎤 Voice Assistant"]
+            media["🎬 Media Services<br/>(Jellyfin, *arr)"]
+            productivity["📝 Productivity<br/>(Nextcloud, Gitea)"]
+        end
+
+        subgraph platform["Platform Layer"]
+            k8s["☸️ Kubernetes Cluster<br/>Talos Linux"]
+        end
+
+        subgraph ai["AI/ML Layer"]
+            inference["🧠 Inference Services<br/>(vLLM, Whisper, XTTS)"]
+            workflows["⚙️ Workflow Engines<br/>(Kubeflow, Argo)"]
+            vectordb["🔍 Vector Store<br/>(Milvus)"]
+        end
+    end
+
+ %% User interactions
+ dev --> |manages| productivity
+ dev --> |develops| k8s
+ family --> |uses| media
+ family --> |chats| companions
+ agents --> |queries| inference
+
+ %% External integrations
+ cf --> |routes traffic| apps
+ gh --> |GitOps sync| k8s
+ ghcr --> |pulls images| k8s
+ hf --> |downloads models| inference
+
+ %% Internal relationships
+ apps --> platform
+ ai --> platform
+ companions --> inference
+ voice --> inference
+ workflows --> inference
+ inference --> vectordb
+
+ %% Styling
+ classDef external fill:#f9f,stroke:#333,stroke-width:2px
+ classDef homelab fill:#bbf,stroke:#333,stroke-width:2px
+ classDef user fill:#bfb,stroke:#333,stroke-width:2px
+
+ class cf,gh,ghcr,hf external
+ class companions,voice,media,productivity,k8s,inference,workflows,vectordb homelab
+ class dev,family,agents user
diff --git a/DOMAIN-MODEL.md b/DOMAIN-MODEL.md
new file mode 100644
index 0000000..94b076b
--- /dev/null
+++ b/DOMAIN-MODEL.md
@@ -0,0 +1,345 @@
+# 🧩 Domain Model
+
+> **Core entities, bounded contexts, and relationships in the DaviesTechLabs homelab**
+
+## Bounded Contexts
+
+```
+┌───────────────────────────────────────────────────────────────────────────┐
+│                             BOUNDED CONTEXTS                              │
+├───────────────────────────────────────────────────────────────────────────┤
+│                                                                           │
+│  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐      │
+│  │   CHAT CONTEXT    │  │   VOICE CONTEXT   │  │  WORKFLOW CONTEXT │      │
+│  ├───────────────────┤  ├───────────────────┤  ├───────────────────┤      │
+│  │ • ChatSession     │  │ • VoiceSession    │  │ • Pipeline        │      │
+│  │ • ChatMessage     │  │ • AudioChunk      │  │ • PipelineRun     │      │
+│  │ • Conversation    │  │ • Transcription   │  │ • Artifact        │      │
+│  │ • User            │  │ • SynthesizedAudio│  │ • Experiment      │      │
+│  └─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘      │
+│            │                      │                      │                │
+│            └──────────────────────┼──────────────────────┘                │
+│                                   │                                       │
+│                                   ▼                                       │
+│  ┌─────────────────────────────────────────────────────────────────┐      │
+│  │                       INFERENCE CONTEXT                         │      │
+│  ├─────────────────────────────────────────────────────────────────┤      │
+│  │ • InferenceRequest  • Model  • Embedding  • Document  • Chunk   │      │
+│  └─────────────────────────────────────────────────────────────────┘      │
+│                                                                           │
+└───────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Core Entities
+
+### User Context
+
+```yaml
+User:
+ id: string (UUID)
+ username: string
+ premium: boolean
+ preferences:
+ voice_id: string
+ model_preference: string
+ enable_rag: boolean
+ created_at: timestamp
+
+Session:
+ id: string (UUID)
+ user_id: string
+ type: "chat" | "voice"
+ started_at: timestamp
+ last_activity: timestamp
+ metadata: object
+```
+
+### Chat Context
+
+```yaml
+ChatMessage:
+ id: string (UUID)
+ session_id: string
+ user_id: string
+ role: "user" | "assistant" | "system"
+ content: string
+ created_at: timestamp
+ metadata:
+ tokens_used: integer
+ latency_ms: float
+ rag_sources: string[]
+ model_used: string
+
+Conversation:
+ id: string (UUID)
+ user_id: string
+ messages: ChatMessage[]
+ title: string (auto-generated)
+ created_at: timestamp
+ updated_at: timestamp
+```
+
+### Voice Context
+
+```yaml
+VoiceRequest:
+ id: string (UUID)
+ user_id: string
+ audio_b64: string (base64)
+ format: "wav" | "webm" | "mp3"
+ language: string
+ premium: boolean
+ enable_rag: boolean
+
+VoiceResponse:
+ id: string (UUID)
+ request_id: string
+ transcription: string
+ response_text: string
+ audio_b64: string (base64)
+ audio_format: string
+ latency_ms: float
+ rag_docs_used: integer
+```
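+
+On the wire, audio travels as base64 text inside the MessagePack envelope. A minimal stdlib sketch of building and unpacking a `VoiceRequest`-shaped payload (the helper functions are illustrative, not part of the production handlers):
+
+```python
+import base64
+import uuid
+
+def make_voice_request(audio: bytes, fmt: str = "wav", language: str = "en") -> dict:
+    """Wrap raw audio bytes in a VoiceRequest-shaped dict (illustrative helper)."""
+    if fmt not in ("wav", "webm", "mp3"):  # invariant from the Voice Context
+        raise ValueError(f"unsupported format: {fmt}")
+    return {
+        "id": str(uuid.uuid4()),
+        "audio_b64": base64.b64encode(audio).decode("ascii"),
+        "format": fmt,
+        "language": language,
+    }
+
+def decode_audio(request: dict) -> bytes:
+    """Recover the raw audio bytes on the handler side."""
+    return base64.b64decode(request["audio_b64"])
+```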
+
+### Inference Context
+
+```yaml
+InferenceRequest:
+ id: string (UUID)
+ service: "llm" | "stt" | "tts" | "embeddings" | "reranker"
+ input: string | bytes
+ parameters: object
+ priority: "standard" | "premium"
+
+InferenceResponse:
+ id: string (UUID)
+ request_id: string
+ output: string | bytes | float[]
+ metadata:
+ model: string
+ latency_ms: float
+ tokens: integer (if applicable)
+```
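+
+The `service` field acts as the routing key for handlers. A stdlib sketch of validating a request and mapping it onto a subject (the `ai.inference.*` subject layout here is illustrative, not the production naming):
+
+```python
+SERVICES = {"llm", "stt", "tts", "embeddings", "reranker"}
+
+def route(request: dict) -> str:
+    """Validate an InferenceRequest and return the NATS-style subject it maps to.
+    Subject layout is illustrative; see the ai.* subjects in ARCHITECTURE.md."""
+    service = request.get("service")
+    if service not in SERVICES:
+        raise ValueError(f"unknown service: {service!r}")
+    priority = request.get("priority", "standard")
+    if priority not in ("standard", "premium"):
+        raise ValueError(f"unknown priority: {priority!r}")
+    return f"ai.inference.{service}.{priority}"
+```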
+
+### RAG Context
+
+```yaml
+Document:
+ id: string (UUID)
+ collection: string
+ title: string
+ content: string
+ source_url: string
+ ingested_at: timestamp
+
+Chunk:
+ id: string (UUID)
+ document_id: string
+ content: string
+ embedding: float[1024] # BGE-large dimensions
+ metadata:
+ position: integer
+ page: integer
+
+RAGQuery:
+ query: string
+ collection: string
+ top_k: integer (default: 5)
+ rerank: boolean (default: true)
+ rerank_top_k: integer (default: 3)
+
+RAGResult:
+ chunks: Chunk[]
+ scores: float[]
+ reranked: boolean
+```
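+
+In production, Milvus performs the vector search and the BGE reranker does the rescoring. The stdlib sketch below only illustrates the `RAGQuery` contract — take `top_k` chunks by cosine similarity, then keep `rerank_top_k` after rescoring (the `rerank_fn` stand-in is hypothetical):
+
+```python
+import math
+
+def cosine(a, b):
+    """Cosine similarity between two dense vectors."""
+    dot = sum(x * y for x, y in zip(a, b))
+    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
+
+def rag_query(query_vec, chunks, top_k=5, rerank=True, rerank_top_k=3, rerank_fn=None):
+    """chunks: list of (text, embedding). Returns [(text, score)] best-first."""
+    scored = sorted(((t, cosine(query_vec, e)) for t, e in chunks),
+                    key=lambda p: p[1], reverse=True)[:top_k]
+    if rerank:
+        fn = rerank_fn or (lambda text: 0.0)  # stand-in for the BGE reranker
+        scored = sorted(((t, fn(t)) for t, _ in scored),
+                        key=lambda p: p[1], reverse=True)[:rerank_top_k]
+    return scored
+```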
+
+### Workflow Context
+
+```yaml
+Pipeline:
+ id: string
+ name: string
+ version: string
+ engine: "kubeflow" | "argo"
+ definition: object (YAML)
+
+PipelineRun:
+ id: string (UUID)
+ pipeline_id: string
+ status: "pending" | "running" | "succeeded" | "failed"
+ started_at: timestamp
+ completed_at: timestamp
+ parameters: object
+ artifacts: Artifact[]
+
+Artifact:
+ id: string (UUID)
+ run_id: string
+ name: string
+ type: "model" | "dataset" | "metrics" | "logs"
+ uri: string (s3://)
+ metadata: object
+
+Experiment:
+ id: string (UUID)
+ name: string
+ runs: PipelineRun[]
+ metrics: object
+ created_at: timestamp
+```
+
+---
+
+## Entity Relationships
+
+```mermaid
+erDiagram
+ USER ||--o{ SESSION : has
+ USER ||--o{ CONVERSATION : owns
+ SESSION ||--o{ CHAT_MESSAGE : contains
+ CONVERSATION ||--o{ CHAT_MESSAGE : contains
+
+ USER ||--o{ VOICE_REQUEST : makes
+ VOICE_REQUEST ||--|| VOICE_RESPONSE : produces
+
+ DOCUMENT ||--o{ CHUNK : contains
+ CHUNK ||--|| EMBEDDING : has
+
+ PIPELINE ||--o{ PIPELINE_RUN : executed_as
+ PIPELINE_RUN ||--o{ ARTIFACT : produces
+ EXPERIMENT ||--o{ PIPELINE_RUN : tracks
+
+ INFERENCE_REQUEST ||--|| INFERENCE_RESPONSE : produces
+```
+
+---
+
+## Aggregate Roots
+
+| Aggregate | Root Entity | Child Entities |
+|-----------|-------------|----------------|
+| Chat | Conversation | ChatMessage |
+| Voice | VoiceRequest | VoiceResponse |
+| RAG | Document | Chunk, Embedding |
+| Workflow | PipelineRun | Artifact |
+| User | User | Session, Preferences |
+
+---
+
+## Event Flow
+
+### Chat Event Stream
+
+```
+UserLogin
+ └──► SessionCreated
+      └──► MessageReceived
+           └──► RAGQueryExecuted (optional)
+                └──► InferenceRequested
+                     └──► ResponseGenerated
+                          └──► MessageStored
+```
+
+### Voice Event Stream
+
+```
+VoiceRequestReceived
+ └──► TranscriptionStarted
+      └──► TranscriptionCompleted
+           └──► RAGQueryExecuted (optional)
+                └──► LLMInferenceStarted
+                     └──► LLMResponseGenerated
+                          └──► TTSSynthesisStarted
+                               └──► AudioResponseReady
+```
+
+### Workflow Event Stream
+
+```
+PipelineTriggerReceived
+ └──► PipelineRunCreated
+      └──► StepStarted (repeated)
+           └──► StepCompleted (repeated)
+                └──► ArtifactProduced (repeated)
+                     └──► PipelineRunCompleted
+```
+
+---
+
+## Data Retention
+
+| Entity | Retention | Storage |
+|--------|-----------|---------|
+| ChatMessage | 30 days | JetStream → PostgreSQL |
+| VoiceRequest/Response | 1 hour (audio), 30 days (text) | JetStream → PostgreSQL |
+| Chunk/Embedding | Permanent | Milvus |
+| PipelineRun | Permanent | PostgreSQL |
+| Artifact | Permanent | MinIO |
+| Session | 7 days | Valkey |
+
+---
+
+## Invariants
+
+### Chat Context
+- A ChatMessage must belong to exactly one Conversation
+- A Conversation must have at least one ChatMessage
+- Messages are immutable once created
+
+### Voice Context
+- VoiceResponse must have corresponding VoiceRequest
+- Audio format must be one of: wav, webm, mp3
+- Transcription cannot be empty for valid audio
+
+### RAG Context
+- Chunk must belong to exactly one Document
+- Embedding dimensions must match model (1024 for BGE-large)
+- Document must have at least one Chunk
+
+### Workflow Context
+- PipelineRun must reference valid Pipeline
+- Artifacts must have valid S3 URIs
+- Run status transitions: pending → running → (succeeded|failed)
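+
+The status-transition invariant can be encoded as a small table. A sketch (the workflow engines enforce this in practice; this just makes the rule executable):
+
+```python
+ALLOWED = {
+    "pending": {"running"},
+    "running": {"succeeded", "failed"},
+    "succeeded": set(),  # terminal
+    "failed": set(),     # terminal
+}
+
+def transition(current: str, new: str) -> str:
+    """Return the new status, or raise if the move violates the invariant."""
+    if new not in ALLOWED.get(current, set()):
+        raise ValueError(f"illegal transition {current} -> {new}")
+    return new
+```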
+
+---
+
+## Value Objects
+
+```python
+# Immutable value objects
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class MessageContent:
+ text: str
+ tokens: int
+
+@dataclass(frozen=True)
+class AudioData:
+ data: bytes
+ format: str
+ duration_ms: int
+ sample_rate: int
+
+@dataclass(frozen=True)
+class EmbeddingVector:
+ values: tuple[float, ...]
+ model: str
+ dimensions: int
+
+@dataclass(frozen=True)
+class RAGContext:
+ chunks: tuple[str, ...]
+ scores: tuple[float, ...]
+ query: str
+```
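+
+Because these dataclasses are frozen, any mutation raises `dataclasses.FrozenInstanceError`, so value objects can be shared across handlers safely. A quick self-contained demonstration (re-declaring `MessageContent` from above):
+
+```python
+from dataclasses import dataclass, FrozenInstanceError
+
+@dataclass(frozen=True)
+class MessageContent:  # same shape as the definition above
+    text: str
+    tokens: int
+
+msg = MessageContent(text="hello", tokens=2)
+try:
+    msg.tokens = 3  # mutation is rejected on frozen dataclasses
+except FrozenInstanceError:
+    pass
+```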
+
+---
+
+## Related Documents
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
+- [GLOSSARY.md](GLOSSARY.md) - Term definitions
+- [decisions/0004-use-messagepack-for-nats.md](decisions/0004-use-messagepack-for-nats.md) - Message format decision
diff --git a/GLOSSARY.md b/GLOSSARY.md
new file mode 100644
index 0000000..ce5efa2
--- /dev/null
+++ b/GLOSSARY.md
@@ -0,0 +1,242 @@
+# Glossary
+
+> **Terminology and abbreviations used in the DaviesTechLabs homelab**
+
+## A
+
+**ADR (Architecture Decision Record)**
+: A document that captures an important architectural decision, including context, decision, and consequences.
+
+**Argo Events**
+: Event-driven automation for Kubernetes that triggers workflows based on events from various sources.
+
+**Argo Workflows**
+: A container-native workflow engine for orchestrating parallel jobs on Kubernetes.
+
+**Authentik**
+: Self-hosted identity provider supporting SAML, OIDC, and other protocols.
+
+## B
+
+**BGE (BAAI General Embedding)**
+: A family of embedding models from BAAI used for semantic search and RAG.
+
+**Bounded Context**
+: A DDD concept defining a boundary within which a particular domain model applies.
+
+## C
+
+**C4 Model**
+: A hierarchical approach to software architecture diagrams: Context, Container, Component, Code.
+
+**Cilium**
+: eBPF-based networking, security, and observability for Kubernetes.
+
+**CloudNative-PG**
+: Kubernetes operator for PostgreSQL databases.
+
+**CNI (Container Network Interface)**
+: Standard for configuring network interfaces in Linux containers.
+
+## D
+
+**DDD (Domain-Driven Design)**
+: Software design approach focusing on the core domain and domain logic.
+
+## E
+
+**Embedding**
+: A vector representation of text, used for semantic similarity and search.
+
+**Envoy Gateway**
+: Kubernetes Gateway API implementation using Envoy proxy.
+
+**External Secrets Operator (ESO)**
+: Kubernetes operator that syncs secrets from external stores (Vault, etc.).
+
+## F
+
+**Falco**
+: Runtime security tool that detects anomalous activity in containers.
+
+**Flux CD**
+: GitOps toolkit for Kubernetes, continuously reconciling cluster state with Git.
+
+## G
+
+**GitOps**
+: Operational practice using Git as the single source of truth for declarative infrastructure.
+
+**GPU Device Plugin**
+: Kubernetes plugin that exposes GPU resources to containers.
+
+## H
+
+**HelmRelease**
+: Flux CRD for managing Helm chart releases declaratively.
+
+**HTTPRoute**
+: Kubernetes Gateway API resource for HTTP routing rules.
+
+## I
+
+**InferenceService**
+: KServe CRD for deploying ML models with autoscaling and traffic management.
+
+## J
+
+**JetStream**
+: NATS persistence layer providing streaming, key-value, and object stores.
+
+## K
+
+**KServe**
+: Kubernetes-native platform for deploying and serving ML models.
+
+**Kubeflow**
+: ML toolkit for Kubernetes, including pipelines, training operators, and more.
+
+**Kustomization**
+: Flux CRD for applying Kustomize overlays from Git sources.
+
+## L
+
+**LLM (Large Language Model)**
+: AI model trained on vast text data, capable of generating human-like text.
+
+**Longhorn**
+: Cloud-native distributed storage for Kubernetes.
+
+## M
+
+**MessagePack (msgpack)**
+: Binary serialization format, more compact than JSON.
+
+**Milvus**
+: Open-source vector database for similarity search and AI applications.
+
+**MLflow**
+: Platform for managing the ML lifecycle: experiments, models, deployment.
+
+**MinIO**
+: S3-compatible object storage.
+
+## N
+
+**NATS**
+: Cloud-native messaging system for microservices, IoT, and serverless.
+
+**Node Feature Discovery (NFD)**
+: Kubernetes add-on for detecting hardware features on nodes.
+
+## P
+
+**Pipeline**
+: In ML context, a DAG of components that process data and train/serve models.
+
+**Premium User**
+: User tier with enhanced features (more RAG docs, priority routing).
+
+## R
+
+**RAG (Retrieval-Augmented Generation)**
+: AI technique combining document retrieval with LLM generation for grounded responses.
+
+**Reranker**
+: Model that rescores retrieved documents based on relevance to a query.
+
+**ROCm**
+: AMD's open-source GPU computing platform (alternative to CUDA).
+
+## S
+
+**Schematic**
+: Talos Linux concept for defining system extensions and configurations.
+
+**SOPS (Secrets OPerationS)**
+: Tool for encrypting secrets in Git repositories.
+
+**STT (Speech-to-Text)**
+: Converting spoken audio to text (e.g., Whisper).
+
+**Strix Halo**
+: Codename for AMD's Ryzen AI Max APUs, which pair a large integrated GPU with unified memory.
+
+## T
+
+**Talos Linux**
+: Minimal, immutable Linux distribution designed specifically for Kubernetes.
+
+**TTS (Text-to-Speech)**
+: Converting text to spoken audio (e.g., XTTS/Coqui).
+
+## V
+
+**Valkey**
+: Redis-compatible in-memory data store (Redis fork).
+
+**vLLM**
+: High-throughput LLM serving engine with PagedAttention.
+
+**VIP (Virtual IP)**
+: IP address shared among multiple hosts for high availability.
+
+**Volcano**
+: Kubernetes batch scheduler for high-performance workloads (ML, HPC).
+
+**VRM**
+: File format for 3D humanoid avatars.
+
+## W
+
+**Whisper**
+: OpenAI's speech recognition model.
+
+## X
+
+**XTTS**
+: Coqui's multi-language text-to-speech model with voice cloning.
+
+---
+
+## Acronyms Quick Reference
+
+| Acronym | Full Form |
+|---------|-----------|
+| ADR | Architecture Decision Record |
+| API | Application Programming Interface |
+| BGE | BAAI General Embedding |
+| CI/CD | Continuous Integration/Continuous Deployment |
+| CRD | Custom Resource Definition |
+| DAG | Directed Acyclic Graph |
+| DDD | Domain-Driven Design |
+| ESO | External Secrets Operator |
+| GPU | Graphics Processing Unit |
+| HA | High Availability |
+| HPA | Horizontal Pod Autoscaler |
+| LLM | Large Language Model |
+| ML | Machine Learning |
+| NATS | Neural Autonomic Transport System (historical; the project goes by "NATS") |
+| NFD | Node Feature Discovery |
+| OIDC | OpenID Connect |
+| RAG | Retrieval-Augmented Generation |
+| RBAC | Role-Based Access Control |
+| ROCm | Radeon Open Compute |
+| S3 | Simple Storage Service |
+| SAML | Security Assertion Markup Language |
+| SOPS | Secrets OPerationS |
+| SSO | Single Sign-On |
+| STT | Speech-to-Text |
+| TLS | Transport Layer Security |
+| TTS | Text-to-Speech |
+| UUID | Universally Unique Identifier |
+| VIP | Virtual IP |
+| VRAM | Video Random Access Memory |
+
+---
+
+## Related Documents
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) - System overview
+- [TECH-STACK.md](TECH-STACK.md) - Technology details
+- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Entity definitions
diff --git a/README.md b/README.md
index a408e86..35449c6 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,105 @@
-# homelab-design
-homelab design process goes here.
\ No newline at end of file
+# DaviesTechLabs Homelab Architecture
+> **Production-grade AI/ML platform running on bare-metal Kubernetes**
+
+[Talos Linux v1.12.1](https://talos.dev) · [Kubernetes v1.35.0](https://kubernetes.io) · [Flux CD v2](https://fluxcd.io) · [License](LICENSE)
+
+## Quick Navigation
+
+| Document | Purpose |
+|----------|---------|
+| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
+| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
+| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
+| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
+| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
+| [decisions/](decisions/) | Architecture Decision Records (ADRs) |
+
+## What This Is
+
+A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:
+
+- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
+- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
+- **GitOps**: Flux CD with SOPS encryption
+- **Event-Driven**: NATS JetStream for real-time messaging
+- **ML Workflows**: Kubeflow Pipelines + Argo Workflows
+
+## Cluster Overview
+
+| Node | Role | Hardware | GPU |
+|------|------|----------|-----|
+| storm | Control Plane | Intel 13th Gen | Integrated |
+| bruenor | Control Plane | Intel 13th Gen | Integrated |
+| catti | Control Plane | Intel 13th Gen | Integrated |
+| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
+| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
+| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
+| danilo | Worker | Intel Core Ultra 9 | Intel Arc |
+
+## Quick Start
+
+### View Current Cluster State
+
+```bash
+# Get node status
+kubectl get nodes -o wide
+
+# View AI/ML workloads
+kubectl get pods -n ai-ml
+
+# Check KServe inference services
+kubectl get inferenceservices -n ai-ml
+```
+
+### Key Endpoints
+
+| Service | URL | Purpose |
+|---------|-----|---------|
+| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
+| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
+| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
+| Gitea | `git.daviestechlabs.io` | Self-hosted Git |
+
+## Repository Structure
+
+```
+homelab-design/
+├── README.md                # This file
+├── AGENT-ONBOARDING.md      # AI agent quick-start
+├── ARCHITECTURE.md          # High-level system overview
+├── CONTEXT-DIAGRAM.mmd      # C4 Level 1 (Mermaid)
+├── CONTAINER-DIAGRAM.mmd    # C4 Level 2
+├── TECH-STACK.md            # Complete tech stack
+├── DOMAIN-MODEL.md          # Core entities
+├── CODING-CONVENTIONS.md    # Patterns & practices
+├── GLOSSARY.md              # Terminology
+├── decisions/               # ADRs
+│   ├── 0000-template.md
+│   ├── 0001-record-architecture-decisions.md
+│   ├── 0002-use-talos-linux.md
+│   └── ...
+├── specs/                   # Feature specifications
+└── diagrams/                # Additional diagrams
+```
+
+## Related Repositories
+
+| Repository | Purpose |
+|------------|---------|
+| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
+| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
+| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |
+
+## Contributing
+
+1. For architecture changes, create an ADR in `decisions/`
+2. Update relevant documentation
+3. Submit a PR with context
+
+---
+
+*Last updated: 2026-02-01*
diff --git a/TECH-STACK.md b/TECH-STACK.md
new file mode 100644
index 0000000..bbb1de8
--- /dev/null
+++ b/TECH-STACK.md
@@ -0,0 +1,271 @@
+# Technology Stack
+
+> **Complete inventory of technologies used in the DaviesTechLabs homelab**
+
+## Platform Layer
+
+### Operating System
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
+| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |
+
+### Container Orchestration
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
+| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
+| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |
+
+### GitOps
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
+| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
+| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |
+
+---
+
+## AI/ML Layer
+
+### Inference Engines
+
+| Service | Framework | GPU | Model Type |
+|---------|-----------|-----|------------|
+| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
+| [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
+| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
+| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
+| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |
+
+### ML Serving
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
+| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
+
+### ML Workflows
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
+| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
+| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
+| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |
+
+### GPU Scheduling
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
+| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
+| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
+| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |
+
+---
+
+## Data Layer
+
+### Databases
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata |
+| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
+| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
+| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |
+
+### Object Storage
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [MinIO](https://min.io) | Latest | S3-compatible storage |
+| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
+| NFS CSI Driver | Latest | Shared filesystem |
+
+### Messaging
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [NATS](https://nats.io) | Latest | Message bus |
+| NATS JetStream | Built-in | Persistent streaming |
+
+### Data Processing
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
+| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
+| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
+| [Nessie](https://projectnessie.org) | Latest | Data catalog |
+| [Trino](https://trino.io) | 479 | SQL query engine |
+
+---
+
+## Application Layer
+
+### Web Frameworks
+
+| Application | Language | Framework | Purpose |
+|-------------|----------|-----------|---------|
+| Companions | Go | net/http + HTMX | AI chat interface |
+| Voice WebApp | Python | Gradio | Voice assistant UI |
+| Various handlers | Python | asyncio + nats.py | NATS event handlers |
+
+### Frontend
+
+| Technology | Purpose |
+|------------|---------|
+| [HTMX](https://htmx.org) | Dynamic HTML updates |
+| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
+| [VRM](https://vrm.dev) | 3D avatar rendering |
+
+---
+
+## Networking Layer
+
+### Ingress
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
+| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |
+
+### DNS & Certificates
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
+| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |
+
+### Service Mesh
+
+| Component | Purpose |
+|-----------|---------|
+| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |
+
+---
+
+## Security Layer
+
+### Identity & Access
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
+| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
+| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |
+
+### Runtime Security
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
+| Cilium Network Policies | Built-in | Network segmentation |
+
+### Backup
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |
+
+---
+
+## Observability Layer
+
+### Metrics
+
+| Component | Purpose |
+|-----------|---------|
+| [Prometheus](https://prometheus.io) | Metrics collection |
+| [Grafana](https://grafana.com) | Dashboards & visualization |
+
+### Logging
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
+| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |
+
+### Tracing
+
+| Component | Purpose |
+|-----------|---------|
+| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
+| Tempo/Jaeger | Trace storage & query |
+
+---
+
+## Development Tools
+
+### Local Development
+
+| Tool | Purpose |
+|------|---------|
+| [mise](https://mise.jdx.dev) | Tool version management |
+| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
+| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |
+
+### CI/CD
+
+| Tool | Purpose |
+|------|---------|
+| GitHub Actions | CI/CD pipelines |
+| [Renovate](https://renovatebot.com) | Dependency updates |
+
+### Image Building
+
+| Tool | Purpose |
+|------|---------|
+| Docker | Container builds |
+| GHCR | Container registry |
+
+---
+
+## Media & Entertainment
+
+| Component | Version | Purpose |
+|-----------|---------|---------|
+| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
+| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
+| Prowlarr, Bazarr, etc. | Various | *arr stack |
+| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |
+
+---
+
+## Python Dependencies (llm-workflows)
+
+```text
+# Core
+nats-py>=2.7.0 # NATS client
+msgpack>=1.0.0 # Binary serialization
+aiohttp>=3.9.0 # HTTP client
+
+# ML/AI
+pymilvus>=2.4.0 # Milvus client
+sentence-transformers # Embeddings
+openai>=1.0.0 # vLLM OpenAI API
+
+# Kubeflow
+kfp>=2.12.1 # Pipeline SDK
+```
+
+---
+
+## Version Pinning Strategy
+
+| Component Type | Strategy |
+|----------------|----------|
+| Base images | Pin major.minor |
+| Helm charts | Pin exact version |
+| Python packages | Pin minimum version |
+| System extensions | Pin via Talos schematic |
+
+## Related Documents
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
+- [decisions/](decisions/) - Why we chose specific technologies
diff --git a/decisions/0000-template.md b/decisions/0000-template.md
new file mode 100644
index 0000000..0566a72
--- /dev/null
+++ b/decisions/0000-template.md
@@ -0,0 +1,71 @@
+# [short title of solved problem and solution]
+
+* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
+* Date: YYYY-MM-DD
+* Deciders: [list of people involved in decision]
+* Technical Story: [description | ticket/issue URL]
+
+## Context and Problem Statement
+
+[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
+
+## Decision Drivers
+
+* [driver 1, e.g., a force, facing concern, …]
+* [driver 2, e.g., a force, facing concern, …]
+* …
+
+## Considered Options
+
+* [option 1]
+* [option 2]
+* [option 3]
+* …
+
+## Decision Outcome
+
+Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].
+
+### Positive Consequences
+
+* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
+* …
+
+### Negative Consequences
+
+* [e.g., compromising quality attribute, follow-up decisions required, …]
+* …
+
+## Pros and Cons of the Options
+
+### [option 1]
+
+[example | description | pointer to more information | …]
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* …
+
+### [option 2]
+
+[example | description | pointer to more information | …]
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* …
+
+### [option 3]
+
+[example | description | pointer to more information | …]
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* …
+
+## Links
+
+* [Link type] [Link to ADR]
+* …
diff --git a/decisions/0001-record-architecture-decisions.md b/decisions/0001-record-architecture-decisions.md
new file mode 100644
index 0000000..6a48c48
--- /dev/null
+++ b/decisions/0001-record-architecture-decisions.md
@@ -0,0 +1,79 @@
+# Record Architecture Decisions
+
+* Status: accepted
+* Date: 2025-11-30
+* Deciders: Billy Davies
+* Technical Story: Initial setup of homelab documentation
+
+## Context and Problem Statement
+
+As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.
+
+## Decision Drivers
+
+* Need to preserve context for why decisions were made
+* Enable future maintainers (including AI agents) to understand the system
+* Provide a structured way to evaluate alternatives
+* Support the wiki/design process for iterative improvements
+
+## Considered Options
+
+* Informal documentation in README files
+* Wiki pages without structure
+* Architecture Decision Records (ADRs)
+* No documentation (rely on code)
+
+## Decision Outcome
+
+Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.
+
+### Positive Consequences
+
+* Clear historical record of decisions
+* Structured format makes decisions searchable
+* Forces consideration of alternatives
+* Git-versioned alongside code
+* AI agents can parse and understand decisions
+
+### Negative Consequences
+
+* Requires discipline to create ADRs
+* May accumulate outdated decisions over time
+* Additional overhead for simple decisions
+
+## Pros and Cons of the Options
+
+### Informal README documentation
+
+* Good, because low friction
+* Good, because close to code
+* Bad, because no structure for alternatives
+* Bad, because decisions get buried in prose
+
+### Wiki pages
+
+* Good, because easy to edit
+* Good, because supports rich formatting
+* Bad, because separate from code repository
+* Bad, because no enforced structure
+
+### ADRs
+
+* Good, because structured format
+* Good, because version controlled
+* Good, because captures alternatives considered
+* Good, because industry-standard practice
+* Bad, because requires creating new files
+* Bad, because may seem bureaucratic for small decisions
+
+### No documentation
+
+* Good, because no overhead
+* Bad, because context is lost
+* Bad, because makes onboarding difficult
+* Bad, because risky for future changes
+
+## Links
+
+* Based on [MADR template](https://adr.github.io/madr/)
+* [ADR GitHub organization](https://adr.github.io/)
diff --git a/decisions/0002-use-talos-linux.md b/decisions/0002-use-talos-linux.md
new file mode 100644
index 0000000..f5bd437
--- /dev/null
+++ b/decisions/0002-use-talos-linux.md
@@ -0,0 +1,97 @@
+# Use Talos Linux for Kubernetes Nodes
+
+* Status: accepted
+* Date: 2025-11-30
+* Deciders: Billy Davies
+* Technical Story: Selecting OS for bare-metal Kubernetes cluster
+
+## Context and Problem Statement
+
+We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).
+
+## Decision Drivers
+
+* Security-first design (immutable, minimal)
+* API-driven management (no SSH)
+* Support for various GPU drivers
+* Kubernetes-native focus
+* Community support and updates
+* Ease of upgrades
+
+## Considered Options
+
+* Ubuntu Server with kubeadm
+* Flatcar Container Linux
+* Talos Linux
+* k3OS (discontinued)
+* Rocky Linux with RKE2
+
+## Decision Outcome
+
+Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
+
+### Positive Consequences
+
+* Immutable root filesystem prevents drift
+* No SSH reduces attack vectors
+* API-driven management integrates well with GitOps
+* Schematic system allows custom kernel modules (GPU drivers)
+* Consistent configuration across all nodes
+* Automatic updates with minimal disruption
+
+### Negative Consequences
+
+* Learning curve for API-driven management
+* Debugging requires different approaches (no SSH)
+* Custom extensions require schematic IDs
+* Less flexibility for non-Kubernetes workloads
+
+## Pros and Cons of the Options
+
+### Ubuntu Server with kubeadm
+
+* Good, because familiar
+* Good, because extensive package availability
+* Good, because easy debugging via SSH
+* Bad, because mutable system leads to drift
+* Bad, because large attack surface
+* Bad, because manual package management
+
+### Flatcar Container Linux
+
+* Good, because immutable
+* Good, because auto-updates
+* Good, because container-focused
+* Bad, because less Kubernetes-specific
+* Bad, because smaller community than Talos
+* Bad, because GPU driver setup more complex
+
+### Talos Linux
+
+* Good, because purpose-built for Kubernetes
+* Good, because immutable and minimal
+* Good, because API-driven (no SSH)
+* Good, because excellent Kubernetes integration
+* Good, because active development and community
+* Good, because schematic system for GPU drivers
+* Bad, because learning curve
+* Bad, because no traditional debugging
+
+### k3OS
+
+* Good, because simple
+* Bad, because discontinued
+
+### Rocky Linux with RKE2
+
+* Good, because enterprise-like
+* Good, because familiar Linux experience
+* Bad, because mutable system
+* Bad, because more operational overhead
+* Bad, because larger attack surface
+
+## Links
+
+* [Talos Linux](https://talos.dev)
+* [Talos Image Factory](https://factory.talos.dev)
+* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics
diff --git a/decisions/0003-use-nats-for-messaging.md b/decisions/0003-use-nats-for-messaging.md
new file mode 100644
index 0000000..7ad495a
--- /dev/null
+++ b/decisions/0003-use-nats-for-messaging.md
@@ -0,0 +1,112 @@
+# Use NATS for AI/ML Messaging
+
+* Status: accepted
+* Date: 2025-12-01
+* Deciders: Billy Davies
+* Technical Story: Selecting message bus for AI service orchestration
+
+## Context and Problem Statement
+
+The AI/ML platform requires a messaging system for:
+- Real-time chat message routing
+- Voice request/response streaming
+- Pipeline triggers and status updates
+- Event-driven workflow orchestration
+
+We need a messaging system that handles both ephemeral real-time messages and persistent streams.
+
+## Decision Drivers
+
+* Low latency for real-time chat/voice
+* Persistence for audit and replay
+* Simple operations for homelab
+* Support for request-reply pattern
+* Wildcard subscriptions for routing
+* Binary message support (audio data)
+
+## Considered Options
+
+* Apache Kafka
+* RabbitMQ
+* Redis Pub/Sub + Streams
+* NATS with JetStream
+* Apache Pulsar
+
+## Decision Outcome
+
+Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
+
+### Positive Consequences
+
+* Sub-millisecond latency for real-time messages
+* JetStream provides persistence when needed
+* Simple deployment (single binary)
+* Excellent Kubernetes integration
+* Request-reply pattern built-in
+* Wildcard subscriptions for flexible routing
+* Low resource footprint
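
The wildcard subscriptions above follow NATS subject rules: `*` matches exactly one dot-delimited token and `>` matches one or more trailing tokens. A pure-Python sketch of the matching semantics (not a NATS client, just an illustration of how subjects like `ai.chat.user.*.message` route):

```python
# Illustration of NATS subject wildcard semantics (not a NATS client):
# "*" matches exactly one token, ">" matches one or more trailing tokens.

def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches a concrete subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" must be the last token and consumes at least one more token
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

# A chat handler subscribing to all user messages:
print(subject_matches("ai.chat.user.*.message", "ai.chat.user.42.message"))  # True
print(subject_matches("ai.chat.>", "ai.chat.response.stream.42"))            # True
print(subject_matches("ai.chat.>", "ai.voice.user.42.request"))              # False
```

This is why a single chat-handler subscription can cover every user ID while voice traffic stays on its own subject tree.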
+
+### Negative Consequences
+
+* Less ecosystem than Kafka
+* JetStream less mature than Kafka Streams
+* No built-in schema registry
+* Smaller community than RabbitMQ
+
+## Pros and Cons of the Options
+
+### Apache Kafka
+
+* Good, because industry standard for streaming
+* Good, because rich ecosystem (Kafka Streams, Connect)
+* Good, because schema registry
+* Good, because excellent for high throughput
+* Bad, because operationally complex (ZooKeeper/KRaft)
+* Bad, because high resource requirements
+* Bad, because overkill for homelab scale
+* Bad, because higher latency for real-time messages
+
+### RabbitMQ
+
+* Good, because mature and stable
+* Good, because flexible routing
+* Good, because good management UI
+* Bad, because AMQP protocol overhead
+* Bad, because not designed for streaming
+* Bad, because more complex clustering
+
+### Redis Pub/Sub + Streams
+
+* Good, because simple
+* Good, because Redis may already be in the stack
+* Good, because low latency
+* Bad, because pub/sub not persistent
+* Bad, because streams API less intuitive
+* Bad, because not primary purpose of Redis
+
+### NATS with JetStream
+
+* Good, because extremely low latency
+* Good, because simple operations
+* Good, because both pub/sub and persistence
+* Good, because request-reply built-in
+* Good, because wildcard subscriptions
+* Good, because low resource usage
+* Good, because excellent Go/Python clients
+* Bad, because smaller ecosystem
+* Bad, because JetStream newer than Kafka
+
+### Apache Pulsar
+
+* Good, because unified messaging + streaming
+* Good, because multi-tenancy
+* Good, because geo-replication
+* Bad, because complex architecture
+* Bad, because high resource requirements
+* Bad, because smaller community
+
+## Links
+
+* [NATS.io](https://nats.io)
+* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
+* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format
diff --git a/decisions/0004-use-messagepack-for-nats.md b/decisions/0004-use-messagepack-for-nats.md
new file mode 100644
index 0000000..09b4cf9
--- /dev/null
+++ b/decisions/0004-use-messagepack-for-nats.md
@@ -0,0 +1,137 @@
+# Use MessagePack for NATS Messages
+
+* Status: accepted
+* Date: 2025-12-01
+* Deciders: Billy Davies
+* Technical Story: Selecting serialization format for NATS messages
+
+## Context and Problem Statement
+
+NATS messages in the AI platform carry various payloads:
+- Text chat messages (small)
+- Voice audio data (potentially large, base64 or binary)
+- Streaming response chunks
+- Pipeline parameters
+
+We need a serialization format that handles both text and binary efficiently.
+
+## Decision Drivers
+
+* Efficient binary data handling (audio)
+* Compact message size
+* Fast serialization/deserialization
+* Cross-language support (Python, Go)
+* Debugging ability
+* Schema flexibility
+
+## Considered Options
+
+* JSON
+* Protocol Buffers (protobuf)
+* MessagePack (msgpack)
+* CBOR
+* Avro
+
+## Decision Outcome
+
+Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.
+
+### Positive Consequences
+
+* Native binary support (no base64 overhead for audio)
+* 20-50% smaller than JSON for typical messages
+* Faster serialization than JSON
+* No schema compilation step
+* Easy debugging (can pretty-print like JSON)
+* Excellent Python and Go libraries
+
+### Negative Consequences
+
+* Less human-readable than JSON when raw
+* No built-in schema validation
+* Slightly less common than JSON
+
+## Pros and Cons of the Options
+
+### JSON
+
+* Good, because human-readable
+* Good, because universal support
+* Good, because no setup required
+* Bad, because binary data requires base64 (33% overhead)
+* Bad, because larger message sizes
+* Bad, because slower parsing
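
The base64 overhead noted above is easy to verify with only the standard library; the payload shape mirrors the platform's voice messages, with illustrative field names:

```python
import base64
import json

# 1 MiB of fake audio; real payloads come from Whisper/XTTS in this platform.
audio = bytes(range(256)) * 4096  # 1,048,576 bytes

# JSON cannot carry raw bytes, so audio must be base64-encoded first.
encoded = base64.b64encode(audio).decode("ascii")
json_payload = json.dumps({"user_id": "user-123", "audio": encoded})

overhead = len(json_payload) / len(audio)
print(f"JSON+base64 payload is {overhead:.2f}x the raw audio size")  # ~1.33x
```

MessagePack, by contrast, carries the bytes natively, so the serialized size stays close to the raw audio size.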
+
+### Protocol Buffers
+
+* Good, because very compact
+* Good, because fast
+* Good, because schema validation
+* Good, because cross-language
+* Bad, because requires schema definition
+* Bad, because compilation step
+* Bad, because less flexible for evolving schemas
+* Bad, because overkill for simple messages
+
+### MessagePack
+
+* Good, because binary-efficient
+* Good, because JSON-like simplicity
+* Good, because no schema required
+* Good, because excellent library support
+* Good, because can include raw bytes
+* Bad, because not human-readable raw
+* Bad, because no schema validation
+
+### CBOR
+
+* Good, because binary-efficient
+* Good, because IETF standard
+* Good, because schema-less
+* Bad, because less common libraries
+* Bad, because smaller community
+* Bad, because similar to msgpack with less adoption
+
+### Avro
+
+* Good, because schema evolution
+* Good, because compact
+* Good, because schema registry integration
+* Bad, because requires schema
+* Bad, because more complex setup
+* Bad, because Java-centric ecosystem
+
+## Implementation Notes
+
+```python
+# Python usage
+import msgpack
+
+# Serialize
+data = {
+ "user_id": "user-123",
+ "audio": audio_bytes, # Raw bytes, no base64
+ "premium": True
+}
+payload = msgpack.packb(data)
+
+# Deserialize
+data = msgpack.unpackb(payload, raw=False)
+```
+
+```go
+// Go usage
+import "github.com/vmihailenco/msgpack/v5"
+
+type Message struct {
+    UserID string `msgpack:"user_id"`
+    Audio  []byte `msgpack:"audio"`
+}
+
+// Serialize and deserialize (error handling omitted)
+payload, _ := msgpack.Marshal(&Message{UserID: "user-123", Audio: audioBytes})
+var msg Message
+_ = msgpack.Unmarshal(payload, &msg)
+```
+
+## Links
+
+* [MessagePack Specification](https://msgpack.org)
+* [msgpack-python](https://github.com/msgpack/msgpack-python)
+* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
+* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)
diff --git a/decisions/0005-multi-gpu-strategy.md b/decisions/0005-multi-gpu-strategy.md
new file mode 100644
index 0000000..54c782e
--- /dev/null
+++ b/decisions/0005-multi-gpu-strategy.md
@@ -0,0 +1,145 @@
+# Multi-GPU Heterogeneous Strategy
+
+* Status: accepted
+* Date: 2025-12-01
+* Deciders: Billy Davies
+* Technical Story: GPU allocation strategy for AI workloads
+
+## Context and Problem Statement
+
+The homelab has diverse GPU hardware:
+- AMD Strix Halo (64GB unified memory) - khelben
+- NVIDIA RTX 2070 (8GB VRAM) - elminster
+- AMD Radeon 680M (12GB VRAM) - drizzt
+- Intel Arc (integrated) - danilo
+
+Different AI workloads have different requirements. How do we allocate GPUs effectively?
+
+## Decision Drivers
+
+* Maximize utilization of all GPUs
+* Match workloads to appropriate hardware
+* Support concurrent inference services
+* Enable fractional GPU sharing where appropriate
+* Minimize cross-vendor complexity
+
+## Considered Options
+
+* Single GPU vendor only
+* All workloads on largest GPU
+* Workload-specific GPU allocation
+* Dynamic GPU scheduling (MIG/fractional)
+
+## Decision Outcome
+
+Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
+
+### Allocation Strategy
+
+| Workload | GPU | Node | Rationale |
+|----------|-----|------|-----------|
+| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
+| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
+| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
+| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
+| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
+
+### Positive Consequences
+
+* Each workload gets optimal hardware
+* No GPU memory contention for LLM
+* NVIDIA services can share via time-slicing
+* Cost-effective use of varied hardware
+* Clear ownership and debugging
+
+### Negative Consequences
+
+* More complex scheduling (node taints/tolerations)
+* Less flexibility for workload migration
+* Must maintain multiple GPU driver stacks
+* Some GPUs underutilized at times
+
+## Implementation
+
+### Node Taints
+
+```yaml
+# khelben - dedicated vLLM node
+nodeTaints:
+ dedicated: "vllm:NoSchedule"
+```
+
+### Pod Tolerations and Node Affinity
+
+```yaml
+# vLLM deployment
+spec:
+ tolerations:
+ - key: "dedicated"
+ operator: "Equal"
+ value: "vllm"
+ effect: "NoSchedule"
+ affinity:
+ nodeAffinity:
+ requiredDuringSchedulingIgnoredDuringExecution:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: kubernetes.io/hostname
+ operator: In
+ values: ["khelben"]
+```
+
+### Resource Limits
+
+```yaml
+# NVIDIA GPU (elminster)
+resources:
+ limits:
+ nvidia.com/gpu: 1
+
+# AMD GPU (drizzt, khelben)
+resources:
+ limits:
+ amd.com/gpu: 1
+```
+
+## Pros and Cons of the Options
+
+### Single GPU vendor only
+
+* Good, because simpler driver management
+* Good, because consistent tooling
+* Bad, because wastes existing hardware
+* Bad, because higher cost for new hardware
+
+### All workloads on largest GPU
+
+* Good, because simple scheduling
+* Good, because unified memory benefits
+* Bad, because memory contention
+* Bad, because single point of failure
+* Bad, because wastes other GPUs
+
+### Workload-specific allocation (chosen)
+
+* Good, because optimal hardware matching
+* Good, because uses all available GPUs
+* Good, because clear resource boundaries
+* Good, because parallel inference
+* Bad, because more complex configuration
+* Bad, because multiple driver stacks
+
+### Dynamic GPU scheduling
+
+* Good, because flexible
+* Good, because maximizes utilization
+* Bad, because complex to implement
+* Bad, because MIG not available on consumer GPUs
+* Bad, because cross-vendor scheduling immature
+
+## Links
+
+* [Volcano Scheduler](https://volcano.sh)
+* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
+* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
+* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics
diff --git a/decisions/0006-gitops-with-flux.md b/decisions/0006-gitops-with-flux.md
new file mode 100644
index 0000000..4c55e39
--- /dev/null
+++ b/decisions/0006-gitops-with-flux.md
@@ -0,0 +1,140 @@
+# GitOps with Flux CD
+
+* Status: accepted
+* Date: 2025-11-30
+* Deciders: Billy Davies
+* Technical Story: Implementing GitOps for cluster management
+
+## Context and Problem Statement
+
+Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.
+
+## Decision Drivers
+
+* Infrastructure as Code (IaC) principles
+* Audit trail for all changes
+* Self-healing cluster state
+* Multi-repository support
+* Secret encryption integration
+* Active community and maintenance
+
+## Considered Options
+
+* Manual kubectl apply
+* ArgoCD
+* Flux CD
+* Rancher Fleet
+* Pulumi/Terraform for Kubernetes
+
+## Decision Outcome
+
+Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.
+
+### Positive Consequences
+
+* Git is single source of truth
+* Automatic drift detection and correction
+* Native SOPS/Age secret encryption
+* Multi-repository support (homelab-k8s2 + llm-workflows)
+* Helm and Kustomize native support
+* Webhook-free sync (pull-based)
+
+### Negative Consequences
+
+* No built-in UI (use CLI or third-party)
+* Learning curve for CRD-based configuration
+* Debugging requires understanding Flux controllers
+
+## Configuration
+
+### Repository Structure
+
+```
+homelab-k8s2/
+├── kubernetes/
+│   ├── flux/                    # Flux system config
+│   │   ├── config/
+│   │   │   ├── cluster.yaml
+│   │   │   └── secrets.yaml     # SOPS encrypted
+│   │   └── repositories/
+│   │       ├── helm/            # HelmRepositories
+│   │       └── git/             # GitRepositories
+│   └── apps/                    # Application Kustomizations
+```
+
+### Multi-Repository Sync
+
+```yaml
+# GitRepository for llm-workflows
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: GitRepository
+metadata:
+ name: llm-workflows
+ namespace: flux-system
+spec:
+ url: ssh://git@github.com/Billy-Davies-2/llm-workflows
+ ref:
+ branch: main
+ secretRef:
+ name: github-deploy-key
+```
+
+### SOPS Integration
+
+```yaml
+# .sops.yaml
+creation_rules:
+ - path_regex: .*\.sops\.yaml$
+ age: >-
+ age1... # Public key
+```
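
On the cluster side, Flux's kustomize-controller is told to decrypt with SOPS via the Kustomization's `decryption` field. A sketch follows; the repository, path, and secret names are assumptions for this homelab, not verified values:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: homelab-k8s2
  path: ./kubernetes/apps
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-age   # Secret holding the Age private key
```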
+
+## Pros and Cons of the Options
+
+### Manual kubectl apply
+
+* Good, because simple
+* Good, because no setup
+* Bad, because no audit trail
+* Bad, because no drift detection
+* Bad, because not reproducible
+
+### ArgoCD
+
+* Good, because great UI
+* Good, because app-of-apps pattern
+* Good, because large community
+* Bad, because heavier resource usage
+* Bad, because webhook-dependent sync
+* Bad, because SOPS requires plugins
+
+### Flux CD
+
+* Good, because lightweight
+* Good, because pull-based (no webhooks)
+* Good, because native SOPS support
+* Good, because multi-source/multi-tenant
+* Good, because Kubernetes-native CRDs
+* Bad, because no built-in UI
+* Bad, because CRD learning curve
+
+### Rancher Fleet
+
+* Good, because integrated with Rancher
+* Good, because multi-cluster
+* Bad, because Rancher ecosystem lock-in
+* Bad, because smaller community
+
+### Pulumi/Terraform
+
+* Good, because familiar IaC tools
+* Good, because drift detection
+* Bad, because not Kubernetes-native
+* Bad, because requires state management
+* Bad, because not continuous reconciliation
+
+## Links
+
+* [Flux CD](https://fluxcd.io)
+* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
+* [flux-local](https://github.com/allenporter/flux-local) - Local testing
diff --git a/decisions/0007-use-kserve-for-inference.md b/decisions/0007-use-kserve-for-inference.md
new file mode 100644
index 0000000..da9598f
--- /dev/null
+++ b/decisions/0007-use-kserve-for-inference.md
@@ -0,0 +1,115 @@
+# Use KServe for ML Model Serving
+
+* Status: accepted
+* Date: 2025-12-15
+* Deciders: Billy Davies
+* Technical Story: Selecting model serving platform for inference services
+
+## Context and Problem Statement
+
+We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
+
+## Decision Drivers
+
+* Standardized inference protocol (V2)
+* Autoscaling based on load
+* Traffic splitting for canary deployments
+* Integration with Kubeflow ecosystem
+* GPU resource management
+* Health checks and readiness
+
+## Considered Options
+
+* Raw Kubernetes Deployments + Services
+* KServe InferenceService
+* Seldon Core
+* BentoML
+* Ray Serve only
+
+## Decision Outcome
+
+Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
+
+### Positive Consequences
+
+* Standardized V2 inference protocol
+* Automatic scale-to-zero capability
+* Canary/blue-green deployments
+* Integration with Kubeflow UI
+* Transformer/Explainer components
+* GPU resource abstraction
+
+### Negative Consequences
+
+* Additional CRDs and operators
+* Learning curve for InferenceService spec
+* Some overhead for simple deployments
+* Knative Serving dependency (optional)
+
+## Pros and Cons of the Options
+
+### Raw Kubernetes Deployments
+
+* Good, because simple
+* Good, because full control
+* Bad, because no autoscaling logic
+* Bad, because manual service mesh
+* Bad, because repetitive configuration
+
+### KServe InferenceService
+
+* Good, because standardized API
+* Good, because autoscaling
+* Good, because traffic management
+* Good, because Kubeflow integration
+* Bad, because operator complexity
+* Bad, because Knative optional dependency
+
+### Seldon Core
+
+* Good, because mature
+* Good, because A/B testing
+* Good, because explainability
+* Bad, because more complex than KServe
+* Bad, because heavier resource usage
+
+### BentoML
+
+* Good, because developer-friendly
+* Good, because packaging focused
+* Bad, because less Kubernetes-native
+* Bad, because smaller community
+
+### Ray Serve
+
+* Good, because unified compute
+* Good, because Python-native
+* Good, because fractional GPU
+* Bad, because less standardized API
+* Bad, because Ray cluster overhead
+
+## Current Configuration
+
+```yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+ name: whisper
+ namespace: ai-ml
+spec:
+ predictor:
+ minReplicas: 1
+ maxReplicas: 3
+ containers:
+ - name: whisper
+ image: ghcr.io/org/whisper:latest
+ resources:
+ limits:
+ nvidia.com/gpu: 1
+```
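
A request against this service follows the V2 inference protocol (`POST /v2/models/whisper/infer`). A minimal sketch of the request envelope built in Python; the tensor name, shape, and datatype are illustrative, not Whisper's actual schema:

```python
import json

# Illustrative V2 (Open Inference Protocol) request body. The tensor name
# "audio" and its shape/datatype are assumptions, not the real Whisper schema.
request = {
    "inputs": [
        {
            "name": "audio",
            "shape": [1, 16000],
            "datatype": "FP32",
            "data": [0.0] * 16000,  # one second of silence at 16 kHz
        }
    ]
}

# The predictor would receive this as the JSON body of the /infer call.
body = json.dumps(request)
print(len(request["inputs"][0]["data"]))  # 16000
```

The response mirrors this shape with an `outputs` list, which is what makes swapping predictors behind the same route straightforward.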
+
+## Links
+
+* [KServe](https://kserve.github.io)
+* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
+* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
diff --git a/decisions/0008-use-milvus-for-vectors.md b/decisions/0008-use-milvus-for-vectors.md
new file mode 100644
index 0000000..5fab389
--- /dev/null
+++ b/decisions/0008-use-milvus-for-vectors.md
@@ -0,0 +1,107 @@
+# Use Milvus for Vector Storage
+
+* Status: accepted
+* Date: 2025-12-15
+* Deciders: Billy Davies
+* Technical Story: Selecting vector database for RAG system
+
+## Context and Problem Statement
+
+The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
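
Conceptually, the similarity search is a nearest-neighbour lookup over embedding vectors. A brute-force sketch is shown below; Milvus meets the latency target with ANN indexes (e.g. HNSW, IVF) rather than scanning every vector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=2):
    """Brute-force nearest neighbours; a vector DB replaces this with ANN indexes."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-D "chunk embeddings"; real BGE embeddings are ~1024-dimensional.
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], chunks))  # [0, 1] — the two most similar chunks
```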
+
+## Decision Drivers
+
+* Query performance (< 100ms for top-k search)
+* Scalability to millions of vectors
+* Kubernetes-native deployment
+* Active development and community
+* Support for metadata filtering
+* Backup and restore capabilities
+
+## Considered Options
+
+* Milvus
+* Pinecone (managed)
+* Qdrant
+* Weaviate
+* pgvector (PostgreSQL extension)
+* Chroma
+
+## Decision Outcome
+
+Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.
+
+### Positive Consequences
+
+* High-performance similarity search
+* Horizontal scalability
+* Rich filtering and hybrid search
+* Helm chart for Kubernetes
+* Active CNCF sandbox project
+* GPU acceleration available
+
+### Negative Consequences
+
+* Complex architecture (multiple components)
+* Higher resource usage than simpler alternatives
+* Requires object storage (MinIO)
+* Learning curve for optimization
+
+## Pros and Cons of the Options
+
+### Milvus
+
+* Good, because production-proven at scale
+* Good, because rich query API
+* Good, because Kubernetes-native
+* Good, because hybrid search (vector + scalar)
+* Good, because CNCF project
+* Bad, because complex architecture
+* Bad, because higher resource usage
+
+### Pinecone
+
+* Good, because fully managed
+* Good, because simple API
+* Good, because reliable
+* Bad, because external dependency
+* Bad, because cost at scale
+* Bad, because data sovereignty concerns
+
+### Qdrant
+
+* Good, because simpler than Milvus
+* Good, because Rust performance
+* Good, because good filtering
+* Bad, because smaller community
+* Bad, because fewer enterprise features
+
+### Weaviate
+
+* Good, because built-in vectorization
+* Good, because GraphQL API
+* Good, because modules system
+* Bad, because more opinionated
+* Bad, because schema requirements
+
+### pgvector
+
+* Good, because familiar PostgreSQL
+* Good, because simple deployment
+* Good, because ACID transactions
+* Bad, because limited scale
+* Bad, because slower for large datasets
+* Bad, because no specialized optimizations
+
+### Chroma
+
+* Good, because simple
+* Good, because embedded option
+* Bad, because not production-ready at scale
+* Bad, because limited features
+
+## Links
+
+* [Milvus](https://milvus.io)
+* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
+* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities
diff --git a/decisions/0009-dual-workflow-engines.md b/decisions/0009-dual-workflow-engines.md
new file mode 100644
index 0000000..94be24b
--- /dev/null
+++ b/decisions/0009-dual-workflow-engines.md
@@ -0,0 +1,124 @@
+# Dual Workflow Engine Strategy (Argo + Kubeflow)
+
+* Status: accepted
+* Date: 2026-01-15
+* Deciders: Billy Davies
+* Technical Story: Selecting workflow orchestration for ML pipelines
+
+## Context and Problem Statement
+
+The AI platform needs workflow orchestration for:
+- ML training pipelines with caching
+- Document ingestion (batch)
+- Complex DAG workflows (training โ evaluation โ deployment)
+- Hybrid scenarios combining both
+
+Should we standardize on a single engine or leverage the strengths of several?
+
+## Decision Drivers
+
+* ML-specific features (caching, lineage)
+* Complex DAG support
+* Kubernetes-native execution
+* Visibility and debugging
+* Community and ecosystem
+* Integration with existing tools
+
+## Considered Options
+
+* Kubeflow Pipelines only
+* Argo Workflows only
+* Both engines with clear use cases
+* Airflow on Kubernetes
+* Prefect/Dagster
+
+## Decision Outcome
+
+Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
+
+### Decision Matrix
+
+| Use Case | Engine | Reason |
+|----------|--------|--------|
+| ML training with caching | Kubeflow | Component caching, experiment tracking |
+| Model evaluation | Kubeflow | Metric collection, comparison |
+| Document ingestion | Argo | Simple DAG, no ML features needed |
+| Batch inference | Argo | Parallelization, retries |
+| Complex DAG with branching | Argo | Superior control flow |
+| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
+
+### Positive Consequences
+
+* Best tool for each job
+* ML pipelines get proper caching
+* Complex workflows get better DAG support
+* Can integrate via Argo Events
+* Gradual migration possible
+
+### Negative Consequences
+
+* Two systems to maintain
+* Team needs to learn both
+* More complex debugging
+* Integration overhead
+
+## Integration Architecture
+
+```
+NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
+                                        │
+                                        └──► Kubeflow Pipeline (via API)
+
+                      OR
+
+Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
+                  (WorkflowTemplate)
+```
+
+## Pros and Cons of the Options
+
+### Kubeflow Pipelines only
+
+* Good, because ML-focused
+* Good, because caching
+* Good, because experiment tracking
+* Bad, because limited DAG features
+* Bad, because less flexible control flow
+
+### Argo Workflows only
+
+* Good, because powerful DAG
+* Good, because flexible
+* Good, because great debugging
+* Bad, because no ML caching
+* Bad, because no experiment tracking
+
+### Both engines (chosen)
+
+* Good, because best of both
+* Good, because appropriate tool per job
+* Good, because can integrate
+* Bad, because operational complexity
+* Bad, because learning two systems
+
+### Airflow
+
+* Good, because mature
+* Good, because large community
+* Bad, because Python-centric
+* Bad, because not Kubernetes-native
+* Bad, because no ML features
+
+### Prefect/Dagster
+
+* Good, because modern design
+* Good, because Python-native
+* Bad, because less Kubernetes-native
+* Bad, because newer/less proven
+
+## Links
+
+* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
+* [Argo Workflows](https://argoproj.github.io/workflows/)
+* [Argo Events](https://argoproj.github.io/events/)
+* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)
diff --git a/decisions/0010-use-envoy-gateway.md b/decisions/0010-use-envoy-gateway.md
new file mode 100644
index 0000000..77ccdde
--- /dev/null
+++ b/decisions/0010-use-envoy-gateway.md
@@ -0,0 +1,120 @@
+# Use Envoy Gateway for Ingress
+
+* Status: accepted
+* Date: 2025-12-01
+* Deciders: Billy Davies
+* Technical Story: Selecting ingress controller for cluster
+
+## Context and Problem Statement
+
+We need an ingress solution that supports:
+- Gateway API (modern Kubernetes standard)
+- gRPC for ML inference
+- WebSocket for real-time chat/voice
+- Header-based routing for A/B testing
+- TLS termination
+
+## Decision Drivers
+
+* Gateway API support (HTTPRoute, GRPCRoute)
+* WebSocket support
+* gRPC support
+* Performance at edge
+* Active development
+* Envoy ecosystem familiarity
+
+## Considered Options
+
+* NGINX Ingress Controller
+* Traefik
+* Envoy Gateway
+* Istio Gateway
+* Contour
+
+## Decision Outcome
+
+Chosen option: "Envoy Gateway", because it's the reference implementation of Gateway API with full Envoy feature set.
+
+### Positive Consequences
+
+* Native Gateway API support
+* Full Envoy feature set
+* WebSocket and gRPC native
+* No Istio complexity
+* CNCF graduated project (Envoy)
+* Easy integration with observability
+
+### Negative Consequences
+
+* Newer than alternatives
+* Less documentation than NGINX
+* Envoy configuration learning curve
+
+## Pros and Cons of the Options
+
+### NGINX Ingress
+
+* Good, because mature
+* Good, because well-documented
+* Good, because familiar
+* Bad, because limited Gateway API
+* Bad, because commercial features gated
+
+### Traefik
+
+* Good, because auto-discovery
+* Good, because good UI
+* Good, because Let's Encrypt
+* Bad, because Gateway API experimental
+* Bad, because less gRPC focus
+
+### Envoy Gateway
+
+* Good, because Gateway API native
+* Good, because full Envoy features
+* Good, because extensible
+* Good, because gRPC/WebSocket native
+* Bad, because newer project
+* Bad, because less community content
+
+### Istio Gateway
+
+* Good, because full mesh features
+* Good, because Gateway API
+* Bad, because overkill without mesh
+* Bad, because resource heavy
+
+### Contour
+
+* Good, because Envoy-based
+* Good, because lightweight
+* Bad, because Gateway API evolving
+* Bad, because smaller community
+
+## Configuration Example
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+ name: companions-chat
+spec:
+ parentRefs:
+ - name: eg-gateway
+ namespace: network
+ hostnames:
+ - companions-chat.lab.daviestechlabs.io
+ rules:
+ - matches:
+ - path:
+ type: PathPrefix
+ value: /
+ backendRefs:
+ - name: companions-chat
+ port: 8080
+```
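
For gRPC inference traffic, the same gateway can carry a `GRPCRoute` (Gateway API v1). A sketch with illustrative hostnames and backend names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: vllm-grpc
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - vllm.lab.daviestechlabs.io   # illustrative hostname
  rules:
    - backendRefs:
        - name: vllm               # illustrative backend service
          port: 8081
```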
+
+## Links
+
+* [Envoy Gateway](https://gateway.envoyproxy.io)
+* [Gateway API](https://gateway-api.sigs.k8s.io)
diff --git a/diagrams/README.md b/diagrams/README.md
new file mode 100644
index 0000000..e5e219f
--- /dev/null
+++ b/diagrams/README.md
@@ -0,0 +1,35 @@
+# Diagrams
+
+This directory contains additional architecture diagrams beyond the main C4 diagrams.
+
+## Available Diagrams
+
+| File | Description |
+|------|-------------|
+| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
+| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
+| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |
+
+## Rendering Diagrams
+
+### VS Code
+
+Install the "Markdown Preview Mermaid Support" extension.
+
+### CLI
+
+```bash
+# Using mmdc (Mermaid CLI)
+npx -p @mermaid-js/mermaid-cli mmdc -i diagram.mmd -o diagram.png
+```
+
+### Online
+
+Use the [Mermaid Live Editor](https://mermaid.live).
+
+## Diagram Conventions
+
+1. Use `.mmd` extension for Mermaid diagrams
+2. Include title as comment at top of file
+3. Use consistent styling classes
+4. Keep diagrams focused (one concept per diagram)
diff --git a/diagrams/data-flow-chat.mmd b/diagrams/data-flow-chat.mmd
new file mode 100644
index 0000000..67fd4ea
--- /dev/null
+++ b/diagrams/data-flow-chat.mmd
@@ -0,0 +1,51 @@
+%% Chat Request Data Flow
+%% Sequence diagram showing chat message processing
+
+sequenceDiagram
+ autonumber
+ participant U as User
+    participant W as WebApp (companions)
+ participant N as NATS
+ participant C as Chat Handler
+    participant V as Valkey (Cache)
+ participant E as BGE Embeddings
+ participant M as Milvus
+ participant R as Reranker
+ participant L as vLLM
+
+ U->>W: Send message
+ W->>N: Publish ai.chat.user.{id}.message
+ N->>C: Deliver message
+
+ C->>V: Get session history
+ V-->>C: Previous messages
+
+ alt RAG Enabled
+ C->>E: Generate query embedding
+ E-->>C: Query vector
+ C->>M: Search similar chunks
+ M-->>C: Top-K chunks
+
+ opt Reranker Enabled
+ C->>R: Rerank chunks
+ R-->>C: Reordered chunks
+ end
+ end
+
+ C->>L: LLM inference (context + query)
+
+ alt Streaming Enabled
+ loop For each token
+ L-->>C: Token
+ C->>N: Publish ai.chat.response.stream.{id}
+ N-->>W: Deliver chunk
+ W-->>U: Display token
+ end
+ else Non-streaming
+ L-->>C: Full response
+ C->>N: Publish ai.chat.response.{id}
+ N-->>W: Deliver response
+ W-->>U: Display response
+ end
+
+ C->>V: Save to session history
diff --git a/diagrams/data-flow-voice.mmd b/diagrams/data-flow-voice.mmd
new file mode 100644
index 0000000..872b4f9
--- /dev/null
+++ b/diagrams/data-flow-voice.mmd
@@ -0,0 +1,46 @@
+%% Voice Request Data Flow
+%% Sequence diagram showing voice assistant processing
+
+sequenceDiagram
+ autonumber
+ participant U as User
+ participant W as Voice WebApp
+ participant N as NATS
+ participant VA as Voice Assistant
+    participant STT as Whisper (STT)
+ participant E as BGE Embeddings
+ participant M as Milvus
+ participant R as Reranker
+ participant L as vLLM
+    participant TTS as XTTS (TTS)
+
+ U->>W: Record audio
+    W->>N: Publish ai.voice.user.{id}.request (msgpack with audio bytes)
+ N->>VA: Deliver voice request
+
+ VA->>STT: Transcribe audio
+ STT-->>VA: Transcription text
+
+ alt RAG Enabled
+ VA->>E: Generate query embedding
+ E-->>VA: Query vector
+ VA->>M: Search similar chunks
+ M-->>VA: Top-K chunks
+
+ opt Reranker Enabled
+ VA->>R: Rerank chunks
+ R-->>VA: Reordered chunks
+ end
+ end
+
+ VA->>L: LLM inference
+ L-->>VA: Response text
+
+ VA->>TTS: Synthesize speech
+ TTS-->>VA: Audio bytes
+
+    VA->>N: Publish ai.voice.response.{id} (text + audio)
+ N-->>W: Deliver response
+ W-->>U: Play audio + show text
+
+ Note over VA,TTS: Total latency target: < 3s
diff --git a/diagrams/gpu-allocation.mmd b/diagrams/gpu-allocation.mmd
new file mode 100644
index 0000000..82e4f08
--- /dev/null
+++ b/diagrams/gpu-allocation.mmd
@@ -0,0 +1,47 @@
+%% GPU Allocation Diagram
+%% Shows how AI workloads are distributed across GPU nodes
+
+flowchart TB
+    subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
+ direction TB
+        vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
+ end
+
+    subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
+ direction TB
+        whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
+        xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
+ end
+
+    subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
+ direction TB
+        embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
+ end
+
+    subgraph danilo["🖥️ danilo (Intel Arc)"]
+ direction TB
+        reranker["🔍 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
+ end
+
+ subgraph workloads["Workload Routing"]
+        chat["💬 Chat Request"]
+        voice["🎤 Voice Request"]
+ end
+
+ chat --> embeddings
+ chat --> reranker
+ chat --> vllm
+
+ voice --> whisper
+ voice --> embeddings
+ voice --> reranker
+ voice --> vllm
+ voice --> xtts
+
+ classDef nvidia fill:#76B900,color:white
+ classDef amd fill:#ED1C24,color:white
+ classDef intel fill:#0071C5,color:white
+
+ class whisper,xtts nvidia
+ class vllm,embeddings amd
+ class reranker intel
diff --git a/specs/BINARY_MESSAGES_AND_JETSTREAM.md b/specs/BINARY_MESSAGES_AND_JETSTREAM.md
new file mode 100644
index 0000000..85c7ca9
--- /dev/null
+++ b/specs/BINARY_MESSAGES_AND_JETSTREAM.md
@@ -0,0 +1,287 @@
+# Binary Messages and JetStream Configuration
+
+> Technical specification for NATS message handling in the AI platform
+
+## Overview
+
+The AI platform uses NATS with JetStream for message persistence. All messages use MessagePack (msgpack) binary format for efficiency, especially when handling audio data.
+
+## Message Format
+
+### Why MessagePack?
+
+1. **Binary efficiency**: Audio data embedded directly without base64 overhead
+2. **Compact**: 20-50% smaller than the equivalent JSON
+3. **Fast**: Lower serialization/deserialization overhead than JSON
+4. **Compatible**: JSON-like data model, so payloads remain easy to inspect and debug
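Point 1 is easy to quantify: JSON has no binary type, so audio must be base64-encoded, which inflates it by roughly a third before any JSON framing is added. A stdlib-only sketch with illustrative numbers (not measurements from the platform):

```python
import base64
import json

# Simulated one second of 16 kHz, 16-bit mono PCM audio: 32,000 raw bytes
audio = bytes(32000)

# JSON cannot carry raw bytes, so the audio must become base64 text
encoded = base64.b64encode(audio).decode("ascii")
json_payload = json.dumps({"audio": encoded}).encode("utf-8")

# msgpack's bin format would instead add only a small length header,
# keeping the payload at roughly the raw audio size
overhead = len(json_payload) / len(audio)
print(f"JSON payload is {overhead:.2f}x the raw audio size")
```

The same asymmetry applies on the way back, since voice responses carry synthesized audio to the client.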
+
+### Schema
+
+All messages follow this general structure:
+
+```python
+{
+ "request_id": str, # UUID for correlation
+ "user_id": str, # User identifier
+ "timestamp": float, # Unix timestamp
+ "payload": Any, # Type-specific data
+ "metadata": dict # Optional metadata
+}
+```
+
+### Chat Message
+
+```python
+{
+ "request_id": "uuid-here",
+ "user_id": "user-123",
+ "username": "john_doe",
+ "message": "Hello, how are you?",
+ "premium": False,
+ "enable_streaming": True,
+ "enable_rag": True,
+ "enable_reranker": True,
+ "top_k": 5,
+ "session_id": "session-abc"
+}
+```
+
+### Voice Message
+
+```python
+{
+ "request_id": "uuid-here",
+ "user_id": "user-123",
+ "audio": b"...", # Raw bytes, not base64!
+ "format": "wav",
+ "sample_rate": 16000,
+ "premium": False,
+ "enable_rag": True,
+ "language": "en"
+}
+```
+
+### Streaming Response Chunk
+
+```python
+{
+ "request_id": "uuid-here",
+ "type": "chunk", # "chunk", "done", "error"
+ "content": "token",
+ "done": False,
+ "timestamp": 1706000000.0
+}
+```
+
+## JetStream Configuration
+
+### Streams
+
+| Stream | Subjects | Retention | Max Age | Storage | Replicas |
+|--------|----------|-----------|---------|---------|----------|
+| `COMPANIONS_LOGINS` | `ai.chat.user.*.login` | Limits | 7 days | File | 1 |
+| `COMPANIONS_CHAT` | `ai.chat.user.*.message`, `ai.chat.user.*.greeting.*` | Limits | 30 days | File | 1 |
+| `AI_CHAT_STREAM` | `ai.chat.response.stream.>` | Limits | 5 min | Memory | 1 |
+| `AI_VOICE_STREAM` | `ai.voice.>` | Limits | 1 hour | File | 1 |
+| `AI_VOICE_RESPONSE_STREAM` | `ai.voice.response.stream.>` | Limits | 5 min | Memory | 1 |
+| `AI_PIPELINE` | `ai.pipeline.>` | Limits | 24 hours | File | 1 |
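Which stream captures a given message follows standard NATS subject matching: `*` matches exactly one dot-separated token, and a trailing `>` matches one or more remaining tokens. A small illustrative matcher (not part of the platform code) makes the table above concrete:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style matching: '*' = exactly one token, '>' = one or more trailing tokens."""
    pat, sub = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pat):
        if tok == ">":
            return len(sub) > i  # '>' must match at least one token
        if i >= len(sub):
            return False
        if tok != "*" and tok != sub[i]:
            return False
    return len(sub) == len(pat)

# COMPANIONS_CHAT captures per-user messages...
assert subject_matches("ai.chat.user.*.message", "ai.chat.user.123.message")
# ...but not streamed responses, which land in AI_CHAT_STREAM
assert not subject_matches("ai.chat.user.*.message", "ai.chat.response.stream.123")
assert subject_matches("ai.chat.response.stream.>", "ai.chat.response.stream.123")
```

This is why chat requests and their streamed responses persist with different retention: they belong to different streams.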
+
+### Consumer Configuration
+
+```yaml
+# Durable consumer for chat handler
+consumer:
+ name: chat-handler
+ durable_name: chat-handler
+ filter_subjects:
+ - "ai.chat.user.*.message"
+ ack_policy: explicit
+ ack_wait: 30s
+ max_deliver: 3
+ deliver_policy: new
+```
+
+### Stream Creation (CLI)
+
+```bash
+# Create chat stream
+nats stream add COMPANIONS_CHAT \
+ --subjects "ai.chat.user.*.message,ai.chat.user.*.greeting.*" \
+ --retention limits \
+ --max-age 30d \
+ --storage file \
+ --replicas 1
+
+# Create ephemeral stream
+nats stream add AI_CHAT_STREAM \
+ --subjects "ai.chat.response.stream.>" \
+ --retention limits \
+ --max-age 5m \
+ --storage memory \
+ --replicas 1
+```
+
+## Python Implementation
+
+### Publisher
+
+```python
+import uuid
+from datetime import datetime, timezone
+
+import msgpack
+import nats
+
+async def publish_chat_message(nc: nats.NATS, user_id: str, message: str):
+    data = {
+        "request_id": str(uuid.uuid4()),
+        "user_id": user_id,
+        "message": message,
+        "timestamp": datetime.now(timezone.utc).timestamp(),
+ "enable_streaming": True,
+ "enable_rag": True,
+ }
+
+ subject = f"ai.chat.user.{user_id}.message"
+ await nc.publish(subject, msgpack.packb(data))
+```
+
+### Subscriber (JetStream)
+
+```python
+# Assumes `nc` (connected NATS client) and `logger` are defined at module scope
+async def message_handler(msg):
+ try:
+ data = msgpack.unpackb(msg.data, raw=False)
+
+ # Process message
+ result = await process_chat(data)
+
+ # Publish response
+ response_subject = f"ai.chat.response.{data['request_id']}"
+ await nc.publish(response_subject, msgpack.packb(result))
+
+ # Acknowledge
+ await msg.ack()
+
+ except Exception as e:
+ logger.error(f"Handler error: {e}")
+ await msg.nak(delay=5) # Retry after 5s
+
+# Subscribe with JetStream
+js = nc.jetstream()
+sub = await js.subscribe(
+ "ai.chat.user.*.message",
+ cb=message_handler,
+ durable="chat-handler",
+ manual_ack=True
+)
+```
+
+### Streaming Response
+
+```python
+async def stream_response(nc, request_id: str, response_generator):
+ subject = f"ai.chat.response.stream.{request_id}"
+
+ async for token in response_generator:
+ chunk = {
+ "request_id": request_id,
+ "type": "chunk",
+ "content": token,
+ "done": False
+ }
+ await nc.publish(subject, msgpack.packb(chunk))
+
+ # Send done marker
+ done = {
+ "request_id": request_id,
+ "type": "done",
+ "content": "",
+ "done": True
+ }
+ await nc.publish(subject, msgpack.packb(done))
+```
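On the consuming side, the WebApp appends `chunk` payloads in order until the `done` marker arrives. A minimal stdlib sketch of that reassembly loop (the msgpack decode step is omitted):

```python
def assemble_stream(chunks: list[dict]) -> str:
    """Concatenate streamed chunk payloads, stopping at the done marker."""
    parts = []
    for chunk in chunks:
        if chunk["type"] == "error":
            raise RuntimeError(f"stream {chunk['request_id']} failed")
        if chunk["type"] == "done" or chunk["done"]:
            break
        parts.append(chunk["content"])
    return "".join(parts)

received = [
    {"request_id": "r1", "type": "chunk", "content": "Hel", "done": False},
    {"request_id": "r1", "type": "chunk", "content": "lo!", "done": False},
    {"request_id": "r1", "type": "done", "content": "", "done": True},
]
print(assemble_stream(received))  # -> Hello!
```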
+
+## Go Implementation
+
+### Publisher
+
+```go
+import (
+    "fmt"
+
+    "github.com/google/uuid"
+    "github.com/nats-io/nats.go"
+    "github.com/vmihailenco/msgpack/v5"
+)
+
+type ChatMessage struct {
+ RequestID string `msgpack:"request_id"`
+ UserID string `msgpack:"user_id"`
+ Message string `msgpack:"message"`
+}
+
+func PublishChat(nc *nats.Conn, userID, message string) error {
+ msg := ChatMessage{
+ RequestID: uuid.New().String(),
+ UserID: userID,
+ Message: message,
+ }
+
+ data, err := msgpack.Marshal(msg)
+ if err != nil {
+ return err
+ }
+
+ subject := fmt.Sprintf("ai.chat.user.%s.message", userID)
+ return nc.Publish(subject, data)
+}
+```
+
+## Error Handling
+
+### NAK with Delay
+
+```python
+# Temporary failure - retry later
+await msg.nak(delay=5) # 5 second delay
+
+# Permanent failure - move to dead letter
+if attempt >= max_retries:
+ await nc.publish("ai.dlq.chat", msg.data)
+ await msg.term() # Terminate delivery
+```
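One way to keep this retry boundary testable is to factor the settle decision into a pure function; the names below are illustrative, not taken from the handlers:

```python
def settle_action(attempt: int, max_deliver: int = 3) -> str:
    """Decide how to settle a failed delivery, mirroring max_deliver: 3."""
    if attempt < max_deliver:
        return "nak"  # NAK with delay; JetStream redelivers later
    return "dlq"      # republish to ai.dlq.* and TERM the message

# The first two failures retry; the third goes to the dead letter queue
print([settle_action(a) for a in (1, 2, 3)])  # -> ['nak', 'nak', 'dlq']
```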
+
+### Dead Letter Queue
+
+```yaml
+stream:
+ name: AI_DLQ
+ subjects:
+ - "ai.dlq.>"
+ retention: limits
+ max_age: 7d
+ storage: file
+```
+
+## Monitoring
+
+### Key Metrics
+
+```bash
+# Stream info
+nats stream info COMPANIONS_CHAT
+
+# Consumer info
+nats consumer info COMPANIONS_CHAT chat-handler
+
+# Message rate
+nats stream report
+```
+
+### Prometheus Metrics
+
+- `nats_stream_messages_total`
+- `nats_consumer_pending_messages`
+- `nats_consumer_ack_pending`
+
+## Related
+
+- [ADR-0003: Use NATS for Messaging](../decisions/0003-use-nats-for-messaging.md)
+- [ADR-0004: Use MessagePack](../decisions/0004-use-messagepack-for-nats.md)
+- [DOMAIN-MODEL.md](../DOMAIN-MODEL.md)
diff --git a/specs/README.md b/specs/README.md
new file mode 100644
index 0000000..1b65dfc
--- /dev/null
+++ b/specs/README.md
@@ -0,0 +1,36 @@
+# Specifications
+
+This directory contains feature-level specifications and technical designs.
+
+## Contents
+
+- [BINARY_MESSAGES_AND_JETSTREAM.md](BINARY_MESSAGES_AND_JETSTREAM.md) - MessagePack format and JetStream configuration
+- Future specs will be added here
+
+## Spec Template
+
+```markdown
+# Feature Name
+
+## Overview
+Brief description of the feature
+
+## Requirements
+- Requirement 1
+- Requirement 2
+
+## Design
+Technical design details
+
+## API
+Interface definitions
+
+## Implementation Notes
+Key implementation considerations
+
+## Testing
+Test strategy
+
+## Open Questions
+Unresolved items
+```