feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and `kubectl cluster-info dump` data.

---

**AGENT-ONBOARDING.md** (new file, 191 lines)

# 🤖 Agent Onboarding

> **This is the most important file for AI agents working on this codebase.**

## TL;DR

You are working on a **homelab Kubernetes cluster** running:

- **Talos Linux v1.12.1** on bare-metal nodes
- **Kubernetes v1.35.0** with Flux CD GitOps
- **AI/ML platform** with KServe, Kubeflow, Milvus, NATS
- **Multi-GPU** (AMD ROCm, NVIDIA CUDA, Intel Arc)

## 🗺️ Repository Map

| Repo | What It Contains | When to Edit |
|------|------------------|--------------|
| `homelab-k8s2` | Kubernetes manifests, Talos config, Flux | Infrastructure changes |
| `llm-workflows` | NATS handlers, Argo/KFP workflows | Workflow/handler changes |
| `companions-frontend` | Go server, HTMX UI, VRM avatars | Frontend changes |
| `homelab-design` (this) | Architecture docs, ADRs | Design decisions |

## 🏗️ System Architecture (30-Second Version)

```
┌─────────────────────────────────────────────────────────────────┐
│                         USER INTERFACES                         │
│   Companions WebApp │ Voice WebApp │ Kubeflow UI │ CLI          │
└───────────────────────────┬─────────────────────────────────────┘
                            │ WebSocket/HTTP
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                        NATS MESSAGE BUS                         │
│   Subjects: ai.chat.*, ai.voice.*, ai.pipeline.*                │
│   Format: MessagePack (binary)                                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Chat Handler  │   │Voice Assistant│   │Pipeline Bridge│
│  (RAG+LLM)    │   │ (STT→LLM→TTS) │   │  (KFP/Argo)   │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                          AI SERVICES                            │
│  Whisper │ XTTS │ vLLM │ Milvus │ BGE Embed │ Reranker          │
│   STT    │ TTS  │ LLM  │  RAG   │   Embed   │   Rank            │
└─────────────────────────────────────────────────────────────────┘
```

## 📁 Key File Locations

### Infrastructure (`homelab-k8s2`)

```
kubernetes/apps/
├── ai-ml/                 # 🧠 AI/ML services
│   ├── kserve/            # InferenceServices
│   ├── kubeflow/          # Pipelines, Training Operator
│   ├── milvus/            # Vector database
│   ├── nats/              # Message bus
│   ├── vllm/              # LLM inference
│   └── llm-workflows/     # GitRepo sync to llm-workflows
├── analytics/             # 📊 Spark, Flink, ClickHouse
├── observability/         # 📈 Grafana, Alloy, OpenTelemetry
└── security/              # 🔒 Vault, Authentik, Falco

talos/
├── talconfig.yaml         # Node definitions
├── patches/               # GPU-specific patches
│   ├── amd/amdgpu.yaml
│   └── nvidia/nvidia-runtime.yaml
```

### Workflows (`llm-workflows`)

```
workflows/                 # NATS handler deployments
├── chat-handler.yaml
├── voice-assistant.yaml
└── pipeline-bridge.yaml

argo/                      # Argo WorkflowTemplates
├── document-ingestion.yaml
├── batch-inference.yaml
└── qlora-training.yaml

pipelines/                 # Kubeflow Pipeline Python
├── voice_pipeline.py
└── document_ingestion_pipeline.py
```

## 🔌 Service Endpoints (Internal)

```python
# Copy-paste ready for Python code
NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222"
VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local"
TTS_URL = "http://tts-predictor.ai-ml.svc.cluster.local"
EMBEDDINGS_URL = "http://embeddings-predictor.ai-ml.svc.cluster.local"
RERANKER_URL = "http://reranker-predictor.ai-ml.svc.cluster.local"
MILVUS_HOST = "milvus.ai-ml.svc.cluster.local"
MILVUS_PORT = 19530
VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"
```
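These hostnames only resolve inside the cluster. For local development it can help to make them overridable via the environment; a minimal sketch (the port-forward command is just one way to expose the service locally, and nothing here is mandated by the codebase):

```python
import os

# In-cluster defaults; override locally, e.g. after:
#   kubectl port-forward -n ai-ml svc/nats 4222:4222
NATS_URL = os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
MILVUS_HOST = os.environ.get("MILVUS_HOST", "milvus.ai-ml.svc.cluster.local")
MILVUS_PORT = int(os.environ.get("MILVUS_PORT", "19530"))

print(NATS_URL, MILVUS_HOST, MILVUS_PORT)
```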

## 📨 NATS Subject Patterns

```python
# Chat
f"ai.chat.user.{user_id}.message"        # User sends message
f"ai.chat.response.{request_id}"         # Response back
f"ai.chat.response.stream.{request_id}"  # Streaming tokens

# Voice
f"ai.voice.user.{user_id}.request"       # Voice input
f"ai.voice.response.{request_id}"        # Voice output

# Pipelines
"ai.pipeline.trigger"                    # Trigger any pipeline
f"ai.pipeline.status.{request_id}"       # Status updates
```
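The f-string patterns above can be wrapped in small helper functions so publishers and subscribers agree on subject names. A minimal sketch; the helper names are illustrative, not part of the codebase:

```python
import uuid

def chat_message_subject(user_id: str) -> str:
    """Subject the frontend publishes a user's chat message to."""
    return f"ai.chat.user.{user_id}.message"

def chat_stream_subject(request_id: str) -> str:
    """Subject the chat handler streams response tokens on."""
    return f"ai.chat.response.stream.{request_id}"

request_id = str(uuid.uuid4())
print(chat_message_subject("alice"))   # → ai.chat.user.alice.message
print(chat_stream_subject(request_id))
```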

## 🎮 GPU Allocation

| Node | GPU | Workload | Memory |
|------|-----|----------|--------|
| khelben | AMD Strix Halo | vLLM (dedicated) | 64GB unified |
| elminster | NVIDIA RTX 2070 | Whisper + XTTS | 8GB VRAM |
| drizzt | AMD Radeon 680M | BGE Embeddings | 12GB VRAM |
| danilo | Intel Arc | Reranker | 16GB shared |

## ⚡ Common Tasks

### Deploy a New AI Service

1. Create an InferenceService in `homelab-k8s2/kubernetes/apps/ai-ml/kserve/`
2. Add the endpoint to `llm-workflows/config/ai-services-config.yaml`
3. Push to main → Flux deploys automatically

### Add a New Workflow

1. Create the handler in `llm-workflows/chat-handler/` or `llm-workflows/voice-assistant/`
2. Add a Kubernetes Deployment in `llm-workflows/workflows/`
3. Push to main → Flux deploys automatically

### Create an Architecture Decision

1. Copy `decisions/0000-template.md` to `decisions/NNNN-title.md`
2. Fill in context, decision, consequences
3. Submit a PR

## ❌ Antipatterns to Avoid

1. **Don't hardcode secrets** - Use External Secrets Operator
2. **Don't use `latest` tags** - Pin versions for reproducibility
3. **Don't skip ADRs** - Document significant decisions
4. **Don't bypass Flux** - All changes via Git, never `kubectl apply` directly

## 📚 Where to Learn More

- [ARCHITECTURE.md](ARCHITECTURE.md) - Full system design
- [TECH-STACK.md](TECH-STACK.md) - All technologies used
- [decisions/](decisions/) - Why we made certain choices
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities

## 🆘 Quick Debugging

```bash
# Check Flux sync status
flux get all -A

# View NATS JetStream streams
kubectl exec -n ai-ml deploy/nats-box -- nats stream ls

# Check GPU allocation
kubectl describe node khelben | grep -A10 "Allocated"

# View KServe inference services
kubectl get inferenceservices -n ai-ml

# Tail AI service logs
kubectl logs -n ai-ml -l app=chat-handler -f
```

---

*This document is the canonical starting point for AI agents. When in doubt, check the ADRs.*

---

**ARCHITECTURE.md** (new file, 287 lines)

# 🏗️ System Architecture

> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**

## Overview

The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.

## System Layers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 USER LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐           │
│  │ Companions WebApp│  │   Voice WebApp   │  │   Kubeflow UI    │           │
│  │  HTMX + Alpine   │  │    Gradio UI     │  │  Pipeline Mgmt   │           │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘           │
│           │ WebSocket           │ HTTP/WS             │ HTTP                │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                INGRESS LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs                    │
│                                                                             │
│  External: *.daviestechlabs.io        Internal: *.lab.daviestechlabs.io     │
│  • git.daviestechlabs.io              • kubeflow.lab.daviestechlabs.io      │
│  • auth.daviestechlabs.io             • companions-chat.lab...              │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MESSAGE BUS LAYER                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                               NATS + JetStream                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Streams:                                                            │   │
│  │  • COMPANIONS_LOGINS (7d retention)  - User analytics               │   │
│  │  • COMPANIONS_CHAT   (30d retention) - Chat history                 │   │
│  │  • AI_CHAT_STREAM    (5min, memory)  - Ephemeral streaming          │   │
│  │  • AI_VOICE_STREAM   (1h, file)      - Voice processing             │   │
│  │  • AI_PIPELINE       (24h, file)     - Workflow triggers            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Message Format: MessagePack (binary, not JSON)                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                ┌─────────────────────┼─────────────────────┐
                ▼                     ▼                     ▼
     ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
     │   Chat Handler    │ │  Voice Assistant  │ │  Pipeline Bridge  │
     ├───────────────────┤ ├───────────────────┤ ├───────────────────┤
     │ • RAG retrieval   │ │ • STT (Whisper)   │ │ • KFP triggers    │
     │ • LLM inference   │ │ • RAG retrieval   │ │ • Argo triggers   │
     │ • Streaming resp  │ │ • LLM inference   │ │ • Status updates  │
     │ • Session state   │ │ • TTS (XTTS)      │ │ • Error handling  │
     └───────────────────┘ └───────────────────┘ └───────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              AI SERVICES LAYER                              │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │
│ │ Whisper │ │  XTTS   │ │  vLLM   │ │ Milvus  │ │   BGE   │ │Reranker │     │
│ │  (STT)  │ │  (TTS)  │ │  (LLM)  │ │  (RAG)  │ │ (Embed) │ │  (BGE)  │     │
│ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤     │
│ │ KServe  │ │ KServe  │ │  vLLM   │ │  Helm   │ │ KServe  │ │ KServe  │     │
│ │ nvidia  │ │ nvidia  │ │  ROCm   │ │  Minio  │ │  rdna2  │ │  intel  │     │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            WORKFLOW ENGINE LAYER                            │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌────────────────────────────┐      ┌────────────────────────────┐         │
│  │       Argo Workflows       │ ◄──► │    Kubeflow Pipelines      │         │
│  ├────────────────────────────┤      ├────────────────────────────┤         │
│  │ • Complex DAG orchestration│      │ • ML pipeline caching      │         │
│  │ • Training workflows       │      │ • Experiment tracking      │         │
│  │ • Document ingestion       │      │ • Model versioning         │         │
│  │ • Batch inference          │      │ • Artifact lineage         │         │
│  └────────────────────────────┘      └────────────────────────────┘         │
│                                                                             │
│  Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline)            │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INFRASTRUCTURE LAYER                             │
├─────────────────────────────────────────────────────────────────────────────┤
│  Storage:               Compute:                   Security:                │
│  ├─ Longhorn (block)    ├─ Volcano Scheduler       ├─ Vault (secrets)       │
│  ├─ NFS CSI (shared)    ├─ GPU Device Plugins      ├─ Authentik (SSO)       │
│  └─ MinIO (S3)          │  ├─ AMD ROCm             ├─ Falco (runtime)       │
│                         │  ├─ NVIDIA CUDA          └─ SOPS (GitOps)         │
│  Databases:             │  └─ Intel i915/Arc                                │
│  ├─ CloudNative-PG      └─ Node Feature Discovery                           │
│  ├─ Valkey (cache)                                                          │
│  └─ ClickHouse (analytics)                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               PLATFORM LAYER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  Talos Linux v1.12.1  │  Kubernetes v1.35.0  │  Cilium CNI                  │
│                                                                             │
│  Nodes: storm, bruenor, catti (control)  │  elminster, khelben, drizzt,     │
│                                          │  danilo (workers)                │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Node Topology

### Control Plane (HA)

| Node | IP | CPU | Memory | Storage | Role |
|------|-------|-----|--------|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |

**VIP**: 192.168.100.20 (shared across control plane)

### Worker Nodes

| Node | IP | CPU | GPU | GPU Memory | Workload |
|------|-------|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |

## Networking

### External Access

```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```

### DNS Zones

- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)

### Network CIDRs

| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |

## Data Flow: Chat Request

```mermaid
sequenceDiagram
    participant U as User
    participant W as WebApp
    participant N as NATS
    participant C as Chat Handler
    participant M as Milvus
    participant L as vLLM
    participant V as Valkey

    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver to chat-handler
    C->>V: Get session history
    C->>M: RAG query (if enabled)
    M-->>C: Relevant documents
    C->>L: LLM inference (with context)
    L-->>C: Streaming tokens
    C->>N: Publish ai.chat.response.stream.{id}
    N-->>W: Deliver streaming chunks
    W-->>U: Display tokens
    C->>V: Save to session
```
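The sequence above maps onto a straightforward async loop inside the chat handler. A sketch with the external calls (Valkey, Milvus, vLLM) stubbed out; the function names are illustrative, not the handler's actual API:

```python
import asyncio

async def get_session(user_id: str) -> list[str]:
    """Stub for the Valkey session-history lookup."""
    return []

async def rag_query(text: str) -> list[str]:
    """Stub for the Milvus similarity search."""
    return ["doc snippet"]

async def llm_stream(prompt: str):
    """Stub for the vLLM streaming completion."""
    for tok in ["Hello", " world"]:
        yield tok

async def handle_chat(user_id: str, message: str) -> str:
    history = await get_session(user_id)
    context = await rag_query(message)
    prompt = "\n".join(context + history + [message])
    chunks = []
    async for tok in llm_stream(prompt):
        chunks.append(tok)  # in production: publish ai.chat.response.stream.{id}
    return "".join(chunks)

print(asyncio.run(handle_chat("alice", "hi")))  # → Hello world
```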

## GitOps Flow

```
Developer → Git Push → GitHub/Gitea
                │
                ▼
         ┌─────────────┐
         │   Flux CD   │
         │ (reconcile) │
         └──────┬──────┘
                │
 ┌──────────────┼──────────────┐
 ▼              ▼              ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ homelab- │ │   llm-   │ │   helm   │
│   k8s2   │ │workflows │ │  charts  │
└──────────┘ └──────────┘ └──────────┘
      │            │            │
      └────────────┴────────────┘
                │
                ▼
         ┌─────────────┐
         │ Kubernetes  │
         │   Cluster   │
         └─────────────┘
```

## Security Architecture

### Secrets Management

```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```

### Authentication

```
User ──► Cloudflare Access ──► Authentik ──► Application
                                   │
                                   └──► OIDC/SAML providers
```

### Network Security

- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions

## High Availability

### Control Plane

- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos

### Workloads

- Pod anti-affinity for critical services
- HPA for auto-scaling
- PodDisruptionBudgets for controlled updates

### Storage

- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups

## Observability

### Metrics Pipeline

```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```

### Logging Pipeline

```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```

### Tracing Pipeline

```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```

## Key Design Decisions

| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions

---

**CODING-CONVENTIONS.md** (new file, 424 lines)

# 📐 Coding Conventions

> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**

## Repository Conventions

### homelab-k8s2 (Infrastructure)

```
kubernetes/
├── apps/                      # Application deployments
│   └── {namespace}/           # One folder per namespace
│       └── {app}/             # One folder per application
│           ├── app/           # Kubernetes manifests
│           │   ├── kustomization.yaml
│           │   ├── helmrelease.yaml   # OR individual manifests
│           │   └── ...
│           └── ks.yaml        # Flux Kustomization
├── components/                # Reusable Kustomize components
└── flux/                      # Flux system configuration
```

**Naming Conventions:**
- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)

### llm-workflows (Orchestration)

```
workflows/                     # Kubernetes Deployments for NATS handlers
├── {handler}.yaml             # One file per handler

argo/                          # Argo WorkflowTemplates
├── {workflow-name}.yaml       # One file per workflow

pipelines/                     # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py     # Pipeline definition
└── kfp-sync-job.yaml          # Upload job

{handler}/                     # Python source code
├── __init__.py
├── {handler}.py               # Main entry point
├── requirements.txt
└── Dockerfile
```

---

## Python Conventions

### Project Structure

```python
from dataclasses import dataclass

import msgpack
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
data = msgpack.packb({"key": "value"})
```

### Naming

| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |

### Type Hints

```python
# Always use type hints
from typing import Any, Dict, List

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...
```

### Error Handling

```python
# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when a RAG query fails."""

# Log errors with context
import logging

logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error(f"RAG query failed: {e}", extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
```

### NATS Message Handling

```python
import msgpack
from nats.aio.msg import Msg

async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)

        # Process
        result = await process(data)

        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))

        # Acknowledge for JetStream
        await msg.ack()

    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
```

---

## Kubernetes Manifest Conventions

### Labels

```yaml
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform

    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
```

### Annotations

```yaml
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"

    # Documentation
    description: "Handles chat messages via NATS"
```

### Resource Requests

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads
resources:
  limits:
    amd.com/gpu: 1     # AMD
    nvidia.com/gpu: 1  # NVIDIA
```

### Health Checks

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## Flux/GitOps Conventions

### Kustomization Structure

```yaml
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
```

### HelmRelease Structure

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
```

### Secret References

```yaml
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
```

---

## NATS Subject Conventions

### Hierarchy

```
ai.{domain}.{scope}.{action}

Examples:
ai.chat.user.{userId}.message    # User chat message
ai.chat.response.{requestId}     # Chat response
ai.voice.user.{userId}.request   # Voice request
ai.pipeline.trigger              # Pipeline trigger
```

### Wildcards

```
ai.chat.>               # All chat events
ai.chat.user.*.message  # All user messages
ai.*.response.{id}      # Any response type
```
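For reference, `*` matches exactly one token and `>` matches one or more trailing tokens. A minimal matcher sketch makes the semantics concrete; it is illustrative only, not part of the codebase and no substitute for the NATS server's own matching:

```python
def matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' = one token, '>' = one or more trailing tokens."""
    pt, st = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pt):
        if tok == ">":
            return len(st) > i  # '>' must match at least one token
        if i >= len(st):
            return False
        if tok != "*" and tok != st[i]:
            return False
    return len(pt) == len(st)

print(matches("ai.chat.>", "ai.chat.user.42.message"))  # → True
print(matches("ai.chat.>", "ai.chat"))                  # → False
```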

---

## Git Conventions

### Commit Messages

```
type(scope): subject

body (optional)

footer (optional)
```

**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `style`: Formatting
- `refactor`: Code restructuring
- `test`: Tests
- `chore`: Maintenance

**Examples:**

```
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
```

### Branch Naming

```
feature/short-description
fix/issue-number-description
docs/what-changed
```

---

## Configuration Conventions

### Environment Variables

```python
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"

    class Config:
        env_prefix = ""  # No prefix
```

### ConfigMaps

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
```
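A minimal sketch of how a handler might consume these ConfigMap keys at startup, falling back to local defaults when a variable is unset. The `load_config` function and its fallback values are illustrative assumptions, not the project's actual loader; only the key names mirror the ConfigMap above:

```python
import os

def load_config(environ=os.environ) -> dict:
    """Read non-sensitive settings from the environment (ConfigMap keys)."""
    return {
        # Keys match the ConfigMap data above; defaults suit local dev.
        "nats_url": environ.get("NATS_URL", "nats://localhost:4222"),
        "vllm_url": environ.get("VLLM_URL", "http://localhost:8000/v1"),
    }
```

Passing `environ` explicitly keeps the loader testable without mutating the process environment.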

---

## Documentation Conventions

### ADR Format

See [decisions/0000-template.md](decisions/0000-template.md)

### Code Comments

```python
from typing import Dict, List

# Use docstrings for public functions
async def query_rag(query: str) -> List[Dict]:
    """
    Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
```

### README Files

Each application should have a README with:

1. Purpose
2. Configuration
3. Deployment
4. Local development
5. API documentation (if applicable)

---

## Anti-Patterns to Avoid

| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
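The last row deserves a sketch: a handler that awaits its downstream calls lets one event loop serve many messages concurrently, whereas synchronous I/O stalls everything behind it. The function names here are illustrative stand-ins, not the project's actual handler API:

```python
import asyncio

async def process_message(payload: bytes) -> bytes:
    # Stand-in for a non-blocking downstream call (e.g. an inference
    # request); a synchronous HTTP call here would block the whole loop.
    await asyncio.sleep(0)
    return payload.upper()

async def handle_batch(payloads: list[bytes]) -> list[bytes]:
    # All messages are processed concurrently on one event loop.
    return await asyncio.gather(*(process_message(p) for p in payloads))

if __name__ == "__main__":
    print(asyncio.run(handle_batch([b"ping", b"pong"])))
```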

---

## Related Documents

- [TECH-STACK.md](TECH-STACK.md) - Technologies used
- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
- [decisions/](decisions/) - Why we made certain choices

123 CONTAINER-DIAGRAM.mmd Normal file
@@ -0,0 +1,123 @@
%% C4 Container Diagram - Level 2
%% DaviesTechLabs Homelab AI/ML Platform
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension

graph TB
    subgraph users["Users"]
        user["👤 User"]
    end

    subgraph ingress["Ingress Layer"]
        cloudflared["cloudflared<br/>(Tunnel)"]
        envoy["Envoy Gateway<br/>(HTTPRoute)"]
    end

    subgraph frontends["Frontend Applications"]
        companions["Companions WebApp<br/>[Go + HTMX]<br/>AI Chat Interface"]
        voice["Voice WebApp<br/>[Gradio]<br/>Voice Assistant UI"]
        kubeflow_ui["Kubeflow UI<br/>[React]<br/>Pipeline Management"]
    end

    subgraph messaging["Message Bus"]
        nats["NATS<br/>[JetStream]<br/>Event Streaming"]
    end

    subgraph handlers["NATS Handlers"]
        chat_handler["Chat Handler<br/>[Python]<br/>RAG + LLM Orchestration"]
        voice_handler["Voice Assistant<br/>[Python]<br/>STT → LLM → TTS"]
        pipeline_bridge["Pipeline Bridge<br/>[Python]<br/>Workflow Triggers"]
    end

    subgraph ai_services["AI Services (KServe)"]
        whisper["Whisper<br/>[faster-whisper]<br/>Speech-to-Text"]
        xtts["XTTS<br/>[Coqui]<br/>Text-to-Speech"]
        vllm["vLLM<br/>[ROCm]<br/>LLM Inference"]
        embeddings["BGE Embeddings<br/>[sentence-transformers]<br/>Vector Encoding"]
        reranker["BGE Reranker<br/>[sentence-transformers]<br/>Document Ranking"]
    end

    subgraph storage["Data Stores"]
        milvus["Milvus<br/>[Vector DB]<br/>RAG Storage"]
        valkey["Valkey<br/>[Redis API]<br/>Session Cache"]
        postgres["CloudNative-PG<br/>[PostgreSQL]<br/>Metadata"]
        minio["MinIO<br/>[S3 API]<br/>Object Storage"]
    end

    subgraph workflows["Workflow Engines"]
        argo["Argo Workflows<br/>[DAG Engine]<br/>Complex Pipelines"]
        kfp["Kubeflow Pipelines<br/>[ML Platform]<br/>Training + Inference"]
        argo_events["Argo Events<br/>[Event Source]<br/>NATS → Workflow"]
    end

    subgraph mlops["MLOps"]
        mlflow["MLflow<br/>[Tracking Server]<br/>Experiment Tracking"]
        volcano["Volcano<br/>[Scheduler]<br/>GPU Scheduling"]
    end

    %% User flow
    user --> cloudflared
    cloudflared --> envoy
    envoy --> companions
    envoy --> voice
    envoy --> kubeflow_ui

    %% Frontend to NATS
    companions --> |WebSocket| nats
    voice --> |HTTP/WS| nats

    %% NATS to handlers
    nats --> chat_handler
    nats --> voice_handler
    nats --> pipeline_bridge

    %% Handlers to AI services
    chat_handler --> embeddings
    chat_handler --> reranker
    chat_handler --> vllm
    chat_handler --> milvus
    chat_handler --> valkey

    voice_handler --> whisper
    voice_handler --> embeddings
    voice_handler --> reranker
    voice_handler --> vllm
    voice_handler --> xtts

    %% Pipeline flow
    pipeline_bridge --> argo_events
    argo_events --> argo
    argo_events --> kfp
    kubeflow_ui --> kfp

    %% Workflow to AI
    argo --> ai_services
    kfp --> ai_services
    kfp --> mlflow

    %% Storage connections
    ai_services --> minio
    milvus --> minio
    kfp --> postgres
    mlflow --> postgres
    mlflow --> minio

    %% GPU scheduling
    volcano -.-> vllm
    volcano -.-> whisper
    volcano -.-> xtts

    %% Styling
    classDef frontend fill:#90EE90,stroke:#333
    classDef handler fill:#87CEEB,stroke:#333
    classDef ai fill:#FFB6C1,stroke:#333
    classDef storage fill:#DDA0DD,stroke:#333
    classDef workflow fill:#F0E68C,stroke:#333
    classDef messaging fill:#FFA500,stroke:#333

    class companions,voice,kubeflow_ui frontend
    class chat_handler,voice_handler,pipeline_bridge handler
    class whisper,xtts,vllm,embeddings,reranker ai
    class milvus,valkey,postgres,minio storage
    class argo,kfp,argo_events,mlflow,volcano workflow
    class nats messaging
69 CONTEXT-DIAGRAM.mmd Normal file
@@ -0,0 +1,69 @@
%% C4 Context Diagram - Level 1
%% DaviesTechLabs Homelab System Context
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension

graph TB
    subgraph users["External Users"]
        dev["👤 Developer<br/>(Billy)"]
        family["👥 Family Members"]
        agents["🤖 AI Agents"]
    end

    subgraph external["External Systems"]
        cf["☁️ Cloudflare<br/>DNS + Tunnel"]
        gh["🐙 GitHub<br/>Source Code"]
        ghcr["📦 GHCR<br/>Container Registry"]
        hf["🤗 Hugging Face<br/>Model Registry"]
    end

    subgraph homelab["🏠 DaviesTechLabs Homelab"]
        direction TB

        subgraph apps["Application Layer"]
            companions["💬 Companions<br/>AI Chat"]
            voice["🎤 Voice Assistant"]
            media["🎬 Media Services<br/>(Jellyfin, *arr)"]
            productivity["📝 Productivity<br/>(Nextcloud, Gitea)"]
        end

        subgraph platform["Platform Layer"]
            k8s["☸️ Kubernetes Cluster<br/>Talos Linux"]
        end

        subgraph ai["AI/ML Layer"]
            inference["🧠 Inference Services<br/>(vLLM, Whisper, XTTS)"]
            workflows["⚙️ Workflow Engines<br/>(Kubeflow, Argo)"]
            vectordb["📚 Vector Store<br/>(Milvus)"]
        end
    end

    %% User interactions
    dev --> |manages| productivity
    dev --> |develops| k8s
    family --> |uses| media
    family --> |chats| companions
    agents --> |queries| inference

    %% External integrations
    cf --> |routes traffic| apps
    gh --> |GitOps sync| k8s
    ghcr --> |pulls images| k8s
    hf --> |downloads models| inference

    %% Internal relationships
    apps --> platform
    ai --> platform
    companions --> inference
    voice --> inference
    workflows --> inference
    inference --> vectordb

    %% Styling
    classDef external fill:#f9f,stroke:#333,stroke-width:2px
    classDef homelab fill:#bbf,stroke:#333,stroke-width:2px
    classDef user fill:#bfb,stroke:#333,stroke-width:2px

    class cf,gh,ghcr,hf external
    class companions,voice,media,productivity,k8s,inference,workflows,vectordb homelab
    class dev,family,agents user
345 DOMAIN-MODEL.md Normal file
@@ -0,0 +1,345 @@
# 📊 Domain Model

> **Core entities, bounded contexts, and relationships in the DaviesTechLabs homelab**

## Bounded Contexts

```
┌─────────────────────────────────────────────────────────────────────────┐
│                            BOUNDED CONTEXTS                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐    │
│  │   CHAT CONTEXT    │  │   VOICE CONTEXT   │  │ WORKFLOW CONTEXT  │    │
│  ├───────────────────┤  ├───────────────────┤  ├───────────────────┤    │
│  │ • ChatSession     │  │ • VoiceSession    │  │ • Pipeline        │    │
│  │ • ChatMessage     │  │ • AudioChunk      │  │ • PipelineRun     │    │
│  │ • Conversation    │  │ • Transcription   │  │ • Artifact        │    │
│  │ • User            │  │ • SynthesizedAudio│  │ • Experiment      │    │
│  └─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘    │
│            │                      │                      │              │
│            └──────────────────────┼──────────────────────┘              │
│                                   │                                     │
│                                   ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                         INFERENCE CONTEXT                         │  │
│  ├───────────────────────────────────────────────────────────────────┤  │
│  │ • InferenceRequest  • Model  • Embedding  • Document  • Chunk     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Core Entities

### User Context

```yaml
User:
  id: string (UUID)
  username: string
  premium: boolean
  preferences:
    voice_id: string
    model_preference: string
    enable_rag: boolean
  created_at: timestamp

Session:
  id: string (UUID)
  user_id: string
  type: "chat" | "voice"
  started_at: timestamp
  last_activity: timestamp
  metadata: object
```

### Chat Context

```yaml
ChatMessage:
  id: string (UUID)
  session_id: string
  user_id: string
  role: "user" | "assistant" | "system"
  content: string
  created_at: timestamp
  metadata:
    tokens_used: integer
    latency_ms: float
    rag_sources: string[]
    model_used: string

Conversation:
  id: string (UUID)
  user_id: string
  messages: ChatMessage[]
  title: string (auto-generated)
  created_at: timestamp
  updated_at: timestamp
```

### Voice Context

```yaml
VoiceRequest:
  id: string (UUID)
  user_id: string
  audio_b64: string (base64)
  format: "wav" | "webm" | "mp3"
  language: string
  premium: boolean
  enable_rag: boolean

VoiceResponse:
  id: string (UUID)
  request_id: string
  transcription: string
  response_text: string
  audio_b64: string (base64)
  audio_format: string
  latency_ms: float
  rag_docs_used: integer
```

### Inference Context

```yaml
InferenceRequest:
  id: string (UUID)
  service: "llm" | "stt" | "tts" | "embeddings" | "reranker"
  input: string | bytes
  parameters: object
  priority: "standard" | "premium"

InferenceResponse:
  id: string (UUID)
  request_id: string
  output: string | bytes | float[]
  metadata:
    model: string
    latency_ms: float
    tokens: integer (if applicable)
```

### RAG Context

```yaml
Document:
  id: string (UUID)
  collection: string
  title: string
  content: string
  source_url: string
  ingested_at: timestamp

Chunk:
  id: string (UUID)
  document_id: string
  content: string
  embedding: float[1024]  # BGE-large dimensions
  metadata:
    position: integer
    page: integer

RAGQuery:
  query: string
  collection: string
  top_k: integer (default: 5)
  rerank: boolean (default: true)
  rerank_top_k: integer (default: 3)

RAGResult:
  chunks: Chunk[]
  scores: float[]
  reranked: boolean
```
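The `RAGQuery` fields above imply a two-stage selection: retrieve `top_k` candidates by vector score, then keep `rerank_top_k` after rescoring. A hedged sketch of that logic, where `select_chunks` and the placeholder `rerank_fn` are illustrative assumptions rather than the actual handler code:

```python
def select_chunks(candidates, top_k=5, rerank=True, rerank_top_k=3,
                  rerank_fn=None):
    """Two-stage retrieval: top_k by vector score, then rerank_top_k.

    candidates: list of (chunk_text, vector_score) pairs.
    rerank_fn:  stand-in scorer for the BGE reranker (text -> score).
    """
    # Stage 1: take the top_k candidates by raw vector similarity.
    retrieved = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    if not rerank:
        return retrieved
    # Stage 2: rescore the survivors and keep the rerank_top_k best.
    rerank_fn = rerank_fn or (lambda text: len(text))  # placeholder scorer
    rescored = [(text, rerank_fn(text)) for text, _ in retrieved]
    return sorted(rescored, key=lambda c: c[1], reverse=True)[:rerank_top_k]
```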

### Workflow Context

```yaml
Pipeline:
  id: string
  name: string
  version: string
  engine: "kubeflow" | "argo"
  definition: object (YAML)

PipelineRun:
  id: string (UUID)
  pipeline_id: string
  status: "pending" | "running" | "succeeded" | "failed"
  started_at: timestamp
  completed_at: timestamp
  parameters: object
  artifacts: Artifact[]

Artifact:
  id: string (UUID)
  run_id: string
  name: string
  type: "model" | "dataset" | "metrics" | "logs"
  uri: string (s3://)
  metadata: object

Experiment:
  id: string (UUID)
  name: string
  runs: PipelineRun[]
  metrics: object
  created_at: timestamp
```

---

## Entity Relationships

```mermaid
erDiagram
    USER ||--o{ SESSION : has
    USER ||--o{ CONVERSATION : owns
    SESSION ||--o{ CHAT_MESSAGE : contains
    CONVERSATION ||--o{ CHAT_MESSAGE : contains

    USER ||--o{ VOICE_REQUEST : makes
    VOICE_REQUEST ||--|| VOICE_RESPONSE : produces

    DOCUMENT ||--o{ CHUNK : contains
    CHUNK }|--|| EMBEDDING : has

    PIPELINE ||--o{ PIPELINE_RUN : executed_as
    PIPELINE_RUN ||--o{ ARTIFACT : produces
    EXPERIMENT ||--o{ PIPELINE_RUN : tracks

    INFERENCE_REQUEST }|--|| INFERENCE_RESPONSE : produces
```

---

## Aggregate Roots

| Aggregate | Root Entity | Child Entities |
|-----------|-------------|----------------|
| Chat | Conversation | ChatMessage |
| Voice | VoiceRequest | VoiceResponse |
| RAG | Document | Chunk, Embedding |
| Workflow | PipelineRun | Artifact |
| User | User | Session, Preferences |

---

## Event Flow

### Chat Event Stream

```
UserLogin
 └─► SessionCreated
      └─► MessageReceived
           ├─► RAGQueryExecuted (optional)
           ├─► InferenceRequested
           └─► ResponseGenerated
                └─► MessageStored
```

### Voice Event Stream

```
VoiceRequestReceived
 └─► TranscriptionStarted
      └─► TranscriptionCompleted
           └─► RAGQueryExecuted (optional)
                └─► LLMInferenceStarted
                     └─► LLMResponseGenerated
                          └─► TTSSynthesisStarted
                               └─► AudioResponseReady
```

### Workflow Event Stream

```
PipelineTriggerReceived
 └─► PipelineRunCreated
      └─► StepStarted (repeated)
           └─► StepCompleted (repeated)
                └─► ArtifactProduced (repeated)
                     └─► PipelineRunCompleted
```

---

## Data Retention

| Entity | Retention | Storage |
|--------|-----------|---------|
| ChatMessage | 30 days | JetStream → PostgreSQL |
| VoiceRequest/Response | 1 hour (audio), 30 days (text) | JetStream → PostgreSQL |
| Chunk/Embedding | Permanent | Milvus |
| PipelineRun | Permanent | PostgreSQL |
| Artifact | Permanent | MinIO |
| Session | 7 days | Valkey |

---

## Invariants

### Chat Context
- A ChatMessage must belong to exactly one Conversation
- A Conversation must have at least one ChatMessage
- Messages are immutable once created

### Voice Context
- A VoiceResponse must have a corresponding VoiceRequest
- Audio format must be one of: wav, webm, mp3
- Transcription cannot be empty for valid audio

### RAG Context
- A Chunk must belong to exactly one Document
- Embedding dimensions must match the model (1024 for BGE-large)
- A Document must have at least one Chunk

### Workflow Context
- A PipelineRun must reference a valid Pipeline
- Artifacts must have valid S3 URIs
- Run status transitions: pending → running → (succeeded | failed)
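The status-transition invariant can be enforced with a small allowed-transitions table. This is a sketch of the rule as stated, not the project's actual implementation; the `transition` helper is illustrative:

```python
# Legal PipelineRun status transitions, per the invariant above:
# pending -> running -> (succeeded | failed); terminal states go nowhere.
ALLOWED = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
}

def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is illegal."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```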

---

## Value Objects

```python
from dataclasses import dataclass

# Immutable value objects
@dataclass(frozen=True)
class MessageContent:
    text: str
    tokens: int

@dataclass(frozen=True)
class AudioData:
    data: bytes
    format: str
    duration_ms: int
    sample_rate: int

@dataclass(frozen=True)
class EmbeddingVector:
    values: tuple[float, ...]
    model: str
    dimensions: int

@dataclass(frozen=True)
class RAGContext:
    chunks: tuple[str, ...]
    scores: tuple[float, ...]
    query: str
```
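What `frozen=True` buys: attribute assignment raises, so a value object can never drift after construction. A self-contained demonstration (restating `MessageContent` so the snippet runs on its own):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MessageContent:
    text: str
    tokens: int

msg = MessageContent(text="hello", tokens=1)
try:
    msg.text = "changed"  # frozen dataclasses reject mutation
except FrozenInstanceError:
    # Construct a new value instead of mutating the old one.
    msg = MessageContent(text="changed", tokens=msg.tokens)
```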

---

## Related Documents

- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
- [GLOSSARY.md](GLOSSARY.md) - Term definitions
- [decisions/0004-use-messagepack-for-nats.md](decisions/0004-use-messagepack-for-nats.md) - Message format decision
242 GLOSSARY.md Normal file
@@ -0,0 +1,242 @@
# 📖 Glossary

> **Terminology and abbreviations used in the DaviesTechLabs homelab**

## A

**ADR (Architecture Decision Record)**
: A document that captures an important architectural decision, including context, decision, and consequences.

**Argo Events**
: Event-driven automation for Kubernetes that triggers workflows based on events from various sources.

**Argo Workflows**
: A container-native workflow engine for orchestrating parallel jobs on Kubernetes.

**Authentik**
: Self-hosted identity provider supporting SAML, OIDC, and other protocols.

## B

**BGE (BAAI General Embedding)**
: A family of embedding models from BAAI used for semantic search and RAG.

**Bounded Context**
: A DDD concept defining a boundary within which a particular domain model applies.

## C

**C4 Model**
: A hierarchical approach to software architecture diagrams: Context, Container, Component, Code.

**Cilium**
: eBPF-based networking, security, and observability for Kubernetes.

**CloudNative-PG**
: Kubernetes operator for PostgreSQL databases.

**CNI (Container Network Interface)**
: Standard for configuring network interfaces in Linux containers.

## D

**DDD (Domain-Driven Design)**
: Software design approach focusing on the core domain and domain logic.

## E

**Embedding**
: A vector representation of text, used for semantic similarity and search.

**Envoy Gateway**
: Kubernetes Gateway API implementation using Envoy proxy.

**External Secrets Operator (ESO)**
: Kubernetes operator that syncs secrets from external stores (Vault, etc.).

## F

**Falco**
: Runtime security tool that detects anomalous activity in containers.

**Flux CD**
: GitOps toolkit for Kubernetes, continuously reconciling cluster state with Git.

## G

**GitOps**
: Operational practice using Git as the single source of truth for declarative infrastructure.

**GPU Device Plugin**
: Kubernetes plugin that exposes GPU resources to containers.

## H

**HelmRelease**
: Flux CRD for managing Helm chart releases declaratively.

**HTTPRoute**
: Kubernetes Gateway API resource for HTTP routing rules.

## I

**InferenceService**
: KServe CRD for deploying ML models with autoscaling and traffic management.

## J

**JetStream**
: NATS persistence layer providing streaming, key-value, and object stores.

## K

**KServe**
: Kubernetes-native platform for deploying and serving ML models.

**Kubeflow**
: ML toolkit for Kubernetes, including pipelines, training operators, and more.

**Kustomization**
: Flux CRD for applying Kustomize overlays from Git sources.

## L

**LLM (Large Language Model)**
: AI model trained on vast text data, capable of generating human-like text.

**Longhorn**
: Cloud-native distributed storage for Kubernetes.

## M

**MessagePack (msgpack)**
: Binary serialization format, more compact than JSON.

**Milvus**
: Open-source vector database for similarity search and AI applications.

**MLflow**
: Platform for managing the ML lifecycle: experiments, models, deployment.

**MinIO**
: S3-compatible object storage.

## N

**NATS**
: Cloud-native messaging system for microservices, IoT, and serverless.

**Node Feature Discovery (NFD)**
: Kubernetes add-on for detecting hardware features on nodes.

## P

**Pipeline**
: In an ML context, a DAG of components that process data and train/serve models.

**Premium User**
: User tier with enhanced features (more RAG docs, priority routing).

## R

**RAG (Retrieval-Augmented Generation)**
: AI technique combining document retrieval with LLM generation for grounded responses.

**Reranker**
: Model that rescores retrieved documents based on relevance to a query.

**ROCm**
: AMD's open-source GPU computing platform (alternative to CUDA).

## S

**Schematic**
: Talos Linux concept for defining system extensions and configurations.

**SOPS (Secrets OPerationS)**
: Tool for encrypting secrets in Git repositories.

**STT (Speech-to-Text)**
: Converting spoken audio to text (e.g., Whisper).

**Strix Halo**
: AMD APU with a unified memory architecture that gives the GPU access to a large shared memory pool.

## T

**Talos Linux**
: Minimal, immutable Linux distribution designed specifically for Kubernetes.

**TTS (Text-to-Speech)**
: Converting text to spoken audio (e.g., XTTS/Coqui).

## V

**Valkey**
: Redis-compatible in-memory data store (Redis fork).

**vLLM**
: High-throughput LLM serving engine with PagedAttention.

**VIP (Virtual IP)**
: IP address shared among multiple hosts for high availability.

**Volcano**
: Kubernetes batch scheduler for high-performance workloads (ML, HPC).

**VRM**
: File format for 3D humanoid avatars.

## W

**Whisper**
: OpenAI's speech recognition model.

## X

**XTTS**
: Coqui's multi-language text-to-speech model with voice cloning.

---

## Acronyms Quick Reference

| Acronym | Full Form |
|---------|-----------|
| ADR | Architecture Decision Record |
| API | Application Programming Interface |
| BGE | BAAI General Embedding |
| CI/CD | Continuous Integration/Continuous Deployment |
| CRD | Custom Resource Definition |
| DAG | Directed Acyclic Graph |
| DDD | Domain-Driven Design |
| ESO | External Secrets Operator |
| GPU | Graphics Processing Unit |
| HA | High Availability |
| HPA | Horizontal Pod Autoscaler |
| LLM | Large Language Model |
| ML | Machine Learning |
| NATS | Neural Autonomic Transport System |
| NFD | Node Feature Discovery |
| OIDC | OpenID Connect |
| RAG | Retrieval-Augmented Generation |
| RBAC | Role-Based Access Control |
| ROCm | Radeon Open Compute |
| S3 | Simple Storage Service |
| SAML | Security Assertion Markup Language |
| SOPS | Secrets OPerationS |
| SSO | Single Sign-On |
| STT | Speech-to-Text |
| TLS | Transport Layer Security |
| TTS | Text-to-Speech |
| UUID | Universally Unique Identifier |
| VIP | Virtual IP |
| VRAM | Video Random Access Memory |

---

## Related Documents

- [ARCHITECTURE.md](ARCHITECTURE.md) - System overview
- [TECH-STACK.md](TECH-STACK.md) - Technology details
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Entity definitions
106 README.md
@@ -1,3 +1,105 @@
-# homelab-design
-homelab design process goes here.

# 🏠 DaviesTechLabs Homelab Architecture

> **Production-grade AI/ML platform running on bare-metal Kubernetes**
|
||||||
|
|
||||||
|
[](https://talos.dev)
|
||||||
|
[](https://kubernetes.io)
|
||||||
|
[](https://fluxcd.io)
|
||||||
|
[](LICENSE)
|
||||||
|
|
||||||
|
## 📖 Quick Navigation
|
||||||
|
|
||||||
|
| Document | Purpose |
|
||||||
|
|----------|---------|
|
||||||
|
| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
|
||||||
|
| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
|
||||||
|
| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
|
||||||
|
| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
|
||||||
|
| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
|
||||||
|
| [decisions/](decisions/) | Architecture Decision Records (ADRs) |

## 🎯 What This Is

A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:

- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
- **GitOps**: Flux CD with SOPS encryption
- **Event-Driven**: NATS JetStream for real-time messaging
- **ML Workflows**: Kubeflow Pipelines + Argo Workflows

## 🖥️ Cluster Overview

| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |

## 🚀 Quick Start

### View Current Cluster State

```bash
# Get node status
kubectl get nodes -o wide

# View AI/ML workloads
kubectl get pods -n ai-ml

# Check KServe inference services
kubectl get inferenceservices -n ai-ml
```

### Key Endpoints

| Service | URL | Purpose |
|---------|-----|---------|
| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
| Gitea | `git.daviestechlabs.io` | Self-hosted Git |

## 📂 Repository Structure

```
homelab-design/
├── README.md                 # This file
├── AGENT-ONBOARDING.md       # AI agent quick-start
├── ARCHITECTURE.md           # High-level system overview
├── CONTEXT-DIAGRAM.mmd       # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd     # C4 Level 2
├── TECH-STACK.md             # Complete tech stack
├── DOMAIN-MODEL.md           # Core entities
├── CODING-CONVENTIONS.md     # Patterns & practices
├── GLOSSARY.md               # Terminology
├── decisions/                # ADRs
│   ├── 0000-template.md
│   ├── 0001-record-architecture-decisions.md
│   ├── 0002-use-talos-linux.md
│   └── ...
├── specs/                    # Feature specifications
└── diagrams/                 # Additional diagrams
```

## 🔗 Related Repositories

| Repository | Purpose |
|------------|---------|
| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |

## 📝 Contributing

1. For architecture changes, create an ADR in `decisions/`
2. Update relevant documentation
3. Submit a PR with context

---

*Last updated: 2026-02-01*
271
TECH-STACK.md
Normal file
@@ -0,0 +1,271 @@
# 🛠️ Technology Stack

> **Complete inventory of technologies used in the DaviesTechLabs homelab**

## Platform Layer

### Operating System

| Component | Version | Purpose |
|-----------|---------|---------|
| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |

### Container Orchestration

| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |

### GitOps

| Component | Version | Purpose |
|-----------|---------|---------|
| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |
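Together, these three pieces keep secrets encrypted in Git and let Flux decrypt them at apply time. A minimal sketch of a Flux `Kustomization` wired for SOPS decryption — the name `apps`, the path, and the `sops-age` secret are illustrative, not taken from the actual repo:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                  # illustrative name
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: homelab-k8s2
  path: ./kubernetes/apps     # hypothetical path
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-age          # Secret holding the Age private key
```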

---

## AI/ML Layer

### Inference Engines

| Service | Framework | GPU | Model Type |
|---------|-----------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |

### ML Serving

| Component | Version | Purpose |
|-----------|---------|---------|
| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
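KServe declares each model as an `InferenceService` resource. A hedged sketch of what one might look like here — the service name, image, and namespace are illustrative, not copied from the cluster:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge-embeddings                        # illustrative name
  namespace: ai-ml
spec:
  predictor:
    containers:
      - name: kserve-container
        image: ghcr.io/example/bge-embed:latest   # hypothetical image
        resources:
          limits:
            amd.com/gpu: "1"                  # ROCm GPU via the AMD device plugin
```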

### ML Workflows

| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |

### GPU Scheduling

| Component | Version | Purpose |
|-----------|---------|---------|
| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |

---

## Data Layer

### Databases

| Component | Version | Purpose |
|-----------|---------|---------|
| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata |
| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |

### Object Storage

| Component | Version | Purpose |
|-----------|---------|---------|
| [MinIO](https://min.io) | Latest | S3-compatible storage |
| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
| NFS CSI Driver | Latest | Shared filesystem |

### Messaging

| Component | Version | Purpose |
|-----------|---------|---------|
| [NATS](https://nats.io) | Latest | Message bus |
| NATS JetStream | Built-in | Persistent streaming |
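JetStream persistence is configured per stream, each capturing a subject space for replay. A hedged sketch of a stream definition (stream name, subjects, and limits are illustrative — the real configuration lives in `specs/`):

```json
{
  "name": "CHAT",
  "subjects": ["chat.>"],
  "retention": "limits",
  "storage": "file",
  "max_age": 86400000000000,
  "max_msgs": -1
}
```

Here `max_age` is in nanoseconds (24 hours), and `chat.>` uses the NATS wildcard that matches all remaining subject tokens.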

### Data Processing

| Component | Version | Purpose |
|-----------|---------|---------|
| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
| [Nessie](https://projectnessie.org) | Latest | Data catalog |
| [Trino](https://trino.io) | 479 | SQL query engine |

---

## Application Layer

### Web Frameworks

| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
| Companions | Go | net/http + HTMX | AI chat interface |
| Voice WebApp | Python | Gradio | Voice assistant UI |
| Various handlers | Python | asyncio + nats.py | NATS event handlers |

### Frontend

| Technology | Purpose |
|------------|---------|
| [HTMX](https://htmx.org) | Dynamic HTML updates |
| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
| [VRM](https://vrm.dev) | 3D avatar rendering |

---

## Networking Layer

### Ingress

| Component | Version | Purpose |
|-----------|---------|---------|
| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |

### DNS & Certificates

| Component | Version | Purpose |
|-----------|---------|---------|
| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |

### Image Distribution

| Component | Purpose |
|-----------|---------|
| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |

---

## Security Layer

### Identity & Access

| Component | Version | Purpose |
|-----------|---------|---------|
| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |

### Runtime Security

| Component | Version | Purpose |
|-----------|---------|---------|
| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
| Cilium Network Policies | Built-in | Network segmentation |

### Backup

| Component | Version | Purpose |
|-----------|---------|---------|
| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |

---

## Observability Layer

### Metrics

| Component | Purpose |
|-----------|---------|
| [Prometheus](https://prometheus.io) | Metrics collection |
| [Grafana](https://grafana.com) | Dashboards & visualization |

### Logging

| Component | Version | Purpose |
|-----------|---------|---------|
| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |

### Tracing

| Component | Purpose |
|-----------|---------|
| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
| Tempo/Jaeger | Trace storage & query |

---

## Development Tools

### Local Development

| Tool | Purpose |
|------|---------|
| [mise](https://mise.jdx.dev) | Tool version management |
| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |
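These tools chain together in the local loop: mise pins the tool versions, and Task wraps flux-local so manifests are validated before pushing. A hedged `Taskfile.yaml` sketch (the task name, path, and exact flux-local flags are illustrative):

```yaml
version: "3"

tasks:
  validate:
    desc: Render and test Flux kustomizations locally before pushing
    cmds:
      - flux-local test --path clusters/main   # path and flags are illustrative
```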

### CI/CD

| Tool | Purpose |
|------|---------|
| GitHub Actions | CI/CD pipelines |
| [Renovate](https://renovatebot.com) | Dependency updates |

### Image Building

| Tool | Purpose |
|------|---------|
| Docker | Container builds |
| GHCR | Container registry |

---

## Media & Entertainment

| Component | Version | Purpose |
|-----------|---------|---------|
| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
| Prowlarr, Bazarr, etc. | Various | *arr stack |
| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |

---

## Python Dependencies (llm-workflows)

```toml
# Core
nats-py>=2.7.0           # NATS client
msgpack>=1.0.0           # Binary serialization
aiohttp>=3.9.0           # HTTP client

# ML/AI
pymilvus>=2.4.0          # Milvus client
sentence-transformers    # Embeddings
openai>=1.0.0            # vLLM OpenAI API

# Kubeflow
kfp>=2.12.1              # Pipeline SDK
```
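The msgpack dependency is what lets the handlers carry raw audio inside NATS payloads without a base64 step. A small runnable sketch of such a round trip — the field names are illustrative, not the project's actual message schema:

```python
import msgpack

# An illustrative voice event: audio rides along as raw bytes.
event = {
    "user_id": "user-123",
    "audio": b"\x00\x01\x02\x03",  # stand-in for real PCM audio
    "premium": True,
}

packed = msgpack.packb(event)
decoded = msgpack.unpackb(packed, raw=False)

assert decoded == event  # binary field survives the round trip intact
print(len(packed))       # packed size in bytes
```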

---

## Version Pinning Strategy

| Component Type | Strategy |
|----------------|----------|
| Base images | Pin major.minor |
| Helm charts | Pin exact version |
| Python packages | Pin minimum version |
| System extensions | Pin via Talos schematic |

## Related Documents

- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
- [decisions/](decisions/) - Why we chose specific technologies
71
decisions/0000-template.md
Normal file
@@ -0,0 +1,71 @@
# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].

### Positive Consequences

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options

### [option 1]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …]

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->
79
decisions/0001-record-architecture-decisions.md
Normal file
@@ -0,0 +1,79 @@
# Record Architecture Decisions

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation

## Context and Problem Statement

As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.

## Decision Drivers

* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements

## Considered Options

* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)

## Decision Outcome

Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.

### Positive Consequences

* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions

### Negative Consequences

* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions

## Pros and Cons of the Options

### Informal README documentation

* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose

### Wiki pages

* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure

### ADRs

* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions

### No documentation

* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes

## Links

* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)
97
decisions/0002-use-talos-linux.md
Normal file
@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster

## Context and Problem Statement

We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).

## Decision Drivers

* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades

## Considered Options

* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2

## Decision Outcome

Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.

### Positive Consequences

* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption

### Negative Consequences

* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads

## Pros and Cons of the Options

### Ubuntu Server with kubeadm

* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management

### Flatcar Container Linux

* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex

### Talos Linux

* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging
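The schematic system mentioned above is how GPU kernel modules get baked into the immutable image: you submit a schematic to the Talos Image Factory and it returns an ID identifying a custom build. A hedged sketch of such a schematic — the extension names are illustrative; check the factory catalog for the exact ones this cluster uses:

```yaml
# Talos Image Factory schematic (extension names illustrative)
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/amdgpu-firmware           # AMD GPU firmware
      - siderolabs/nonfree-kmod-nvidia       # NVIDIA kernel modules
```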

### k3OS

* Good, because simple
* Bad, because discontinued

### Rocky Linux with RKE2

* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface

## Links

* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics
112
decisions/0003-use-nats-for-messaging.md
Normal file
@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration

## Context and Problem Statement

The AI/ML platform requires a messaging system for:

- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration

We need a messaging system that handles both ephemeral real-time messages and persistent streams.

## Decision Drivers

* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)

## Considered Options

* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar

## Decision Outcome

Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.

### Positive Consequences

* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint

### Negative Consequences

* Smaller ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ

## Pros and Cons of the Options

### Apache Kafka

* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages

### RabbitMQ

* Good, because mature and stable
* Good, because flexible routing
* Good, because solid management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering

### Redis Pub/Sub + Streams

* Good, because simple
* Good, because Redis may already be deployed
* Good, because low latency
* Bad, because pub/sub is not persistent
* Bad, because the Streams API is less intuitive
* Bad, because messaging is not Redis's primary purpose

### NATS with JetStream

* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream is newer than Kafka

### Apache Pulsar

* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community

## Links

* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format
137
decisions/0004-use-messagepack-for-nats.md
Normal file
@@ -0,0 +1,137 @@
# Use MessagePack for NATS Messages

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages

## Context and Problem Statement

NATS messages in the AI platform carry various payloads:

- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters

We need a serialization format that handles both text and binary efficiently.

## Decision Drivers

* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Ease of debugging
* Schema flexibility

## Considered Options

* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro

## Decision Outcome

Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.

### Positive Consequences

* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries

### Negative Consequences

* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON

## Pros and Cons of the Options

### JSON

* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
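The base64 penalty is easy to verify with the standard library alone; this stand-alone snippet demonstrates the ~33% growth JSON transport imposes on binary payloads:

```python
import base64

raw = bytes(range(256)) * 12       # 3072 bytes of binary "audio"
encoded = base64.b64encode(raw)    # what embedding bytes in JSON would require

# base64 maps every 3 input bytes to 4 output characters: +33%
overhead = (len(encoded) - len(raw)) / len(raw)
print(f"{overhead:.0%}")  # → 33%
```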
|
||||||
|
|
||||||
|
### Protocol Buffers
|
||||||
|
|
||||||
|
* Good, because very compact
|
||||||
|
* Good, because fast
|
||||||
|
* Good, because schema validation
|
||||||
|
* Good, because cross-language
|
||||||
|
* Bad, because requires schema definition
|
||||||
|
* Bad, because compilation step
|
||||||
|
* Bad, because less flexible for evolving schemas
|
||||||
|
* Bad, because overkill for simple messages
|
||||||
|
|
||||||
|
### MessagePack
|
||||||
|
|
||||||
|
* Good, because binary-efficient
|
||||||
|
* Good, because JSON-like simplicity
|
||||||
|
* Good, because no schema required
|
||||||
|
* Good, because excellent library support
|
||||||
|
* Good, because can include raw bytes
|
||||||
|
* Bad, because not human-readable raw
|
||||||
|
* Bad, because no schema validation
|
||||||
|
|
||||||
|
### CBOR
|
||||||
|
|
||||||
|
* Good, because binary-efficient
|
||||||
|
* Good, because IETF standard
|
||||||
|
* Good, because schema-less
|
||||||
|
* Bad, because less common libraries
|
||||||
|
* Bad, because smaller community
|
||||||
|
* Bad, because similar to msgpack with less adoption
|
||||||
|
|
||||||
|
### Avro
|
||||||
|
|
||||||
|
* Good, because schema evolution
|
||||||
|
* Good, because compact
|
||||||
|
* Good, because schema registry integration
|
||||||
|
* Bad, because requires schema
|
||||||
|
* Bad, because more complex setup
|
||||||
|
* Bad, because Java-centric ecosystem
|
||||||
|
|
||||||
|
## Implementation Notes
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Python usage
|
||||||
|
import msgpack
|
||||||
|
|
||||||
|
# Serialize
|
||||||
|
data = {
|
||||||
|
"user_id": "user-123",
|
||||||
|
"audio": audio_bytes, # Raw bytes, no base64
|
||||||
|
"premium": True
|
||||||
|
}
|
||||||
|
payload = msgpack.packb(data)
|
||||||
|
|
||||||
|
# Deserialize
|
||||||
|
data = msgpack.unpackb(payload, raw=False)
|
||||||
|
```
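
The ~33% figure quoted against JSON can be checked with the standard library alone; this sketch measures the base64 expansion a JSON transport forces on raw audio bytes (MessagePack would carry them unencoded; the payload contents are made up):

```python
import base64
import json

audio = bytes(range(256)) * 12  # 3072 bytes standing in for raw audio
b64 = base64.b64encode(audio).decode()
json_payload = json.dumps({"user_id": "user-123", "audio": b64}).encode()

# base64 maps every 3 input bytes to 4 output characters: 4/3 ≈ 1.33x
overhead = len(b64) / len(audio)
print(f"raw: {len(audio)} B, base64: {len(b64)} B, ratio: {overhead:.2f}")
```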

```go
// Go usage
package main

import "github.com/vmihailenco/msgpack/v5"

type Message struct {
	UserID string `msgpack:"user_id"`
	Audio  []byte `msgpack:"audio"`
}

// Round-trip: msgpack.Marshal(&msg) / msgpack.Unmarshal(payload, &msg)
```

## Links

* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)
145 decisions/0005-multi-gpu-strategy.md Normal file
@@ -0,0 +1,145 @@
# Multi-GPU Heterogeneous Strategy

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads

## Context and Problem Statement

The homelab has diverse GPU hardware:

- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo

Different AI workloads have different requirements. How do we allocate GPUs effectively?

## Decision Drivers

* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity

## Considered Options

* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)

## Decision Outcome

Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on its requirements.

### Allocation Strategy

| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |

### Positive Consequences

* Each workload gets optimal hardware
* No GPU memory contention for the LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging

### Negative Consequences

* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times

## Implementation

### Node Taints

```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```

### Pod Tolerations and Node Affinity

```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```

### Resource Limits

```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
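
The allocation table can be expressed as a small placement map when generating manifests; a minimal sketch, where the dict, the `placement` helper, and the Intel resource name `gpu.intel.com/i915` are illustrative assumptions rather than actual cluster config:

```python
# Workload -> (node, GPU resource key), mirroring the allocation table above.
GPU_ALLOCATION = {
    "vllm":       ("khelben",   "amd.com/gpu"),
    "whisper":    ("elminster", "nvidia.com/gpu"),
    "xtts":       ("elminster", "nvidia.com/gpu"),
    "embeddings": ("drizzt",    "amd.com/gpu"),
    "reranker":   ("danilo",    "gpu.intel.com/i915"),  # assumed Intel plugin key
}

def placement(workload: str) -> dict:
    """Return the nodeSelector and resource limit for a workload (hypothetical helper)."""
    node, resource = GPU_ALLOCATION[workload]
    return {
        "nodeSelector": {"kubernetes.io/hostname": node},
        "resources": {"limits": {resource: 1}},
    }

print(placement("whisper"))
```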

## Pros and Cons of the Options

### Single GPU vendor only

* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware

### All workloads on largest GPU

* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs

### Workload-specific allocation (chosen)

* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks

### Dynamic GPU scheduling

* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature

## Links

* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics
140 decisions/0006-gitops-with-flux.md Normal file
@@ -0,0 +1,140 @@
# GitOps with Flux CD

* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management

## Context and Problem Statement

Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.

## Decision Drivers

* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance

## Considered Options

* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes

## Decision Outcome

Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.

### Positive Consequences

* Git is the single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Native Helm and Kustomize support
* Webhook-free sync (pull-based)

### Negative Consequences

* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers

## Configuration

### Repository Structure

```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                  # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml   # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/          # HelmRepositories
│   │       └── git/           # GitRepositories
│   └── apps/                  # Application Kustomizations
```

### Multi-Repository Sync

```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```

### SOPS Integration

```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    age: >-
      age1... # Public key
```
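
The `path_regex` above anchors only at the end, so any path ending in `.sops.yaml` is encrypted and everything else passes through untouched; a quick stdlib check (the file names are made up for illustration):

```python
import re

# Creation rule pattern from the .sops.yaml above
rule = re.compile(r".*\.sops\.yaml$")

paths = [
    "kubernetes/flux/config/secrets.sops.yaml",  # encrypted -> matches
    "kubernetes/flux/config/cluster.yaml",       # plain manifest -> no match
    "apps/valkey/values.sops.yaml",              # encrypted -> matches
]
for p in paths:
    print(p, "->", bool(rule.match(p)))
```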

## Pros and Cons of the Options

### Manual kubectl apply

* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible

### ArgoCD

* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because fast sync relies on webhooks
* Bad, because SOPS requires plugins

### Flux CD

* Good, because lightweight
* Good, because pull-based (no webhooks required)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve

### Rancher Fleet

* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community

### Pulumi/Terraform

* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because no continuous reconciliation

## Links

* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing
115 decisions/0007-use-kserve-for-inference.md Normal file
@@ -0,0 +1,115 @@
# Use KServe for ML Model Serving

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting a model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with the Kubeflow ecosystem
* GPU resource management
* Health checks and readiness

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with the Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

### Negative Consequences

* Additional CRDs and operators
* Learning curve for the InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because packaging-focused
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU support
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
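
For reference, the V2 protocol exposes predictors at `POST /v2/models/{name}/infer` with a JSON tensor envelope; a stdlib-only sketch that builds (but does not send) such a payload — the tensor name, shape, and sample rate are illustrative, not the actual Whisper contract:

```python
import json

# V2 inference protocol request body: a list of named, typed tensors.
request = {
    "inputs": [
        {
            "name": "audio",        # illustrative tensor name
            "shape": [1, 16000],
            "datatype": "FP32",
            "data": [0.0] * 16000,  # one second of silence at 16 kHz
        }
    ]
}
body = json.dumps(request).encode()
print(len(body), "bytes to POST to /v2/models/whisper/infer")
```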

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
107 decisions/0008-use-milvus-for-vectors.md Normal file
@@ -0,0 +1,107 @@
# Use Milvus for Vector Storage

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting a vector database for the RAG system

## Context and Problem Statement

The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
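
What the database does at query time is a top-k nearest-neighbour search over embeddings; a toy stdlib illustration with 3-dimensional vectors and cosine similarity (real embeddings have hundreds of dimensions, and Milvus uses ANN indexes rather than this brute-force scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Tiny made-up corpus of chunk embeddings
chunks = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.0, 1.0, 0.0],
    "chunk-3": [0.7, 0.7, 0.1],
}
query = [1.0, 0.0, 0.0]

# Top-2 most similar chunks to the query vector
top_k = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)[:2]
print(top_k)
```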

## Decision Drivers

* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities

## Considered Options

* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma

## Decision Outcome

Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.

### Positive Consequences

* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Actively developed LF AI & Data project
* GPU acceleration available

### Negative Consequences

* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization

## Pros and Cons of the Options

### Milvus

* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because LF AI & Data graduated project
* Bad, because complex architecture
* Bad, because higher resource usage

### Pinecone

* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns

### Qdrant

* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because good filtering
* Bad, because smaller community
* Bad, because fewer enterprise features

### Weaviate

* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements

### pgvector

* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scale
* Bad, because slower for large datasets
* Bad, because no specialized optimizations

### Chroma

* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features

## Links

* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities
124 decisions/0009-dual-workflow-engines.md Normal file
@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:

- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both

Should we use one engine, or leverage the strengths of multiple?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
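
The matrix amounts to a small routing rule; a sketch of it as code, where the dict and the `pick_engine` helper are illustrative and not part of either engine's API:

```python
# Engine routing derived from the decision matrix above (illustrative helper).
ENGINE_FOR = {
    "ml-training":        "kubeflow",  # component caching, experiment tracking
    "model-evaluation":   "kubeflow",  # metric collection, comparison
    "document-ingestion": "argo",      # simple DAG, no ML features needed
    "batch-inference":    "argo",      # parallelization, retries
    "complex-dag":        "argo",      # branching control flow
}

def pick_engine(use_case: str) -> str:
    # Default to Argo: plain DAG orchestration is the general case.
    return ENGINE_FOR.get(use_case, "argo")

print(pick_engine("ml-training"))
```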

### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```

## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG support
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

### Both engines (chosen)

* Good, because best of both
* Good, because the appropriate tool for each job
* Good, because the two can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)
120 decisions/0010-use-envoy-gateway.md Normal file
@@ -0,0 +1,120 @@
# Use Envoy Gateway for Ingress

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting an ingress controller for the cluster

## Context and Problem Statement

We need an ingress solution that supports:

- Gateway API (the modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination

## Decision Drivers

* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at the edge
* Active development
* Envoy ecosystem familiarity

## Considered Options

* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour

## Decision Outcome

Chosen option: "Envoy Gateway", because it is the reference implementation of the Gateway API with the full Envoy feature set.

### Positive Consequences

* Native Gateway API support
* Full Envoy feature set
* Native WebSocket and gRPC support
* No Istio complexity
* Built on a CNCF graduated project (Envoy)
* Easy integration with observability

### Negative Consequences

* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve

## Pros and Cons of the Options

### NGINX Ingress

* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API support
* Bad, because some features are gated behind the commercial edition

### Traefik

* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt integration
* Bad, because Gateway API support is experimental
* Bad, because less gRPC focus

### Envoy Gateway

* Good, because Gateway API native
* Good, because full Envoy features
* Good, because extensible
* Good, because gRPC/WebSocket native
* Bad, because newer project
* Bad, because less community content

### Istio Gateway

* Good, because full mesh features
* Good, because Gateway API support
* Bad, because overkill without a mesh
* Bad, because resource heavy

### Contour

* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API support is still evolving
* Bad, because smaller community

## Configuration Example

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```
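
GRPCRoute, mentioned in the decision drivers, follows the same pattern; a sketch for routing a gRPC inference backend through the same Gateway — the route name, hostname, backend, and the matched service are illustrative assumptions, not deployed config:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: inference-grpc   # illustrative
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - inference.lab.daviestechlabs.io   # assumed hostname
  rules:
    - matches:
        - method:
            service: inference.GRPCInferenceService   # illustrative gRPC service
      backendRefs:
        - name: vllm
          port: 8081
```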

## Links

* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)
35 diagrams/README.md Normal file
@@ -0,0 +1,35 @@
# Diagrams

This directory contains additional architecture diagrams beyond the main C4 diagrams.

## Available Diagrams

| File | Description |
|------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |

## Rendering Diagrams

### VS Code

Install the "Markdown Preview Mermaid Support" extension.

### CLI

```bash
# Using mmdc (Mermaid CLI)
npx @mermaid-js/mermaid-cli -i diagram.mmd -o diagram.png
```

### Online

Use the [Mermaid Live Editor](https://mermaid.live).

## Diagram Conventions

1. Use the `.mmd` extension for Mermaid diagrams
2. Include the title as a comment at the top of the file
3. Use consistent styling classes
4. Keep diagrams focused (one concept per diagram)
51 diagrams/data-flow-chat.mmd Normal file
@@ -0,0 +1,51 @@
%% Chat Request Data Flow
%% Sequence diagram showing chat message processing

sequenceDiagram
    autonumber
    participant U as User
    participant W as WebApp<br/>(companions)
    participant N as NATS
    participant C as Chat Handler
    participant V as Valkey<br/>(Cache)
    participant E as BGE Embeddings
    participant M as Milvus
    participant R as Reranker
    participant L as vLLM

    U->>W: Send message
    W->>N: Publish ai.chat.user.{id}.message
    N->>C: Deliver message

    C->>V: Get session history
    V-->>C: Previous messages

    alt RAG Enabled
        C->>E: Generate query embedding
        E-->>C: Query vector
        C->>M: Search similar chunks
        M-->>C: Top-K chunks

        opt Reranker Enabled
            C->>R: Rerank chunks
            R-->>C: Reordered chunks
        end
    end

    C->>L: LLM inference (context + query)

    alt Streaming Enabled
        loop For each token
            L-->>C: Token
            C->>N: Publish ai.chat.response.stream.{id}
            N-->>W: Deliver chunk
            W-->>U: Display token
        end
    else Non-streaming
        L-->>C: Full response
        C->>N: Publish ai.chat.response.{id}
        N-->>W: Deliver response
        W-->>U: Display response
    end

    C->>V: Save to session history
46 diagrams/data-flow-voice.mmd Normal file
@@ -0,0 +1,46 @@
%% Voice Request Data Flow
%% Sequence diagram showing voice assistant processing

sequenceDiagram
    autonumber
    participant U as User
    participant W as Voice WebApp
    participant N as NATS
    participant VA as Voice Assistant
    participant STT as Whisper<br/>(STT)
    participant E as BGE Embeddings
    participant M as Milvus
    participant R as Reranker
    participant L as vLLM
    participant TTS as XTTS<br/>(TTS)

    U->>W: Record audio
    W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes)
    N->>VA: Deliver voice request

    VA->>STT: Transcribe audio
    STT-->>VA: Transcription text

    alt RAG Enabled
        VA->>E: Generate query embedding
        E-->>VA: Query vector
        VA->>M: Search similar chunks
        M-->>VA: Top-K chunks

        opt Reranker Enabled
            VA->>R: Rerank chunks
            R-->>VA: Reordered chunks
        end
    end

    VA->>L: LLM inference
    L-->>VA: Response text

    VA->>TTS: Synthesize speech
    TTS-->>VA: Audio bytes

    VA->>N: Publish ai.voice.response.{id}<br/>(text + audio)
    N-->>W: Deliver response
    W-->>U: Play audio + show text

    Note over VA,TTS: Total latency target: < 3s
47
diagrams/gpu-allocation.mmd
Normal file
@@ -0,0 +1,47 @@
%% GPU Allocation Diagram
%% Shows how AI workloads are distributed across GPU nodes

flowchart TB
    subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
        direction TB
        vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
    end

    subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
        direction TB
        whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
        xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
    end

    subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
        direction TB
        embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
    end

    subgraph danilo["🖥️ danilo (Intel Arc)"]
        direction TB
        reranker["📋 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
    end

    subgraph workloads["Workload Routing"]
        chat["💬 Chat Request"]
        voice["🎤 Voice Request"]
    end

    chat --> embeddings
    chat --> reranker
    chat --> vllm

    voice --> whisper
    voice --> embeddings
    voice --> reranker
    voice --> vllm
    voice --> xtts

    classDef nvidia fill:#76B900,color:white
    classDef amd fill:#ED1C24,color:white
    classDef intel fill:#0071C5,color:white

    class whisper,xtts nvidia
    class vllm,embeddings amd
    class reranker intel
287
specs/BINARY_MESSAGES_AND_JETSTREAM.md
Normal file
@@ -0,0 +1,287 @@
# Binary Messages and JetStream Configuration

> Technical specification for NATS message handling in the AI platform

## Overview

The AI platform uses NATS with JetStream for message persistence. All messages use the MessagePack (msgpack) binary format for efficiency, especially when handling audio data.

## Message Format

### Why MessagePack?

1. **Binary efficiency**: Audio data is embedded directly, without base64 overhead
2. **Compact**: 20-50% smaller than equivalent JSON
3. **Fast**: Lower serialization/deserialization overhead
4. **Compatible**: JSON-like structure, easy debugging
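The base64 overhead in point 1 is easy to quantify with the standard library alone. An illustrative sketch (the simulated clip size is arbitrary; msgpack would carry the raw bytes as-is):

```python
import base64
import json

# Simulate ~1 second of 16 kHz, 16-bit mono audio (32 KB of raw bytes).
audio = bytes(range(256)) * 128  # 32768 bytes of sample data

# JSON cannot carry raw bytes, so the audio must be base64-encoded first.
json_payload = json.dumps({"audio": base64.b64encode(audio).decode("ascii")})

# base64 inflates binary data by roughly a third.
overhead = len(json_payload) / len(audio)
print(f"raw: {len(audio)} B, json+base64: {len(json_payload)} B, "
      f"overhead: {overhead:.2f}x")
```

On top of this size penalty, JSON also pays an encode/decode pass that msgpack avoids entirely for binary fields.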

### Schema

All messages follow this general structure:

```python
{
    "request_id": str,     # UUID for correlation
    "user_id": str,        # User identifier
    "timestamp": float,    # Unix timestamp
    "payload": Any,        # Type-specific data
    "metadata": dict       # Optional metadata
}
```
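Before publishing, the envelope's required keys can be sanity-checked with plain Python. A minimal sketch; `validate_envelope` is a hypothetical helper, not part of the existing handlers:

```python
import time
import uuid

# Required envelope fields and their expected types (assumption: matches the
# general schema above; payload/metadata are intentionally left optional).
REQUIRED_KEYS = {"request_id": str, "user_id": str, "timestamp": float}

def validate_envelope(msg: dict) -> list[str]:
    """Return a list of problems; an empty list means the envelope looks valid."""
    problems = []
    for key, typ in REQUIRED_KEYS.items():
        if key not in msg:
            problems.append(f"missing key: {key}")
        elif not isinstance(msg[key], typ):
            problems.append(f"{key} should be {typ.__name__}")
    return problems

envelope = {
    "request_id": str(uuid.uuid4()),
    "user_id": "user-123",
    "timestamp": time.time(),
    "payload": {"message": "hello"},
}
print(validate_envelope(envelope))        # no problems
print(validate_envelope({"user_id": 42}))  # missing/mistyped fields reported
```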

### Chat Message

```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "username": "john_doe",
    "message": "Hello, how are you?",
    "premium": False,
    "enable_streaming": True,
    "enable_rag": True,
    "enable_reranker": True,
    "top_k": 5,
    "session_id": "session-abc"
}
```

### Voice Message

```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "audio": b"...",          # Raw bytes, not base64!
    "format": "wav",
    "sample_rate": 16000,
    "premium": False,
    "enable_rag": True,
    "language": "en"
}
```

### Streaming Response Chunk

```python
{
    "request_id": "uuid-here",
    "type": "chunk",          # "chunk", "done", "error"
    "content": "token",
    "done": False,
    "timestamp": 1706000000.0
}
```
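On the consuming side, a client reassembles a response by appending `content` from each chunk until the `done` marker arrives. A minimal sketch over plain dicts (msgpack decoding is assumed to have happened already):

```python
def assemble_stream(chunks):
    """Concatenate streamed chunk payloads into the full response text."""
    parts = []
    for chunk in chunks:
        if chunk["type"] == "error":
            raise RuntimeError(f"stream {chunk['request_id']} failed")
        if chunk["type"] == "done":
            break
        parts.append(chunk["content"])
    return "".join(parts)

chunks = [
    {"request_id": "r1", "type": "chunk", "content": "Hel", "done": False},
    {"request_id": "r1", "type": "chunk", "content": "lo!", "done": False},
    {"request_id": "r1", "type": "done", "content": "", "done": True},
]
print(assemble_stream(chunks))  # Hello!
```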

## JetStream Configuration

### Streams

| Stream | Subjects | Retention | Max Age | Storage | Replicas |
|--------|----------|-----------|---------|---------|----------|
| `COMPANIONS_LOGINS` | `ai.chat.user.*.login` | Limits | 7 days | File | 1 |
| `COMPANIONS_CHAT` | `ai.chat.user.*.message`, `ai.chat.user.*.greeting.*` | Limits | 30 days | File | 1 |
| `AI_CHAT_STREAM` | `ai.chat.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_VOICE_STREAM` | `ai.voice.>` | Limits | 1 hour | File | 1 |
| `AI_VOICE_RESPONSE_STREAM` | `ai.voice.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_PIPELINE` | `ai.pipeline.>` | Limits | 24 hours | File | 1 |
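The wildcards in these subject filters follow NATS matching rules: `*` matches exactly one dot-separated token, `>` matches one or more trailing tokens. A small sketch of that rule, for intuition only (not the actual NATS implementation):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' = one token, '>' = rest of subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return len(s_tokens) > i  # '>' needs at least one more token
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

print(subject_matches("ai.chat.user.*.message", "ai.chat.user.42.message"))        # True
print(subject_matches("ai.chat.response.stream.>", "ai.chat.response.stream.r1"))  # True
print(subject_matches("ai.voice.>", "ai.voice"))                                   # False: '>' needs a token
```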

### Consumer Configuration

```yaml
# Durable consumer for chat handler
consumer:
  name: chat-handler
  durable_name: chat-handler
  filter_subjects:
    - "ai.chat.user.*.message"
  ack_policy: explicit
  ack_wait: 30s
  max_deliver: 3
  deliver_policy: new
```

### Stream Creation (CLI)

```bash
# Create the durable chat stream
nats stream add COMPANIONS_CHAT \
  --subjects "ai.chat.user.*.message,ai.chat.user.*.greeting.*" \
  --retention limits \
  --max-age 30d \
  --storage file \
  --replicas 1

# Create the ephemeral streaming-response stream
nats stream add AI_CHAT_STREAM \
  --subjects "ai.chat.response.stream.>" \
  --retention limits \
  --max-age 5m \
  --storage memory \
  --replicas 1
```

## Python Implementation

### Publisher

```python
import uuid
from datetime import datetime

import msgpack
import nats


async def publish_chat_message(nc: nats.NATS, user_id: str, message: str):
    data = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "message": message,
        "timestamp": datetime.utcnow().timestamp(),
        "enable_streaming": True,
        "enable_rag": True,
    }

    subject = f"ai.chat.user.{user_id}.message"
    await nc.publish(subject, msgpack.packb(data))
```

### Subscriber (JetStream)

```python
# Assumes `nc` (a connected NATS client), `logger`, and `process_chat`
# are defined at module level.
async def message_handler(msg):
    try:
        data = msgpack.unpackb(msg.data, raw=False)

        # Process message
        result = await process_chat(data)

        # Publish response
        response_subject = f"ai.chat.response.{data['request_id']}"
        await nc.publish(response_subject, msgpack.packb(result))

        # Acknowledge
        await msg.ack()

    except Exception as e:
        logger.error(f"Handler error: {e}")
        await msg.nak(delay=5)  # Retry after 5s


# Subscribe with JetStream
js = nc.jetstream()
sub = await js.subscribe(
    "ai.chat.user.*.message",
    cb=message_handler,
    durable="chat-handler",
    manual_ack=True,
)
```

### Streaming Response

```python
async def stream_response(nc, request_id: str, response_generator):
    subject = f"ai.chat.response.stream.{request_id}"

    async for token in response_generator:
        chunk = {
            "request_id": request_id,
            "type": "chunk",
            "content": token,
            "done": False
        }
        await nc.publish(subject, msgpack.packb(chunk))

    # Send done marker
    done = {
        "request_id": request_id,
        "type": "done",
        "content": "",
        "done": True
    }
    await nc.publish(subject, msgpack.packb(done))
```

## Go Implementation

### Publisher

```go
import (
	"fmt"

	"github.com/google/uuid"
	"github.com/nats-io/nats.go"
	"github.com/vmihailenco/msgpack/v5"
)

type ChatMessage struct {
	RequestID string `msgpack:"request_id"`
	UserID    string `msgpack:"user_id"`
	Message   string `msgpack:"message"`
}

func PublishChat(nc *nats.Conn, userID, message string) error {
	msg := ChatMessage{
		RequestID: uuid.New().String(),
		UserID:    userID,
		Message:   message,
	}

	data, err := msgpack.Marshal(msg)
	if err != nil {
		return err
	}

	subject := fmt.Sprintf("ai.chat.user.%s.message", userID)
	return nc.Publish(subject, data)
}
```

## Error Handling

### NAK with Delay

```python
# Temporary failure - retry later
await msg.nak(delay=5)  # 5 second delay

# Permanent failure - move to dead letter queue
if attempt >= max_retries:
    await nc.publish("ai.dlq.chat", msg.data)
    await msg.term()  # Terminate delivery
```
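Rather than a fixed 5 s, the NAK delay can grow with the delivery attempt (nats-py exposes the attempt count as `msg.metadata.num_delivered`). A hedged sketch of capped exponential backoff, with jitter omitted for determinism:

```python
def nak_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Delay before redelivery: base * 2^(attempt-1), capped at `cap` seconds."""
    return min(base * 2 ** (attempt - 1), cap)

# attempt 1 -> 5s, 2 -> 10s, 3 -> 20s, 4 -> 40s, 5+ -> 60s
print([nak_delay(a) for a in range(1, 6)])  # [5.0, 10.0, 20.0, 40.0, 60.0]
```

In the handler this would look like `await msg.nak(delay=nak_delay(msg.metadata.num_delivered))`; the `base`/`cap` values here are illustrative, not tuned.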

### Dead Letter Queue

```yaml
stream:
  name: AI_DLQ
  subjects:
    - "ai.dlq.>"
  retention: limits
  max_age: 7d
  storage: file
```

## Monitoring

### Key Metrics

```bash
# Stream info
nats stream info COMPANIONS_CHAT

# Consumer info
nats consumer info COMPANIONS_CHAT chat-handler

# Message rate
nats stream report
```

### Prometheus Metrics

- `nats_stream_messages_total`
- `nats_consumer_pending_messages`
- `nats_consumer_ack_pending`

## Related

- [ADR-0003: Use NATS for Messaging](../decisions/0003-use-nats-for-messaging.md)
- [ADR-0004: Use MessagePack](../decisions/0004-use-messagepack-for-nats.md)
- [DOMAIN-MODEL.md](../DOMAIN-MODEL.md)
36
specs/README.md
Normal file
@@ -0,0 +1,36 @@
# Specifications

This directory contains feature-level specifications and technical designs.

## Contents

- [BINARY_MESSAGES_AND_JETSTREAM.md](BINARY_MESSAGES_AND_JETSTREAM.md) - MessagePack format and JetStream configuration
- Future specs will be added here

## Spec Template

```markdown
# Feature Name

## Overview
Brief description of the feature

## Requirements
- Requirement 1
- Requirement 2

## Design
Technical design details

## API
Interface definitions

## Implementation Notes
Key implementation considerations

## Testing
Test strategy

## Open Questions
Unresolved items
```