feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions


@@ -0,0 +1,71 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in the form of a question.]
## Decision Drivers
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].
### Positive Consequences
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->
### Negative Consequences
* [e.g., compromising quality attribute, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->
## Pros and Cons of the Options
### [option 1]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->


@@ -0,0 +1,79 @@
# Record Architecture Decisions
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation
## Context and Problem Statement
As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.
## Decision Drivers
* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements
## Considered Options
* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)
## Decision Outcome
Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.
### Positive Consequences
* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions
### Negative Consequences
* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions
## Pros and Cons of the Options
### Informal README documentation
* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose
### Wiki pages
* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure
### ADRs
* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions
### No documentation
* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes
## Links
* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)


@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster
## Context and Problem Statement
We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, and Intel GPUs).
## Decision Drivers
* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades
## Considered Options
* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2
## Decision Outcome
Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
### Positive Consequences
* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption
### Negative Consequences
* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
## Pros and Cons of the Options
### Ubuntu Server with kubeadm
* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management
### Flatcar Container Linux
* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex
### Talos Linux
* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging
### k3OS
* Good, because simple
* Bad, because discontinued
### Rocky Linux with RKE2
* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface
## Links
* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics


@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration
## Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
## Decision Drivers
* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)
## Considered Options
* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar
## Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
### Positive Consequences
* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint
### Negative Consequences
* Less ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ
## Pros and Cons of the Options
### Apache Kafka
* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages
### RabbitMQ
* Good, because mature and stable
* Good, because flexible routing
* Good, because good management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering
### Redis Pub/Sub + Streams
* Good, because simple
* Good, because Redis may already be in the stack
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not Redis's primary purpose
### NATS with JetStream
* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream newer than Kafka
### Apache Pulsar
* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
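## Implementation Notes
A minimal sketch of the request-reply and wildcard-subscription patterns named in the decision drivers, using the nats-py client; the server URL and subject names are illustrative assumptions, not the platform's actual schema.
```python
# Sketch only: URL and subjects are assumed for illustration.
import asyncio

import nats

async def main():
    nc = await nats.connect("nats://nats.nats.svc.cluster.local:4222")

    # Wildcard subscription: one handler routes chat messages for any user
    async def on_chat(msg):
        print(msg.subject, len(msg.data), "bytes")

    await nc.subscribe("chat.*.messages", cb=on_chat)

    # Built-in request-reply: send audio bytes, await the STT response
    reply = await nc.request("voice.stt.request", b"<raw audio>", timeout=2.0)
    print(reply.data)

    await nc.drain()

asyncio.run(main())
```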
## Links
* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format


@@ -0,0 +1,137 @@
# Use MessagePack for NATS Messages
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages
## Context and Problem Statement
NATS messages in the AI platform carry various payloads:
- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters
We need a serialization format that handles both text and binary efficiently.
## Decision Drivers
* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Debugging ability
* Schema flexibility
## Considered Options
* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro
## Decision Outcome
Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.
### Positive Consequences
* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries
### Negative Consequences
* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON
## Pros and Cons of the Options
### JSON
* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
### Protocol Buffers
* Good, because very compact
* Good, because fast
* Good, because schema validation
* Good, because cross-language
* Bad, because requires schema definition
* Bad, because compilation step
* Bad, because less flexible for evolving schemas
* Bad, because overkill for simple messages
### MessagePack
* Good, because binary-efficient
* Good, because JSON-like simplicity
* Good, because no schema required
* Good, because excellent library support
* Good, because can include raw bytes
* Bad, because not human-readable raw
* Bad, because no schema validation
### CBOR
* Good, because binary-efficient
* Good, because IETF standard
* Good, because schema-less
* Bad, because less common libraries
* Bad, because smaller community
* Bad, because similar to msgpack with less adoption
### Avro
* Good, because schema evolution
* Good, because compact
* Good, because schema registry integration
* Bad, because requires schema
* Bad, because more complex setup
* Bad, because Java-centric ecosystem
## Implementation Notes
```python
# Python usage
import msgpack

# Serialize
audio_bytes = b"\x52\x49\x46\x46"  # stand-in for raw audio bytes
data = {
    "user_id": "user-123",
    "audio": audio_bytes,  # Raw bytes, no base64
    "premium": True,
}
payload = msgpack.packb(data)

# Deserialize
data = msgpack.unpackb(payload, raw=False)
```
```go
// Go usage
import "github.com/vmihailenco/msgpack/v5"

type Message struct {
    UserID string `msgpack:"user_id"`
    Audio  []byte `msgpack:"audio"`
}

// Round-trip with msgpack.Marshal(&msg) / msgpack.Unmarshal(b, &msg).
```
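Since ADR-0003 pairs this format with NATS, a hedged sketch of the two together; the subject name and server URL are illustrative assumptions.
```python
# Sketch: msgpack-encoded payload published over NATS (names assumed)
import asyncio

import msgpack
import nats

async def publish_chat():
    nc = await nats.connect("nats://nats.nats.svc.cluster.local:4222")
    payload = msgpack.packb({"user_id": "user-123", "text": "hello"})
    await nc.publish("chat.user-123.messages", payload)
    await nc.drain()

asyncio.run(publish_chat())
```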
## Links
* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)


@@ -0,0 +1,145 @@
# Multi-GPU Heterogeneous Strategy
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity
## Considered Options
* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
### Positive Consequences
* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging
### Negative Consequences
* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
## Implementation
### Node Taints
```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity
```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits
```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
## Pros and Cons of the Options
### Single GPU vendor only
* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware
### All workloads on largest GPU
* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)
* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks
### Dynamic GPU scheduling
* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature
## Links
* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics


@@ -0,0 +1,140 @@
# GitOps with Flux CD
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management
## Context and Problem Statement
Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.
## Decision Drivers
* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance
## Considered Options
* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes
## Decision Outcome
Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.
### Positive Consequences
* Git is single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Webhook-free sync (pull-based)
### Negative Consequences
* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers
## Configuration
### Repository Structure
```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                  # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml   # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/          # HelmRepositories
│   │       └── git/           # GitRepositories
│   └── apps/                  # Application Kustomizations
```
### Multi-Repository Sync
```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```
### SOPS Integration
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    age: >-
      age1...  # Public key
```
## Pros and Cons of the Options
### Manual kubectl apply
* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible
### ArgoCD
* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins
### Flux CD
* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve
### Rancher Fleet
* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community
### Pulumi/Terraform
* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because no continuous reconciliation
## Links
* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing


@@ -0,0 +1,115 @@
# Use KServe for ML Model Serving
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
## Considered Options
* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
### Positive Consequences
* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
### Negative Consequences
* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration
### KServe InferenceService
* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because Knative optional dependency
### Seldon Core
* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage
### BentoML
* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community
### Ray Serve
* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead
## Current Configuration
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
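A client-side sketch against the V2 inference protocol this service exposes; the in-cluster hostname, input tensor name, and payload shape are illustrative assumptions rather than the actual Whisper contract.
```python
# Sketch: V2 inference protocol call (host/tensor names are assumed)
import base64

import requests

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

body = {
    "inputs": [
        {"name": "audio", "shape": [1], "datatype": "BYTES", "data": [audio_b64]}
    ]
}
resp = requests.post(
    "http://whisper.ai-ml.svc.cluster.local/v2/models/whisper/infer",
    json=body,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["outputs"])
```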
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation


@@ -0,0 +1,107 @@
# Use Milvus for Vector Storage
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting vector database for RAG system
## Context and Problem Statement
The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
## Decision Drivers
* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities
## Considered Options
* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma
## Decision Outcome
Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.
### Positive Consequences
* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Active CNCF sandbox project
* GPU acceleration available
### Negative Consequences
* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization
## Pros and Cons of the Options
### Milvus
* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because CNCF project
* Bad, because complex architecture
* Bad, because higher resource usage
### Pinecone
* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns
### Qdrant
* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because good filtering
* Bad, because smaller community
* Bad, because fewer enterprise features
### Weaviate
* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements
### pgvector
* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scale
* Bad, because slower for large datasets
* Bad, because no specialized optimizations
### Chroma
* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features
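## Implementation Notes
A minimal sketch of the top-k search with metadata filtering called out in the decision drivers, using the pymilvus MilvusClient; the collection name, field names, and embedding dimension are illustrative assumptions.
```python
# Sketch: top-k vector search with scalar filtering (names assumed)
from pymilvus import MilvusClient

client = MilvusClient(uri="http://milvus.ai-ml.svc.cluster.local:19530")

query_embedding = [0.1] * 768  # stand-in for a BGE embedding

hits = client.search(
    collection_name="documents",
    data=[query_embedding],           # one query vector
    limit=5,                          # top-k
    filter='source == "wiki"',        # hybrid search: vector + scalar
    output_fields=["chunk_id", "source"],
)
print(hits[0])
```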
## Links
* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities


@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we use one engine or leverage the strengths of multiple engines?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

                 OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
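A sketch of the second path, an Argo step calling the Kubeflow Pipelines API via the kfp SDK; the API host, pipeline package, and arguments are illustrative assumptions.
```python
# Sketch: trigger a KFP run from an Argo step (host/paths are assumed)
import kfp

client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")
run = client.create_run_from_pipeline_package(
    "train_pipeline.yaml",
    arguments={"dataset_uri": "s3://datasets/docs"},
)
print("started run:", run.run_id)
```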
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)


@@ -0,0 +1,120 @@
# Use Envoy Gateway for Ingress
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting ingress controller for cluster
## Context and Problem Statement
We need an ingress solution that supports:
- Gateway API (modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination
## Decision Drivers
* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at edge
* Active development
* Envoy ecosystem familiarity
## Considered Options
* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour
## Decision Outcome
Chosen option: "Envoy Gateway", because it's the reference implementation of Gateway API with full Envoy feature set.
### Positive Consequences
* Native Gateway API support
* Full Envoy feature set
* WebSocket and gRPC native
* No Istio complexity
* CNCF graduated project (Envoy)
* Easy integration with observability
### Negative Consequences
* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve
## Pros and Cons of the Options
### NGINX Ingress
* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API
* Bad, because commercial features gated
### Traefik
* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt
* Bad, because Gateway API experimental
* Bad, because less gRPC focus
### Envoy Gateway
* Good, because Gateway API native
* Good, because full Envoy features
* Good, because extensible
* Good, because gRPC/WebSocket native
* Bad, because newer project
* Bad, because less community content
### Istio Gateway
* Good, because full mesh features
* Good, because Gateway API
* Bad, because overkill without mesh
* Bad, because resource heavy
### Contour
* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API evolving
* Bad, because smaller community
## Configuration Example
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```
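WebSocket pass-through (one of the decision drivers) can be smoke-tested through this route; a sketch using the websockets library, with an assumed /ws path.
```python
# Sketch: WebSocket probe through the gateway (the /ws path is assumed)
import asyncio

import websockets

async def probe():
    uri = "wss://companions-chat.lab.daviestechlabs.io/ws"
    async with websockets.connect(uri) as ws:
        await ws.send("ping")
        print(await ws.recv())

asyncio.run(probe())
```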
## Links
* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)