feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and `kubectl cluster-info dump` data.

`decisions/0007-use-kserve-for-inference.md` (new file, 115 lines)

# Use KServe for ML Model Serving

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting a model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
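The V2 (Open Inference) protocol mentioned above standardizes the request path and JSON body shape across model servers, which is what makes the endpoints interchangeable. A minimal sketch of building such a request, where the helper name, the `bge` model, and the tensor values are all illustrative assumptions:

```python
import json


def v2_infer_request(model: str, tensor_name: str, datatype: str,
                     shape: list, data: list) -> tuple[str, str]:
    """Build the URL path and JSON body for a V2-protocol inference call."""
    path = f"/v2/models/{model}/infer"
    body = {
        "inputs": [
            # Each input tensor carries its own name, datatype, and shape.
            {"name": tensor_name, "datatype": datatype, "shape": shape, "data": data}
        ]
    }
    return path, json.dumps(body)


# Hypothetical call against a BGE embedding endpoint
path, body = v2_infer_request("bge", "input_text", "BYTES", [1], ["hello world"])
print(path)  # /v2/models/bge/infer
```

The same convention covers the health-check driver: V2 servers also expose `/v2/health/ready` and `/v2/models/{model}/ready` for readiness probing.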

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
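The scale-to-zero capability listed above comes from the Knative autoscaler: setting `minReplicas: 0` lets an idle predictor be scaled away entirely until the next request arrives. A hypothetical sketch, where the service name, namespace layout, and image are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xtts            # hypothetical service name
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 0      # allow Knative to scale the predictor to zero when idle
    maxReplicas: 2
    containers:
      - name: xtts
        image: ghcr.io/org/xtts:latest
```

The trade-off is cold-start latency on the first request after scale-down, which matters for GPU-backed models with large weights to load.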

### Negative Consequences

* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh configuration
* Bad, because repetitive configuration per model

### KServe InferenceService

* Good, because standardized API
* Good, because built-in autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing support
* Good, because explainability tooling
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because focused on packaging
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
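The canary deployments listed among the decision drivers work by setting `canaryTrafficPercent` on the predictor: that share of requests goes to the newest revision while the rest stays on the last ready one. A sketch against the whisper service above, where the new image tag and the split are purely illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    canaryTrafficPercent: 10        # route 10% of requests to the new revision
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:v2   # hypothetical tag under canary test
```

Promoting the canary is then just removing the field (or raising it to 100) and re-applying the manifest.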

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation