feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
Commit 832cda34bd (parent 4d4f6f464c), 2026-02-01 14:30:05 -05:00.
26 changed files with 3805 additions and 2 deletions.

# Use KServe for ML Model Serving
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness probes
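Two of the drivers above, the standardized V2 protocol and health/readiness checks, come packaged together: the V2 inference protocol defines fixed probe paths that Kubernetes probes can target. A minimal sketch (the paths follow the V2 spec; the model name `whisper` is illustrative):

```python
# V2 inference protocol health endpoints (paths per the V2 spec).
SERVER_LIVE = "/v2/health/live"
SERVER_READY = "/v2/health/ready"
MODEL_READY = "/v2/models/{model}/ready"


def readiness_paths(model: str) -> list[str]:
    """Return the V2-protocol probe paths for the server and one model."""
    return [SERVER_LIVE, SERVER_READY, MODEL_READY.format(model=model)]


paths = readiness_paths("whisper")
```

Because these paths are fixed by the protocol, liveness and readiness probes can be configured identically for every model server, regardless of the framework behind it.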
## Considered Options
* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
### Positive Consequences
* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
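The scale-to-zero and canary capabilities listed above map onto fields of the `InferenceService` spec. A minimal sketch, assuming KServe v1beta1 field names (the image tag is illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 0            # scale-to-zero when idle (requires Knative Serving)
    canaryTrafficPercent: 10  # route 10% of traffic to the latest revision
    containers:
    - name: whisper
      image: ghcr.io/org/whisper:canary
```

Setting `canaryTrafficPercent` splits traffic between the previous and latest revisions, so a new image can be validated on a slice of real traffic before full rollout.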
### Negative Consequences
* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because traffic routing and canary splits must be wired up manually
* Bad, because repetitive configuration
### KServe InferenceService
* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because of the optional Knative Serving dependency
### Seldon Core
* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage
### BentoML
* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community
### Ray Serve
* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead
## Current Configuration
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
    - name: whisper
      image: ghcr.io/org/whisper:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```
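A client calls the service above through the V2 inference protocol's `POST /v2/models/{model}/infer` endpoint. A sketch of building the request body (the payload shape follows the V2 spec; the host, input name, and tensor contents are illustrative assumptions):

```python
import json


def v2_infer_request(input_name: str, datatype: str, shape: list[int], data: list) -> dict:
    """Build a V2-protocol inference request body."""
    return {
        "inputs": [
            {
                "name": input_name,
                "datatype": datatype,  # e.g. FP32, INT64, BYTES
                "shape": shape,
                "data": data,
            }
        ]
    }


body = v2_infer_request("audio", "FP32", [1, 4], [0.0, 0.1, 0.2, 0.3])
url = "http://whisper.ai-ml.svc.cluster.local/v2/models/whisper/infer"

# A real call would POST the JSON body, e.g.:
#   import urllib.request
#   req = urllib.request.Request(url, data=json.dumps(body).encode(),
#                                headers={"Content-Type": "application/json"})
#   resp = json.load(urllib.request.urlopen(req))
print(json.dumps(body))
```

Because every model behind KServe speaks the same protocol, the same client code works against Whisper, XTTS, BGE, or vLLM endpoints with only the model name and tensor contents changing.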
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation