feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of the homelab-k8s2 and llm-workflows repositories and `kubectl cluster-info dump` data.

`decisions/0007-use-kserve-for-inference.md` (new file, 115 lines)

# Use KServe for ML Model Serving

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting a model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
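The V2 (Open Inference) protocol mentioned above standardizes the request path and JSON body shape across model servers, which is what makes the endpoints interchangeable. A minimal sketch of building such a request, where the helper name, the `bge` model, and the tensor values are all illustrative assumptions:

```python
import json


def v2_infer_request(model: str, tensor_name: str, datatype: str,
                     shape: list, data: list) -> tuple[str, str]:
    """Build the URL path and JSON body for a V2-protocol inference call."""
    path = f"/v2/models/{model}/infer"
    body = {
        "inputs": [
            # Each input tensor carries its own name, datatype, and shape.
            {"name": tensor_name, "datatype": datatype, "shape": shape, "data": data}
        ]
    }
    return path, json.dumps(body)


# Hypothetical call against a BGE embedding endpoint
path, body = v2_infer_request("bge", "input_text", "BYTES", [1], ["hello world"])
print(path)  # /v2/models/bge/infer
```

The same convention covers the health-check driver: V2 servers also expose `/v2/health/ready` and `/v2/models/{model}/ready` for readiness probing.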

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
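The scale-to-zero capability listed above comes from the Knative autoscaler: setting `minReplicas: 0` lets an idle predictor be scaled away entirely until the next request arrives. A hypothetical sketch, where the service name, namespace layout, and image are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xtts            # hypothetical service name
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 0      # allow Knative to scale the predictor to zero when idle
    maxReplicas: 2
    containers:
      - name: xtts
        image: ghcr.io/org/xtts:latest
```

The trade-off is cold-start latency on the first request after scale-down, which matters for GPU-backed models with large weights to load.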

### Negative Consequences

* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh configuration
* Bad, because repetitive configuration per model

### KServe InferenceService

* Good, because standardized API
* Good, because built-in autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing support
* Good, because explainability tooling
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because focused on packaging
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
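The canary deployments listed among the decision drivers work by setting `canaryTrafficPercent` on the predictor: that share of requests goes to the newest revision while the rest stays on the last ready one. A sketch against the whisper service above, where the new image tag and the split are purely illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    canaryTrafficPercent: 10        # route 10% of requests to the new revision
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:v2   # hypothetical tag under canary test
```

Promoting the canary is then just removing the field (or raising it to 100) and re-applying the manifest.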

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation