# Use KServe for ML Model Serving
- Status: accepted
- Date: 2025-12-15
- Deciders: Billy Davies
- Technical Story: Selecting a model-serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

- Standardized inference protocol (V2)
- Autoscaling based on load
- Traffic splitting for canary deployments
- Integration with Kubeflow ecosystem
- GPU resource management
- Health checks and readiness probes

## Considered Options

- Raw Kubernetes Deployments + Services
- KServe InferenceService
- Seldon Core
- BentoML
- Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

- Standardized V2 inference protocol
- Automatic scale-to-zero capability
- Canary/blue-green deployments
- Integration with Kubeflow UI
- Transformer/Explainer components
- GPU resource abstraction
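
The scale-to-zero consequence maps to a single field on the predictor. A minimal sketch, assuming the serverless (Knative Serving) deployment mode and reusing the whisper service from this ADR:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 0   # allow Knative to scale the predictor to zero when idle
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
```

The tradeoff is cold-start latency: a GPU model must pull its image and load weights on the first request after scale-down, which is why the current configuration pins `minReplicas: 1` instead.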

### Negative Consequences

- Additional CRDs and operators
- Learning curve for InferenceService spec
- Some overhead for simple deployments
- Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

- Good, because simple
- Good, because full control
- Bad, because no built-in request-based autoscaling (HPA must be configured per model)
- Bad, because traffic splitting and routing must be wired up manually
- Bad, because repetitive configuration across models

### KServe InferenceService

- Good, because standardized API
- Good, because autoscaling
- Good, because traffic management
- Good, because Kubeflow integration
- Bad, because operator complexity
- Bad, because serverless features depend on Knative Serving (optional dependency)
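
The traffic-management point corresponds to KServe's canary rollout support: setting `canaryTrafficPercent` on the predictor splits traffic between the previously promoted revision and the latest one. A hedged sketch, with the split percentage chosen purely for illustration:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    canaryTrafficPercent: 10   # route 10% of requests to the latest revision
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
```

Promotion is then a one-field change: raise the percentage (or remove the field) once the canary revision looks healthy.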

### Seldon Core

- Good, because mature
- Good, because A/B testing
- Good, because explainability
- Bad, because more complex than KServe
- Bad, because heavier resource usage

### BentoML

- Good, because developer-friendly
- Good, because packaging focused
- Bad, because less Kubernetes-native
- Bad, because smaller community

### Ray Serve

- Good, because unified compute
- Good, because Python-native
- Good, because fractional GPU
- Bad, because less standardized API
- Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
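
The Transformer component listed under positive consequences attaches pre/post-processing to the same service. A sketch of how that could look here, using a hypothetical transformer image that is not part of the current configuration:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  transformer:                 # runs in front of the predictor, transforming requests/responses
    containers:
      - name: whisper-transformer
        image: ghcr.io/org/whisper-transformer:latest  # assumed image name, for illustration only
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```

This keeps audio decoding or payload reshaping out of the GPU predictor container, at the cost of one extra network hop per request.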

## Links

- KServe
- V2 Inference Protocol
- Related: ADR-0005 - GPU allocation