# Use KServe for ML Model Serving
- Status: superseded by ADR-0011
- Date: 2025-12-15 (Updated: 2026-02-02)
- Deciders: Billy Davies
- Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
- Standardized inference protocol (V2)
- Autoscaling based on load
- Traffic splitting for canary deployments
- Integration with Kubeflow ecosystem
- GPU resource management
- Health checks and readiness
## Considered Options
- Raw Kubernetes Deployments + Services
- KServe InferenceService
- Seldon Core
- BentoML
- Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
UPDATE (2026-02-02): While KServe remains installed, all GPU inference now runs on KubeRay RayService with Ray Serve (see ADR-0011). KServe now serves as an abstraction layer via ExternalName services that provide KServe-compatible naming ({model}-predictor.ai-ml) while routing to the unified Ray Serve endpoint.
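For reference, a minimal InferenceService of the kind this decision originally called for might look like the sketch below. The model name, format, and GPU request are illustrative, not a record of what was deployed:

```yaml
# Hypothetical minimal InferenceService (names and values are illustrative)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface   # illustrative model format
      resources:
        limits:
          nvidia.com/gpu: "1"
```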
### Current Role of KServe
KServe is retained for:
- Service naming convention: `{model}-predictor.ai-ml.svc.cluster.local`
- Future flexibility: can be used for non-GPU models or canary deployments
- Kubeflow integration: KServe InferenceServices appear in the Kubeflow UI
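The naming convention above can be captured in a small helper. This is a sketch; the `ai-ml` default namespace and the cluster-local DNS suffix come from the convention this ADR describes:

```python
def predictor_host(model: str, namespace: str = "ai-ml") -> str:
    """Return the KServe-compatible in-cluster DNS name for a model.

    Follows the {model}-predictor.{namespace} convention; every such
    name ultimately resolves to the unified Ray Serve endpoint.
    """
    return f"{model}-predictor.{namespace}.svc.cluster.local"

# → "whisper-predictor.ai-ml.svc.cluster.local"
print(predictor_host("whisper"))
```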
### Positive Consequences
- Standardized V2 inference protocol
- Automatic scale-to-zero capability
- Canary/blue-green deployments
- Integration with Kubeflow UI
- Transformer/Explainer components
- GPU resource abstraction
### Negative Consequences
- Additional CRDs and operators
- Learning curve for InferenceService spec
- Some overhead for simple deployments
- Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
- Good, because simple
- Good, because full control
- Bad, because no autoscaling logic
- Bad, because manual service mesh
- Bad, because repetitive configuration
### KServe InferenceService
- Good, because standardized API
- Good, because autoscaling
- Good, because traffic management
- Good, because Kubeflow integration
- Bad, because operator complexity
- Bad, because of the optional Knative Serving dependency
### Seldon Core
- Good, because mature
- Good, because A/B testing
- Good, because explainability
- Bad, because more complex than KServe
- Bad, because heavier resource usage
### BentoML
- Good, because developer-friendly
- Good, because packaging focused
- Bad, because less Kubernetes-native
- Bad, because smaller community
### Ray Serve
- Good, because unified compute
- Good, because Python-native
- Good, because fractional GPU
- Bad, because less standardized API
- Bad, because Ray cluster overhead
## Current Configuration
KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```
For the actual Ray Serve configuration, see ADR-0011.
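As a usage sketch, a client addresses each model through its alias exactly as it would a real KServe predictor. The port matches the Service definition above; the per-model route prefix (e.g. `/whisper/`) follows the usage comment above, but the exact routes are defined by the Ray Serve deployment in ADR-0011:

```python
# Models listed in the context section; all aliases resolve to one backend.
MODELS = ["whisper", "xtts", "bge", "vllm"]

def endpoint(model: str) -> str:
    """Build the base URL for a model via its KServe-compatible alias.

    Port 8000 matches the Service above; the route prefix is an
    assumption about the Ray Serve app layout (see ADR-0011).
    """
    return f"http://{model}-predictor.ai-ml.svc.cluster.local:8000/{model}/"

for m in MODELS:
    print(endpoint(m))
```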
## Links
- KServe
- V2 Inference Protocol
- KubeRay
- Related: ADR-0005 - GPU allocation
- Superseded by: ADR-0011 - KubeRay unified backend