# Use KServe for ML Model Serving

* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.

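For orientation, a minimal InferenceService manifest looks roughly like the sketch below; the model name, model format, and storage URI are illustrative assumptions, not this cluster's actual configuration:

```yaml
# Illustrative only: a minimal KServe InferenceService.
# Model name, modelFormat, and storageUri are placeholder assumptions.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge
  namespace: ai-ml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # format chosen for illustration
      storageUri: s3://models/bge  # placeholder location
      resources:
        limits:
          nvidia.com/gpu: "1"
```

The controller reconciles this single resource into the deployment, service, and (when enabled) Knative revision, which is where the "learning curve" and "operator complexity" trade-offs below come from.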
### Current Role of KServe

KServe is retained for:

- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI

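A consequence of keeping the naming convention is that clients can derive a model's endpoint from its name alone. As a sketch (the helper itself is hypothetical; the hostname pattern and port follow the ExternalName configuration shown later in this ADR, and the trailing path segment matches the per-model Ray Serve route prefix):

```python
def predictor_url(model: str, namespace: str = "ai-ml", port: int = 8000) -> str:
    """Build the KServe-compatible in-cluster URL for a model's predictor.

    Hypothetical helper: the hostname follows the alias pattern
    {model}-predictor.{namespace}.svc.cluster.local, and the final path
    segment is the model's route prefix on the shared Ray Serve endpoint.
    """
    return f"http://{model}-predictor.{namespace}.svc.cluster.local:{port}/{model}"

print(predictor_url("whisper"))
# → http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper
```

Because the alias resolves to the shared Ray Serve service, swapping the backend again later would not require changes to callers that build URLs this way.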
### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

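The V2 protocol standardizes tensor-shaped JSON request and response bodies, so every model behind a V2 endpoint is called the same way. A request body looks roughly like this (the tensor name, shape, and values are illustrative placeholders):

```python
import json

# Illustrative V2 inference request, sent as POST /v2/models/<name>/infer.
# Tensor name, shape, datatype, and data values are placeholder assumptions.
request_body = json.dumps({
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3],
        }
    ]
})
print(json.loads(request_body)["inputs"][0]["datatype"])
# → FP32
```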
### Negative Consequences

* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh configuration
* Bad, because repetitive configuration

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative Serving dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```

For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend