# Use KServe for ML Model Serving

* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting a model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with the Kubeflow ecosystem
* GPU resource management
* Health checks and readiness probes

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer**: ExternalName services preserve the KServe-compatible naming convention (`{model}-predictor.ai-ml`) while routing traffic to the unified Ray Serve endpoint.
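This ADR does not record the original manifests, but a minimal sketch of the kind of InferenceService the original decision implied might look like the following. The model format, storage URI, and resource values are illustrative placeholders, not actual configuration from this repository:

```yaml
# Hypothetical InferenceService sketch -- values are illustrative only.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface        # illustrative; the actual runtime may differ
      storageUri: pvc://models/whisper   # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: "1"    # whole-GPU allocation per replica
```

A spec like this is what the ExternalName aliases described below emulate at the naming level: KServe would have created a `whisper-predictor` service for this resource automatically.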
### Current Role of KServe

KServe is retained for:

- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: can still be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in the Kubeflow UI

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with the Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

### Negative Consequences

* Additional CRDs and operators
* Learning curve for the InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments + Services

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because service mesh wiring is manual
* Bad, because configuration is repetitive across models

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because focused on model packaging
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve only

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```

For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend