# Use KServe for ML Model Serving

* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting a model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with the Kubeflow ecosystem
* GPU resource management
* Health checks and readiness probes

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with the Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

### Negative Consequences

* Additional CRDs and operators to maintain
* Learning curve for the InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional, needed for scale-to-zero)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no model-aware autoscaling logic
* Bad, because service mesh and routing must be wired up manually
* Bad, because repetitive configuration per model

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because focused on model packaging
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
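The canary capability cited among the decision drivers is expressed in KServe via the `canaryTrafficPercent` field on the predictor. A minimal sketch for the Whisper service, assuming serverless (Knative) deployment mode; the candidate image tag and the 10% split are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    # Route 10% of traffic to the latest revision; the rest stays on the
    # previously promoted revision until the canary is promoted (100) or
    # rolled back (0).
    canaryTrafficPercent: 10
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:v2   # hypothetical candidate tag
        resources:
          limits:
            nvidia.com/gpu: 1
```

Promoting the canary is a follow-up edit of the same field rather than a separate rollout object, which keeps the workflow entirely declarative.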
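Once deployed, the predictor above should answer the V2 inference protocol linked in the references. A hedged Python sketch of a request body follows; the tensor name, shape, and host are assumptions, since a real client should first query the model metadata endpoint (`GET /v2/models/whisper`) for the actual input signature:

```python
import json

# V2 inference protocol request body. The tensor name, datatype, and
# shape below are placeholders for illustration, not Whisper's real
# input signature.
payload = {
    "inputs": [
        {
            "name": "audio",          # hypothetical tensor name
            "shape": [1, 16000],      # hypothetical: 1 s of 16 kHz audio
            "datatype": "FP32",
            "data": [0.0] * 16000,    # silence, for illustration
        }
    ]
}

# POST the JSON body to the model's infer endpoint, e.g.:
#   POST http://<ingress-host>/v2/models/whisper/infer
body = json.dumps(payload)
```

The same body works over gRPC via the protocol's protobuf definitions, which is one reason the V2 standard was a decision driver here.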