# Use KServe for ML Model Serving
- Status: superseded by ADR-0011
- Date: 2025-12-15 (Updated: 2026-02-02)
- Deciders: Billy Davies
- Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
- Standardized inference protocol (V2)
- Autoscaling based on load
- Traffic splitting for canary deployments
- Integration with Kubeflow ecosystem
- GPU resource management
- Health checks and readiness
## Considered Options
- Raw Kubernetes Deployments + Services
- KServe InferenceService
- Seldon Core
- BentoML
- Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
UPDATE (2026-02-02): While KServe remains installed, all GPU inference now runs on KubeRay RayService with Ray Serve (see ADR-0011). KServe now serves as an abstraction layer via ExternalName services that provide KServe-compatible naming ({model}-predictor.ai-ml) while routing to the unified Ray Serve endpoint.
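For reference, a minimal InferenceService of the kind this decision originally called for might look like the sketch below. The model name, format, and GPU request are illustrative, not a record of what was deployed:

```yaml
# Hypothetical minimal InferenceService (names and values are illustrative)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface   # illustrative model format
      resources:
        limits:
          nvidia.com/gpu: "1"
```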
### Current Role of KServe
KServe is retained for:
- Service naming convention: `{model}-predictor.ai-ml.svc.cluster.local`
- Future flexibility: can be used for non-GPU models or canary deployments
- Kubeflow integration: KServe InferenceServices appear in the Kubeflow UI
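The naming convention above can be captured in a small helper. This is a sketch; the `ai-ml` default namespace and the cluster-local DNS suffix come from the convention this ADR describes:

```python
def predictor_host(model: str, namespace: str = "ai-ml") -> str:
    """Return the KServe-compatible in-cluster DNS name for a model.

    Follows the {model}-predictor.{namespace} convention; every such
    name ultimately resolves to the unified Ray Serve endpoint.
    """
    return f"{model}-predictor.{namespace}.svc.cluster.local"

# → "whisper-predictor.ai-ml.svc.cluster.local"
print(predictor_host("whisper"))
```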
### Positive Consequences
- Standardized V2 inference protocol
- Automatic scale-to-zero capability
- Canary/blue-green deployments
- Integration with Kubeflow UI
- Transformer/Explainer components
- GPU resource abstraction
### Negative Consequences
- Additional CRDs and operators
- Learning curve for InferenceService spec
- Some overhead for simple deployments
- Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
- Good, because simple
- Good, because full control
- Bad, because no autoscaling logic
- Bad, because manual service mesh
- Bad, because repetitive configuration
### KServe InferenceService
- Good, because standardized API
- Good, because autoscaling
- Good, because traffic management
- Good, because Kubeflow integration
- Bad, because operator complexity
- Bad, because of the optional Knative Serving dependency
### Seldon Core
- Good, because mature
- Good, because A/B testing
- Good, because explainability
- Bad, because more complex than KServe
- Bad, because heavier resource usage
### BentoML
- Good, because developer-friendly
- Good, because packaging focused
- Bad, because less Kubernetes-native
- Bad, because smaller community
### Ray Serve
- Good, because unified compute
- Good, because Python-native
- Good, because fractional GPU
- Bad, because less standardized API
- Bad, because Ray cluster overhead
## Current Configuration
KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```
For the actual Ray Serve configuration, see ADR-0011.
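As a usage sketch, a client addresses each model through its alias exactly as it would a real KServe predictor. The port matches the Service definition above; the per-model route prefix (e.g. `/whisper/`) follows the usage comment above, but the exact routes are defined by the Ray Serve deployment in ADR-0011:

```python
# Models listed in the context section; all aliases resolve to one backend.
MODELS = ["whisper", "xtts", "bge", "vllm"]

def endpoint(model: str) -> str:
    """Build the base URL for a model via its KServe-compatible alias.

    Port 8000 matches the Service above; the route prefix is an
    assumption about the Ray Serve app layout (see ADR-0011).
    """
    return f"http://{model}-predictor.ai-ml.svc.cluster.local:8000/{model}/"

for m in MODELS:
    print(endpoint(m))
```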
## Links
- KServe
- V2 Inference Protocol
- KubeRay
- Related: ADR-0005 - GPU allocation
- Superseded by: ADR-0011 - KubeRay unified backend