homelab-design/decisions/0007-use-kserve-for-inference.md

Use KServe for ML Model Serving

  • Status: superseded by ADR-0011
  • Date: 2025-12-15 (Updated: 2026-02-02)
  • Deciders: Billy Davies
  • Technical Story: Selecting model serving platform for inference services

Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

Decision Drivers

  • Standardized inference protocol (V2)
  • Autoscaling based on load
  • Traffic splitting for canary deployments
  • Integration with Kubeflow ecosystem
  • GPU resource management
  • Health checks and readiness

Considered Options

  • Raw Kubernetes Deployments + Services
  • KServe InferenceService
  • Seldon Core
  • BentoML
  • Ray Serve only

Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
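As originally adopted, each model was declared as a single InferenceService resource. A minimal sketch for the Whisper endpoint (the image name and resource values are illustrative, not the actual homelab configuration):

# Illustrative KServe InferenceService with a custom container
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    containers:
      - name: kserve-container
        image: example.local/whisper-server:latest  # hypothetical image
        ports:
          - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: "1"  # whole-GPU allocation per service

Each InferenceService gets its own predictor pod(s) and a stable {name}-predictor service, which is the naming convention preserved by the ExternalName aliases described below.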

UPDATE (2026-02-02): While KServe remains installed, all GPU inference now runs on KubeRay RayService with Ray Serve (see ADR-0011). KServe is retained as an abstraction layer: ExternalName services provide KServe-compatible naming ({model}-predictor.ai-ml) while routing to the unified Ray Serve endpoint.

Current Role of KServe

KServe is retained for:

  • Service naming convention: {model}-predictor.ai-ml.svc.cluster.local
  • Future flexibility: Can be used for non-GPU models or canary deployments
  • Kubeflow integration: KServe InferenceServices appear in Kubeflow UI

Positive Consequences

  • Standardized V2 inference protocol
  • Automatic scale-to-zero capability
  • Canary/blue-green deployments
  • Integration with Kubeflow UI
  • Transformer/Explainer components
  • GPU resource abstraction
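The canary/blue-green capability is largely a one-field change on the predictor spec. A hedged sketch (the canaryTrafficPercent field is from KServe's v1beta1 API and requires the Knative-backed serverless deployment mode; names and values are illustrative):

# Illustrative canary rollout: send 10% of traffic to the newest revision
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    canaryTrafficPercent: 10  # remaining 90% stays on the previous revision
    containers:
      - name: kserve-container
        image: example.local/whisper-server:v2  # hypothetical new version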

Negative Consequences

  • Additional CRDs and operators
  • Learning curve for InferenceService spec
  • Some overhead for simple deployments
  • Knative Serving dependency (optional)

Pros and Cons of the Options

Raw Kubernetes Deployments + Services

  • Good, because simple
  • Good, because full control
  • Bad, because no autoscaling logic
  • Bad, because manual service mesh
  • Bad, because repetitive configuration

KServe InferenceService

  • Good, because standardized API
  • Good, because autoscaling
  • Good, because traffic management
  • Good, because Kubeflow integration
  • Bad, because operator complexity
  • Bad, because Knative optional dependency

Seldon Core

  • Good, because mature
  • Good, because A/B testing
  • Good, because explainability
  • Bad, because more complex than KServe
  • Bad, because heavier resource usage

BentoML

  • Good, because developer-friendly
  • Good, because packaging focused
  • Bad, because less Kubernetes-native
  • Bad, because smaller community

Ray Serve only

  • Good, because unified compute
  • Good, because Python-native
  • Good, because fractional GPU
  • Bad, because less standardized API
  • Bad, because Ray cluster overhead
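The fractional-GPU point is what ultimately motivated ADR-0011: Ray Serve can pack several deployments onto one GPU, which KServe's whole-GPU-per-InferenceService model cannot. A hedged sketch of a Serve application embedded in a KubeRay RayService (module paths and fractions are illustrative; the actual configuration is documented in ADR-0011):

# Illustrative KubeRay RayService fragment with fractional GPU allocation
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app  # hypothetical module
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.25  # share one GPU across multiple models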

Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  # Note: ExternalName services resolve via a DNS CNAME record, so the
  # port listed here is informational only; no traffic is proxied by it.
  ports:
    - port: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation

For the actual Ray Serve configuration, see ADR-0011.