# Use KServe for ML Model Serving

* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services

## Context and Problem Statement

We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

## Decision Drivers

* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness

## Considered Options

* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only

## Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.

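For orientation, a minimal InferenceService manifest looks roughly like the sketch below; the model name, model format, and storage URI are illustrative assumptions, not this cluster's actual configuration:

```yaml
# Illustrative only: a minimal KServe InferenceService.
# Model name, modelFormat, and storageUri are placeholder assumptions.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bge
  namespace: ai-ml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # format chosen for illustration
      storageUri: s3://models/bge  # placeholder location
      resources:
        limits:
          nvidia.com/gpu: "1"
```

The controller reconciles this single resource into the deployment, service, and (when enabled) Knative revision, which is where the "learning curve" and "operator complexity" trade-offs below come from.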
### Current Role of KServe

KServe is retained for:

- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI

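A consequence of keeping the naming convention is that clients can derive a model's endpoint from its name alone. As a sketch (the helper itself is hypothetical; the hostname pattern and port follow the ExternalName configuration shown later in this ADR, and the trailing path segment matches the per-model Ray Serve route prefix):

```python
def predictor_url(model: str, namespace: str = "ai-ml", port: int = 8000) -> str:
    """Build the KServe-compatible in-cluster URL for a model's predictor.

    Hypothetical helper: the hostname follows the alias pattern
    {model}-predictor.{namespace}.svc.cluster.local, and the final path
    segment is the model's route prefix on the shared Ray Serve endpoint.
    """
    return f"http://{model}-predictor.{namespace}.svc.cluster.local:{port}/{model}"

print(predictor_url("whisper"))
# → http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper
```

Because the alias resolves to the shared Ray Serve service, swapping the backend again later would not require changes to callers that build URLs this way.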
### Positive Consequences

* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction

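The V2 protocol standardizes tensor-shaped JSON request and response bodies, so every model behind a V2 endpoint is called the same way. A request body looks roughly like this (the tensor name, shape, and values are illustrative placeholders):

```python
import json

# Illustrative V2 inference request, sent as POST /v2/models/<name>/infer.
# Tensor name, shape, datatype, and data values are placeholder assumptions.
request_body = json.dumps({
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3],
        }
    ]
})
print(json.loads(request_body)["inputs"][0]["datatype"])
# → FP32
```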
### Negative Consequences

* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)

## Pros and Cons of the Options

### Raw Kubernetes Deployments

* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh configuration
* Bad, because repetitive configuration

### KServe InferenceService

* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because optional Knative Serving dependency

### Seldon Core

* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage

### BentoML

* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community

### Ray Serve

* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU allocation
* Bad, because less standardized API
* Bad, because Ray cluster overhead

## Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```

For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend