Use KServe for ML Model Serving

  • Status: accepted
  • Date: 2025-12-15
  • Deciders: Billy Davies
  • Technical Story: Selecting model serving platform for inference services

Context and Problem Statement

We need to deploy multiple ML inference workloads (Whisper, XTTS, BGE embeddings, and LLMs served via vLLM) as inference endpoints. Each workload has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

Decision Drivers

  • Standardized inference protocol (V2)
  • Autoscaling based on load
  • Traffic splitting for canary deployments
  • Integration with Kubeflow ecosystem
  • GPU resource management
  • Health checks and readiness probes

Considered Options

  • Raw Kubernetes Deployments + Services
  • KServe InferenceService
  • Seldon Core
  • BentoML
  • Ray Serve only

Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

Positive Consequences

  • Standardized V2 inference protocol
  • Automatic scale-to-zero capability
  • Canary/blue-green deployments
  • Integration with Kubeflow UI
  • Transformer/Explainer components
  • GPU resource abstraction
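
The canary capability listed above maps to a single field on the predictor spec. A minimal sketch, assuming a whisper service already serving a stable revision (image tag is a placeholder):

    # Hypothetical canary rollout: KServe keeps the previous revision
    # serving 90% of traffic and routes 10% to the newly applied spec.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: whisper
      namespace: ai-ml
    spec:
      predictor:
        canaryTrafficPercent: 10   # shift 10% of requests to the new revision
        containers:
          - name: whisper
            image: ghcr.io/org/whisper:v2   # candidate image (placeholder tag)

Promoting the canary amounts to removing canaryTrafficPercent, which shifts all traffic to the latest revision; note that traffic splitting relies on KServe's serverless (Knative) deployment mode.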

Negative Consequences

  • Additional CRDs and operators
  • Learning curve for InferenceService spec
  • Some overhead for simple deployments
  • Optional dependency on Knative Serving (required for scale-to-zero and traffic splitting)
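
The Knative dependency is also what enables the scale-to-zero behaviour listed under positive consequences. A sketch of the trade-off, assuming the cluster runs KServe in serverless mode:

    # Serverless (Knative) mode: minReplicas: 0 lets idle GPU pods scale away,
    # at the cost of a cold start on the first request after idle.
    spec:
      predictor:
        minReplicas: 0
        maxReplicas: 3

Alternatively, annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment drops the Knative dependency entirely, but gives up scale-to-zero and traffic splitting in exchange for plain Deployments and HPAs.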

Pros and Cons of the Options

Raw Kubernetes Deployments

  • Good, because simple
  • Good, because full control
  • Bad, because no autoscaling logic
  • Bad, because manual service mesh
  • Bad, because repetitive configuration

KServe InferenceService

  • Good, because standardized API
  • Good, because autoscaling
  • Good, because traffic management
  • Good, because Kubeflow integration
  • Bad, because operator complexity
  • Bad, because Knative optional dependency

Seldon Core

  • Good, because mature
  • Good, because A/B testing
  • Good, because explainability
  • Bad, because more complex than KServe
  • Bad, because heavier resource usage

BentoML

  • Good, because developer-friendly
  • Good, because packaging focused
  • Bad, because less Kubernetes-native
  • Bad, because smaller community

Ray Serve

  • Good, because unified compute
  • Good, because Python-native
  • Good, because fractional GPU
  • Bad, because less standardized API
  • Bad, because Ray cluster overhead

Current Configuration

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
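
Once deployed, the endpoint is expected to speak the V2 (Open Inference Protocol) API noted in the decision drivers. A sketch of probing it from inside the cluster; the hostname is illustrative, and a custom container like this one must implement the protocol itself:

    # V2 / Open Inference Protocol endpoints (hostname is illustrative)
    curl http://whisper.ai-ml.svc.cluster.local/v2/health/ready   # server readiness
    curl http://whisper.ai-ml.svc.cluster.local/v2/models/whisper # model metadata

Inference requests then go to POST /v2/models/whisper/infer with a JSON body of named input tensors, which is what makes the protocol uniform across the different model servers considered above.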