Use KServe for ML Model Serving

  • Status: accepted
  • Date: 2025-12-15
  • Deciders: Billy Davies
  • Technical Story: Selecting model serving platform for inference services

Context and Problem Statement

We need to deploy multiple ML inference workloads (Whisper, XTTS, BGE embeddings, and LLMs served via vLLM) as inference endpoints. Each workload has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.

Decision Drivers

  • Standardized inference protocol (V2)
  • Autoscaling based on load
  • Traffic splitting for canary deployments
  • Integration with Kubeflow ecosystem
  • GPU resource management
  • Health checks and readiness probes

Considered Options

  • Raw Kubernetes Deployments + Services
  • KServe InferenceService
  • Seldon Core
  • BentoML
  • Ray Serve only

Decision Outcome

Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

Positive Consequences

  • Standardized V2 inference protocol
  • Automatic scale-to-zero capability
  • Canary/blue-green deployments
  • Integration with Kubeflow UI
  • Transformer/Explainer components
  • GPU resource abstraction
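
The canary capability listed above maps to a single field on the predictor spec. A minimal sketch, assuming a whisper service already serving a stable revision (image tag is a placeholder):

    # Hypothetical canary rollout: KServe keeps the previous revision
    # serving 90% of traffic and routes 10% to the newly applied spec.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: whisper
      namespace: ai-ml
    spec:
      predictor:
        canaryTrafficPercent: 10   # shift 10% of requests to the new revision
        containers:
          - name: whisper
            image: ghcr.io/org/whisper:v2   # candidate image (placeholder tag)

Promoting the canary amounts to removing canaryTrafficPercent, which shifts all traffic to the latest revision; note that traffic splitting relies on KServe's serverless (Knative) deployment mode.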

Negative Consequences

  • Additional CRDs and operators
  • Learning curve for InferenceService spec
  • Some overhead for simple deployments
  • Optional dependency on Knative Serving (required for scale-to-zero and traffic splitting)
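
The Knative dependency is also what enables the scale-to-zero behaviour listed under positive consequences. A sketch of the trade-off, assuming the cluster runs KServe in serverless mode:

    # Serverless (Knative) mode: minReplicas: 0 lets idle GPU pods scale away,
    # at the cost of a cold start on the first request after idle.
    spec:
      predictor:
        minReplicas: 0
        maxReplicas: 3

Alternatively, annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment drops the Knative dependency entirely, but gives up scale-to-zero and traffic splitting in exchange for plain Deployments and HPAs.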

Pros and Cons of the Options

Raw Kubernetes Deployments

  • Good, because simple
  • Good, because full control
  • Bad, because no autoscaling logic
  • Bad, because manual service mesh
  • Bad, because repetitive configuration

KServe InferenceService

  • Good, because standardized API
  • Good, because autoscaling
  • Good, because traffic management
  • Good, because Kubeflow integration
  • Bad, because operator complexity
  • Bad, because Knative optional dependency

Seldon Core

  • Good, because mature
  • Good, because A/B testing
  • Good, because explainability
  • Bad, because more complex than KServe
  • Bad, because heavier resource usage

BentoML

  • Good, because developer-friendly
  • Good, because packaging focused
  • Bad, because less Kubernetes-native
  • Bad, because smaller community

Ray Serve

  • Good, because unified compute
  • Good, because Python-native
  • Good, because fractional GPU
  • Bad, because less standardized API
  • Bad, because Ray cluster overhead

Current Configuration

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
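
Once deployed, the endpoint is expected to speak the V2 (Open Inference Protocol) API noted in the decision drivers. A sketch of probing it from inside the cluster; the hostname is illustrative, and a custom container like this one must implement the protocol itself:

    # V2 / Open Inference Protocol endpoints (hostname is illustrative)
    curl http://whisper.ai-ml.svc.cluster.local/v2/health/ready   # server readiness
    curl http://whisper.ai-ml.svc.cluster.local/v2/models/whisper # model metadata

Inference requests then go to POST /v2/models/whisper/infer with a JSON body of named input tensors, which is what makes the protocol uniform across the different model servers considered above.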