docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -60,15 +60,24 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardware
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        GPU INFERENCE LAYER (KubeRay)                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                  RayService: ai-inference-serve-svc:8000                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                     Ray Serve (Unified Endpoint)                    │   │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ │   │
│  │  │ /whisper │ │   /tts   │ │   /llm   │ │/embeddings│ │/reranker │ │   │
│  │  │ Whisper  │ │   XTTS   │ │   vLLM   │ │   BGE-L   │ │ BGE-Rnk  │ │   │
│  │  │ (0.5 GPU)│ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │ │   │
│  │  ├──────────┤ ├──────────┤ ├──────────┤ ├───────────┤ ├──────────┤ │   │
│  │  │elminster │ │elminster │ │ khelben  │ │  drizzt   │ │  danilo  │ │   │
│  │  │ RTX 2070 │ │ RTX 2070 │ │Strix Halo│ │Radeon 680 │ │Intel Arc │ │   │
│  │  │   CUDA   │ │   CUDA   │ │   ROCm   │ │   ROCm    │ │  Intel   │ │   │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘ │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml     │
│  Milvus: Vector database for RAG (Helm, MinIO backend)                      │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
@@ -279,6 +288,8 @@ Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |

## Related Documents
@@ -36,13 +36,19 @@ handler-base/            # Shared library for all handlers
│   ├── health.py        # K8s probes
│   ├── telemetry.py     # OpenTelemetry
│   └── clients/         # Service clients
├── tests/
└── pyproject.toml

chat-handler/            # Text chat service
voice-assistant/         # Voice pipeline service
pipeline-bridge/         # Workflow engine bridge
├── {name}.py            # Handler implementation (uses handler-base)
├── pyproject.toml       # PEP 621 project metadata (see ADR-0012)
├── uv.lock              # Deterministic lock file
├── tests/
│   ├── conftest.py
│   └── test_{name}.py
└── Dockerfile

argo/                    # Argo WorkflowTemplates
├── {workflow-name}.yaml

@@ -59,6 +65,29 @@ kuberay-images/          # GPU worker images

## Python Conventions

### Package Management (ADR-0012)

Use **uv** for local development and **pip** in Docker for reproducibility:

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync

# Update lock file after changing pyproject.toml
uv lock

# Run tests
uv run pytest
```

### Project Structure

```python
@@ -31,22 +31,27 @@

## AI/ML Layer

### GPU Inference (KubeRay RayService)

All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:

| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |

**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
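As a quick sanity check, the fractional allocations in the table can be verified to fit on each node. This sketch is illustrative (node names and fractions are copied from the table above; the check itself is not repository code):

```python
# Fractional GPU allocations per node, as listed in the table above.
# Illustrative check: fractional requests on one GPU must not exceed 1.0.
allocations = {
    "khelben":   [0.95],       # /llm (vLLM)
    "elminster": [0.5, 0.5],   # /whisper + /tts share one RTX 2070
    "drizzt":    [0.8],        # /embeddings
    "danilo":    [0.8],        # /reranker
}

for node, fractions in allocations.items():
    total = sum(fractions)
    assert total <= 1.0, f"{node} oversubscribed: {total}"
    print(f"{node}: {total} GPU")
```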
### ML Serving Stack

| Component | Version | Purpose |
|-----------|---------|---------|
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |

### ML Workflows
@@ -1,7 +1,7 @@
# Use KServe for ML Model Serving

* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services

@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.

### Current Role of KServe

KServe is retained for:

- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI

### Positive Consequences

* Standardized V2 inference protocol

@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService"

## Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```

For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend
decisions/0011-kuberay-unified-gpu-backend.md (new file, 146 lines)
@@ -0,0 +1,146 @@
# Use KubeRay as Unified GPU Backend

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster

## Context and Problem Statement

We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:

1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed an entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns

How do we efficiently utilize our GPU fleet while providing unified inference endpoints?

## Decision Drivers

* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability

## Considered Options

* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment

## Decision Outcome

Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.

The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
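The decision above translates into a RayService manifest with fractional `num_gpus` requests per deployment. A minimal sketch of what the serve config could look like, under stated assumptions — the import paths and deployment names below are illustrative, and only two of the five services are shown; the real manifest lives in `kuberay/app/rayservice.yaml`:

```yaml
# Hypothetical excerpt of a RayService with fractional GPU requests.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app        # illustrative import path
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5                 # shares the RTX 2070 with /tts
      - name: tts
        route_prefix: /tts
        import_path: tts_app:app            # illustrative import path
        deployments:
          - name: XTTSDeployment
            ray_actor_options:
              num_gpus: 0.5
```

Ray treats `num_gpus` as a logical resource, which is what allows two deployments to co-locate on one physical GPU.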
### Positive Consequences

* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods

### Negative Consequences

* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)

## Pros and Cons of the Options

### Standalone KServe InferenceServices

* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types

### NVIDIA MPS for GPU sharing

* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node

### KubeRay RayService with Ray Serve

* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code

### vLLM standalone deployment

* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            KubeRay RayService                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  Service: ai-inference-serve-svc:8000                                       │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │      /llm       │  │    /whisper     │  │      /tts       │              │
│  │    vLLM 70B     │  │   Whisper v3    │  │      XTTS       │              │
│  │   ───────────   │  │   ───────────   │  │   ───────────   │              │
│  │     khelben     │  │    elminster    │  │    elminster    │              │
│  │   Strix Halo    │  │    RTX 2070     │  │    RTX 2070     │              │
│  │   (0.95 GPU)    │  │    (0.5 GPU)    │  │    (0.5 GPU)    │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐                                   │
│  │   /embeddings   │  │    /reranker    │                                   │
│  │    BGE-Large    │  │  BGE-Reranker   │                                   │
│  │   ───────────   │  │   ───────────   │                                   │
│  │     drizzt      │  │     danilo      │                                   │
│  │   Radeon 680M   │  │    Intel Arc    │                                   │
│  │    (0.8 GPU)    │  │    (0.8 GPU)    │                                   │
│  └─────────────────┘  └─────────────────┘                                   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        KServe Compatibility Layer                           │
├─────────────────────────────────────────────────────────────────────────────┤
│  ExternalName Services (KServe-style naming):                               │
│    • whisper-predictor.ai-ml    → ai-inference-serve-svc:8000               │
│    • tts-predictor.ai-ml        → ai-inference-serve-svc:8000               │
│    • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000               │
│    • reranker-predictor.ai-ml   → ai-inference-serve-svc:8000               │
│    • llm-predictor.ai-ml        → ai-inference-serve-svc:8000               │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Migration Notes

1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
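With every client targeting the single endpoint via path-based routes, client code reduces to URL construction plus an ordinary HTTP call. A hedged sketch — the helper function below is illustrative, not code from the repo; only the endpoint host and route names come from this ADR:

```python
# Illustrative helper: build URLs for the unified Ray Serve endpoint.
BASE = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"
ROUTES = {"llm", "whisper", "tts", "embeddings", "reranker"}

def inference_url(service: str) -> str:
    """Return the path-based route for a model service."""
    if service not in ROUTES:
        raise ValueError(f"unknown service: {service!r}")
    return f"{BASE}/{service}"

print(inference_url("whisper"))
```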
## Links

* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)
decisions/0012-use-uv-for-python-development.md (new file, 195 lines)
@@ -0,0 +1,195 @@
# Use uv for Python Development, pip for Docker Builds

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Standardizing Python package management across development and production

## Context and Problem Statement

Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds.

## Decision Drivers

* Fast local development iteration
* Reproducible production builds
* Modern Python packaging standards (PEP 517/518/621)
* Lock file support for deterministic installs
* Compatibility with existing CI/CD pipelines

## Considered Options

* pip only (traditional)
* Poetry
* PDM
* uv (by Astral)
* uv for development, pip for Docker

## Decision Outcome

Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images.

### Positive Consequences

* 10-100x faster package installs during development
* `uv.lock` provides deterministic dependency resolution
* `pyproject.toml` is the modern Python standard (PEP 621)
* Docker builds remain simple with standard pip
* `uv pip compile` can generate `requirements.txt` from `pyproject.toml`
* No uv runtime dependency in production containers

### Negative Consequences

* Two tools to maintain (uv locally, pip in Docker)
* Team must install uv for local development
* Lock file must be kept in sync with pyproject.toml

## Pros and Cons of the Options

### pip only (traditional)

* Good, because universal compatibility
* Good, because no additional tools
* Bad, because slow resolution and installation
* Bad, because no built-in lock file
* Bad, because `requirements.txt` lacks metadata

### Poetry

* Good, because mature ecosystem
* Good, because lock file support
* Good, because virtual environment management
* Bad, because slower than uv
* Bad, because non-standard `pyproject.toml` sections
* Bad, because complex dependency resolver

### PDM

* Good, because PEP 621 compliant
* Good, because lock file support
* Good, because fast resolver
* Bad, because less adoption than Poetry
* Bad, because still slower than uv

### uv (by Astral)

* Good, because 10-100x faster than pip
* Good, because drop-in pip replacement
* Good, because supports PEP 621 pyproject.toml
* Good, because uv.lock for deterministic builds
* Good, because from the creators of Ruff
* Bad, because newer tool (less mature)
* Bad, because requires installation

### uv for development, pip for Docker (Chosen)

* Good, because fast local development
* Good, because simple Docker builds
* Good, because no uv in production images
* Good, because pip compatibility maintained
* Bad, because two tools in workflow
* Bad, because must sync lock file

## Implementation

### Local Development Setup

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync
```

### Project Structure

```
my-handler/
├── pyproject.toml       # PEP 621 project metadata and dependencies
├── uv.lock              # Deterministic lock file (committed)
├── requirements.txt     # Generated from uv.lock for Docker (optional)
├── src/
│   └── my_handler/
└── tests/
```

### pyproject.toml Example

```toml
[project]
name = "my-handler"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git",
    "httpx>=0.27.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
### Dockerfile Pattern

The Dockerfile uses uv for speed but installs via the pip-compatible interface:

```dockerfile
FROM python:3.13-slim

# Copy uv for fast installs (optional - can use pip directly)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install from pyproject.toml
COPY pyproject.toml ./
RUN uv pip install --system --no-cache .

# OR for maximum reproducibility, use requirements.txt
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
```

### Generating requirements.txt from uv.lock

```bash
# Generate pinned requirements from lock file
uv pip compile pyproject.toml -o requirements.txt

# Or export from lock
uv export --format requirements-txt > requirements.txt
```

## Workflow

1. **Add dependency**: Edit `pyproject.toml`
2. **Update lock**: Run `uv lock`
3. **Install locally**: Run `uv sync`
4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in the Dockerfile
5. **Commit**: Both `pyproject.toml` and `uv.lock`

## Migration Path

1. Create `pyproject.toml` from the existing `requirements.txt`
2. Run `uv lock` to generate `uv.lock`
3. Update the Dockerfile to use pyproject.toml
4. Delete `requirements.txt` (or keep it as a generated artifact)

## Links

* [uv Documentation](https://docs.astral.sh/uv/)
* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/)
* [Astral (uv creators)](https://astral.sh/)
* Related: handler-base already uses uv in its Dockerfile