docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions


@@ -1,7 +1,7 @@
# Use KServe for ML Model Serving
* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference end
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.
### Current Role of KServe
KServe is retained for:
- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI
### Positive Consequences
* Standardized V2 inference protocol
@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Ku
## Current Configuration
KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```
For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend


@@ -0,0 +1,146 @@
# Use KubeRay as Unified GPU Backend
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers
* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability
## Considered Options
* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
### Positive Consequences
* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods
### Negative Consequences
* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options
### Standalone KServe InferenceServices
* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types
### NVIDIA MPS for GPU sharing
* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node
### KubeRay RayService with Ray Serve
* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code
### vLLM standalone deployment
* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
```
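The layout in the diagram maps onto KubeRay's `serveConfigV2`: one application per route prefix, with fractional GPUs requested through `ray_actor_options`. A minimal sketch of what `kuberay/app/rayservice.yaml` might contain (application names and import paths are illustrative, not the actual manifest):

```yaml
# Hypothetical excerpt; field names follow the KubeRay RayService schema
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_serve:app
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5   # shares the RTX 2070 with TTS
      - name: tts
        route_prefix: /tts
        import_path: tts_serve:app
        deployments:
          - name: TTSDeployment
            ray_actor_options:
              num_gpus: 0.5
```

Because both deployments request `num_gpus: 0.5`, Ray packs them onto the same physical GPU; memory isolation is still cooperative, so per-model memory budgets must be enforced in the serving code.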
## Migration Notes
1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
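For client code, the migration is mostly a URL change: every model shares one host, and only the path prefix differs. A minimal sketch of a URL helper (the helper and its names are hypothetical; the host and route prefixes come from the architecture diagram above):

```python
# Hypothetical client-side helper: all models share one Ray Serve host;
# only the path prefix differs per model.
BASE_URL = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"

# Route prefixes from the architecture diagram above.
MODEL_ROUTES = {"llm", "whisper", "tts", "embeddings", "reranker"}


def inference_url(model: str, path: str = "") -> str:
    """Return the unified Ray Serve URL for a model-specific request."""
    if model not in MODEL_ROUTES:
        raise ValueError(f"unknown model route: {model}")
    return f"{BASE_URL}/{model}{path}"
```

Clients that must keep the KServe-era names can instead target the ExternalName aliases (`whisper-predictor.ai-ml`, etc.), which resolve to the same Ray Serve service.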
## Links
* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)


@@ -0,0 +1,195 @@
# Use uv for Python Development, pip for Docker Builds
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Standardizing Python package management across development and production
## Context and Problem Statement
Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds.
## Decision Drivers
* Fast local development iteration
* Reproducible production builds
* Modern Python packaging standards (PEP 517/518/621)
* Lock file support for deterministic installs
* Compatibility with existing CI/CD pipelines
## Considered Options
* pip only (traditional)
* Poetry
* PDM
* uv (by Astral)
* uv for development, pip for Docker
## Decision Outcome
Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images.
### Positive Consequences
* 10-100x faster package installs during development
* `uv.lock` provides deterministic dependency resolution
* `pyproject.toml` is the modern Python standard (PEP 621)
* Docker builds remain simple with standard pip
* `uv pip compile` can generate `requirements.txt` from `pyproject.toml`
* No uv runtime dependency in production containers
### Negative Consequences
* Two tools to maintain (uv locally, pip in Docker)
* Team must install uv for local development
* Lock file must be kept in sync with pyproject.toml
## Pros and Cons of the Options
### pip only (traditional)
* Good, because universal compatibility
* Good, because no additional tools
* Bad, because slow resolution and installation
* Bad, because no built-in lock file
* Bad, because `requirements.txt` lacks metadata
### Poetry
* Good, because mature ecosystem
* Good, because lock file support
* Good, because virtual environment management
* Bad, because slower than uv
* Bad, because non-standard `pyproject.toml` sections
* Bad, because complex dependency resolver
### PDM
* Good, because PEP 621 compliant
* Good, because lock file support
* Good, because fast resolver
* Bad, because less adoption than Poetry
* Bad, because still slower than uv
### uv (by Astral)
* Good, because 10-100x faster than pip
* Good, because drop-in pip replacement
* Good, because supports PEP 621 pyproject.toml
* Good, because uv.lock for deterministic builds
* Good, because from the creators of Ruff
* Bad, because newer tool (less mature)
* Bad, because requires installation
### uv for development, pip for Docker (Chosen)
* Good, because fast local development
* Good, because simple Docker builds
* Good, because no uv in production images
* Good, because pip compatibility maintained
* Bad, because two tools in workflow
* Bad, because must sync lock file
## Implementation
### Local Development Setup
```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
# Or use uv sync with lock file
uv sync
```
### Project Structure
```
my-handler/
├── pyproject.toml # PEP 621 project metadata and dependencies
├── uv.lock # Deterministic lock file (committed)
├── requirements.txt # Generated from uv.lock for Docker (optional)
├── src/
│ └── my_handler/
└── tests/
```
### pyproject.toml Example
```toml
[project]
name = "my-handler"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git",
    "httpx>=0.27.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
### Dockerfile Pattern
The Dockerfile uses uv for speed but installs through its pip-compatible interface:
```dockerfile
FROM python:3.13-slim
WORKDIR /app

# Copy uv for fast installs (optional - can use pip directly)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Option A: install the project from pyproject.toml
# (the package source must be present for the build to succeed)
COPY pyproject.toml ./
COPY src/ ./src/
RUN uv pip install --system --no-cache .

# Option B (alternative): maximum reproducibility via pinned requirements.txt
# COPY requirements.txt ./
# RUN pip install --no-cache-dir -r requirements.txt
```
### Generating requirements.txt from uv.lock
```bash
# Generate pinned requirements from lock file
uv pip compile pyproject.toml -o requirements.txt
# Or export from lock
uv export --format requirements-txt > requirements.txt
```
## Workflow
1. **Add dependency**: Edit `pyproject.toml`
2. **Update lock**: Run `uv lock`
3. **Install locally**: Run `uv sync`
4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in Dockerfile
5. **Commit**: Both `pyproject.toml` and `uv.lock`
## Migration Path
1. Create `pyproject.toml` from existing `requirements.txt`
2. Run `uv lock` to generate `uv.lock`
3. Update Dockerfile to use pyproject.toml
4. Delete `requirements.txt` (or keep as generated artifact)
## Links
* [uv Documentation](https://docs.astral.sh/uv/)
* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/)
* [Astral (uv creators)](https://astral.sh/)
* Related: handler-base already uses uv in Dockerfile