docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions


@@ -1,7 +1,7 @@
# Use KServe for ML Model Serving
* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference end
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.
### Current Role of KServe
KServe is retained for:
- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI
### Positive Consequences
* Standardized V2 inference protocol
@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Ku
## Current Configuration
KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```
For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend


@@ -0,0 +1,146 @@
# Use KubeRay as Unified GPU Backend
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers
* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability
## Considered Options
* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
### Positive Consequences
* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods
### Negative Consequences
* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options
### Standalone KServe InferenceServices
* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types
### NVIDIA MPS for GPU sharing
* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node
### KubeRay RayService with Ray Serve
* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code
### vLLM standalone deployment
* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
```
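The layout in the diagram maps onto KubeRay's `serveConfigV2`: one application per route prefix, with fractional GPUs requested through `ray_actor_options`. A minimal sketch of what `kuberay/app/rayservice.yaml` might contain (application names and import paths are illustrative, not the actual manifest):

```yaml
# Hypothetical excerpt; field names follow the KubeRay RayService schema
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_serve:app
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5   # shares the RTX 2070 with TTS
      - name: tts
        route_prefix: /tts
        import_path: tts_serve:app
        deployments:
          - name: TTSDeployment
            ray_actor_options:
              num_gpus: 0.5
```

Because both deployments request `num_gpus: 0.5`, Ray packs them onto the same physical GPU; memory isolation is still cooperative, so per-model memory budgets must be enforced in the serving code.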
## Migration Notes
1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
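For client code, the migration is mostly a URL change: every model shares one host, and only the path prefix differs. A minimal sketch of a URL helper (the helper and its names are hypothetical; the host and route prefixes come from the architecture diagram above):

```python
# Hypothetical client-side helper: all models share one Ray Serve host;
# only the path prefix differs per model.
BASE_URL = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"

# Route prefixes from the architecture diagram above.
MODEL_ROUTES = {"llm", "whisper", "tts", "embeddings", "reranker"}


def inference_url(model: str, path: str = "") -> str:
    """Return the unified Ray Serve URL for a model-specific request."""
    if model not in MODEL_ROUTES:
        raise ValueError(f"unknown model route: {model}")
    return f"{BASE_URL}/{model}{path}"
```

Clients that must keep the KServe-era names can instead target the ExternalName aliases (`whisper-predictor.ai-ml`, etc.), which resolve to the same Ray Serve service.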
## Links
* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)


@@ -0,0 +1,195 @@
# Use uv for Python Development, pip for Docker Builds
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Standardizing Python package management across development and production
## Context and Problem Statement
Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds.
## Decision Drivers
* Fast local development iteration
* Reproducible production builds
* Modern Python packaging standards (PEP 517/518/621)
* Lock file support for deterministic installs
* Compatibility with existing CI/CD pipelines
## Considered Options
* pip only (traditional)
* Poetry
* PDM
* uv (by Astral)
* uv for development, pip for Docker
## Decision Outcome
Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images.
### Positive Consequences
* 10-100x faster package installs during development
* `uv.lock` provides deterministic dependency resolution
* `pyproject.toml` is the modern Python standard (PEP 621)
* Docker builds remain simple with standard pip
* `uv pip compile` can generate `requirements.txt` from `pyproject.toml`
* No uv runtime dependency in production containers
### Negative Consequences
* Two tools to maintain (uv locally, pip in Docker)
* Team must install uv for local development
* Lock file must be kept in sync with pyproject.toml
## Pros and Cons of the Options
### pip only (traditional)
* Good, because universal compatibility
* Good, because no additional tools
* Bad, because slow resolution and installation
* Bad, because no built-in lock file
* Bad, because `requirements.txt` lacks metadata
### Poetry
* Good, because mature ecosystem
* Good, because lock file support
* Good, because virtual environment management
* Bad, because slower than uv
* Bad, because non-standard `pyproject.toml` sections
* Bad, because complex dependency resolver
### PDM
* Good, because PEP 621 compliant
* Good, because lock file support
* Good, because fast resolver
* Bad, because less adoption than Poetry
* Bad, because still slower than uv
### uv (by Astral)
* Good, because 10-100x faster than pip
* Good, because drop-in pip replacement
* Good, because supports PEP 621 pyproject.toml
* Good, because uv.lock for deterministic builds
* Good, because from the creators of Ruff
* Bad, because newer tool (less mature)
* Bad, because requires installation
### uv for development, pip for Docker (Chosen)
* Good, because fast local development
* Good, because simple Docker builds
* Good, because no uv in production images
* Good, because pip compatibility maintained
* Bad, because two tools in workflow
* Bad, because must sync lock file
## Implementation
### Local Development Setup
```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
# Or use uv sync with lock file
uv sync
```
### Project Structure
```
my-handler/
├── pyproject.toml # PEP 621 project metadata and dependencies
├── uv.lock # Deterministic lock file (committed)
├── requirements.txt # Generated from uv.lock for Docker (optional)
├── src/
│ └── my_handler/
└── tests/
```
### pyproject.toml Example
```toml
[project]
name = "my-handler"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git",
    "httpx>=0.27.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
### Dockerfile Pattern
The Dockerfile uses uv for speed but installs through its pip-compatible interface:
```dockerfile
FROM python:3.13-slim
WORKDIR /app

# Copy uv for fast installs (optional - can use pip directly)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Option A: install the project from pyproject.toml
# (the package source must be present for the build to succeed)
COPY pyproject.toml ./
COPY src/ ./src/
RUN uv pip install --system --no-cache .

# Option B (alternative): maximum reproducibility via pinned requirements.txt
# COPY requirements.txt ./
# RUN pip install --no-cache-dir -r requirements.txt
```
### Generating requirements.txt from uv.lock
```bash
# Generate pinned requirements from lock file
uv pip compile pyproject.toml -o requirements.txt
# Or export from lock
uv export --format requirements-txt > requirements.txt
```
## Workflow
1. **Add dependency**: Edit `pyproject.toml`
2. **Update lock**: Run `uv lock`
3. **Install locally**: Run `uv sync`
4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in Dockerfile
5. **Commit**: Both `pyproject.toml` and `uv.lock`
## Migration Path
1. Create `pyproject.toml` from existing `requirements.txt`
2. Run `uv lock` to generate `uv.lock`
3. Update Dockerfile to use pyproject.toml
4. Delete `requirements.txt` (or keep as generated artifact)
## Links
* [uv Documentation](https://docs.astral.sh/uv/)
* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/)
* [Astral (uv creators)](https://astral.sh/)
* Related: handler-base already uses uv in Dockerfile