docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -1,7 +1,7 @@
 # Use KServe for ML Model Serving
 
-* Status: accepted
-* Date: 2025-12-15
+* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
+* Date: 2025-12-15 (Updated: 2026-02-02)
 * Deciders: Billy Davies
 * Technical Story: Selecting model serving platform for inference services
@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference end
 Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
 
+**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.
+
+### Current Role of KServe
+
+KServe is retained for:
+- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
+- **Future flexibility**: Can be used for non-GPU models or canary deployments
+- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI
+
 ### Positive Consequences
 
 * Standardized V2 inference protocol
@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Ku
 ## Current Configuration
 
 KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
 
 ```yaml
-apiVersion: serving.kserve.io/v1beta1
-kind: InferenceService
+# KServe-compatible service alias (services-ray-aliases.yaml)
+apiVersion: v1
+kind: Service
 metadata:
-  name: whisper
+  name: whisper-predictor
   namespace: ai-ml
+  labels:
+    serving.kserve.io/inferenceservice: whisper
 spec:
-  predictor:
-    minReplicas: 1
-    maxReplicas: 3
-    containers:
-    - name: whisper
-      image: ghcr.io/org/whisper:latest
-      resources:
-        limits:
-          nvidia.com/gpu: 1
+  type: ExternalName
+  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
+  ports:
+  - port: 8000
+    targetPort: 8000
+---
+# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
+# All traffic routes to Ray Serve, which handles GPU allocation
 ```
 
 For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).
 
 ## Links
 
 * [KServe](https://kserve.github.io)
 * [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
+* [KubeRay](https://ray-project.github.io/kuberay/)
 * Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
+* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend
decisions/0011-kuberay-unified-gpu-backend.md (new file, 146 lines)
@@ -0,0 +1,146 @@
# Use KubeRay as Unified GPU Backend

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement

We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:

1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed an entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns

How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers

* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability

## Considered Options

* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome

Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.

The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
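As an illustration of what "vLLM as a Ray Serve deployment within the RayService" looks like, here is a hedged sketch of a RayService manifest. This is not the repository's actual `rayservice.yaml`: the application name, import path, image tag, and deployment name are assumptions; only the `num_gpus` fraction and route prefix follow the conventions described in this ADR.

```yaml
# Sketch of a RayService hosting vLLM as one Ray Serve application.
# Names, images, and import paths below are illustrative assumptions.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /llm
        import_path: llm_app:deployment   # assumed module exposing a vLLM deployment
        deployments:
          - name: VLLMDeployment
            ray_actor_options:
              num_gpus: 0.95              # fractional GPU, as described above
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```

Other applications (`/whisper`, `/tts`, `/embeddings`, `/reranker`) would be listed alongside `llm` in the same `serveConfigV2` block, which is how all models end up behind one service.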
### Positive Consequences

* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods
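The single-endpoint convention can be made concrete with a short sketch. The helper functions below are hypothetical illustrations (not code from the repo); the hostnames and paths follow the conventions stated in this ADR:

```python
# Illustrative sketch: every KServe-style alias resolves to the same
# Ray Serve service, which then routes requests by path.
RAY_SERVE = "ai-inference-serve-svc.ai-ml.svc.cluster.local"
MODELS = ("llm", "whisper", "tts", "embeddings", "reranker")

def alias_host(model: str) -> str:
    # ExternalName service exposed for KServe compatibility
    return f"{model}-predictor.ai-ml.svc.cluster.local"

def serve_url(model: str) -> str:
    # URL a client request ultimately reaches on the unified endpoint
    return f"http://{RAY_SERVE}:8000/{model}"

for m in MODELS:
    print(f"{alias_host(m)} -> {serve_url(m)}")
```

Clients may use either name; both land on the same Ray Serve router, which is what makes the per-model endpoints a pure naming convention rather than separate deployments.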
### Negative Consequences

* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options

### Standalone KServe InferenceServices

* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types

### NVIDIA MPS for GPU sharing

* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node

### KubeRay RayService with Ray Serve

* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code

### vLLM standalone deployment

* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU
## Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                          KubeRay RayService                          │
├──────────────────────────────────────────────────────────────────────┤
│  Service: ai-inference-serve-svc:8000                                │
│                                                                      │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐           │
│  │ /llm          │   │ /whisper      │   │ /tts          │           │
│  │ vLLM 70B      │   │ Whisper v3    │   │ XTTS          │           │
│  │ ───────────   │   │ ───────────   │   │ ───────────   │           │
│  │ khelben       │   │ elminster     │   │ elminster     │           │
│  │ Strix Halo    │   │ RTX 2070      │   │ RTX 2070      │           │
│  │ (0.95 GPU)    │   │ (0.5 GPU)     │   │ (0.5 GPU)     │           │
│  └───────────────┘   └───────────────┘   └───────────────┘           │
│                                                                      │
│  ┌───────────────┐   ┌───────────────┐                               │
│  │ /embeddings   │   │ /reranker     │                               │
│  │ BGE-Large     │   │ BGE-Reranker  │                               │
│  │ ───────────   │   │ ───────────   │                               │
│  │ drizzt        │   │ danilo        │                               │
│  │ Radeon 680M   │   │ Intel Arc     │                               │
│  │ (0.8 GPU)     │   │ (0.8 GPU)     │                               │
│  └───────────────┘   └───────────────┘                               │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      KServe Compatibility Layer                      │
├──────────────────────────────────────────────────────────────────────┤
│  ExternalName Services (KServe-style naming):                        │
│  • whisper-predictor.ai-ml    → ai-inference-serve-svc:8000          │
│  • tts-predictor.ai-ml        → ai-inference-serve-svc:8000          │
│  • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000          │
│  • reranker-predictor.ai-ml   → ai-inference-serve-svc:8000          │
│  • llm-predictor.ai-ml        → ai-inference-serve-svc:8000          │
└──────────────────────────────────────────────────────────────────────┘
```
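The fractional placements in the architecture (each node exposing a single GPU) can be sanity-checked with a few lines of Python. The node names and fractions are copied from this ADR; the check itself is only an illustration, not part of the deployment:

```python
# Fractional GPU placements per node, as described in this ADR.
placements = {
    "khelben":   {"llm": 0.95},
    "elminster": {"whisper": 0.5, "tts": 0.5},
    "drizzt":    {"embeddings": 0.8},
    "danilo":    {"reranker": 0.8},
}

# Each node has one GPU, so its allocated fractions must not exceed 1.0.
for node, models in placements.items():
    total = sum(models.values())
    assert total <= 1.0, f"{node} oversubscribed: {total}"
    print(f"{node}: {total:.2f} GPU allocated")
```

This is the budget Ray's scheduler enforces via `num_gpus`: Whisper and TTS together saturate the RTX 2070, while the other nodes keep headroom.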
## Migration Notes

1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
## Links

* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)
decisions/0012-use-uv-for-python-development.md (new file, 195 lines)
@@ -0,0 +1,195 @@
# Use uv for Python Development, pip for Docker Builds

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Standardizing Python package management across development and production
## Context and Problem Statement

Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds.

## Decision Drivers

* Fast local development iteration
* Reproducible production builds
* Modern Python packaging standards (PEP 517/518/621)
* Lock file support for deterministic installs
* Compatibility with existing CI/CD pipelines
## Considered Options

* pip only (traditional)
* Poetry
* PDM
* uv (by Astral)
* uv for development, pip for Docker
## Decision Outcome

Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images.
### Positive Consequences

* 10-100x faster package installs during development
* `uv.lock` provides deterministic dependency resolution
* `pyproject.toml` is the modern Python standard (PEP 621)
* Docker builds remain simple with standard pip
* `uv pip compile` can generate `requirements.txt` from `pyproject.toml`
* No uv runtime dependency in production containers

### Negative Consequences

* Two tools to maintain (uv locally, pip in Docker)
* Team must install uv for local development
* Lock file must be kept in sync with pyproject.toml
## Pros and Cons of the Options

### pip only (traditional)

* Good, because universal compatibility
* Good, because no additional tools
* Bad, because slow resolution and installation
* Bad, because no built-in lock file
* Bad, because `requirements.txt` lacks metadata

### Poetry

* Good, because mature ecosystem
* Good, because lock file support
* Good, because virtual environment management
* Bad, because slower than uv
* Bad, because non-standard `pyproject.toml` sections
* Bad, because complex dependency resolver

### PDM

* Good, because PEP 621 compliant
* Good, because lock file support
* Good, because fast resolver
* Bad, because less adoption than Poetry
* Bad, because still slower than uv

### uv (by Astral)

* Good, because 10-100x faster than pip
* Good, because drop-in pip replacement
* Good, because supports PEP 621 pyproject.toml
* Good, because uv.lock for deterministic builds
* Good, because from the creators of Ruff
* Bad, because newer tool (less mature)
* Bad, because requires installation

### uv for development, pip for Docker (Chosen)

* Good, because fast local development
* Good, because simple Docker builds
* Good, because no uv in production images
* Good, because pip compatibility maintained
* Bad, because two tools in workflow
* Bad, because must sync lock file
## Implementation

### Local Development Setup

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync
```
### Project Structure

```
my-handler/
├── pyproject.toml     # PEP 621 project metadata and dependencies
├── uv.lock            # Deterministic lock file (committed)
├── requirements.txt   # Generated from uv.lock for Docker (optional)
├── src/
│   └── my_handler/
└── tests/
```
### pyproject.toml Example

```toml
[project]
name = "my-handler"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git",
    "httpx>=0.27.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
### Dockerfile Pattern

The Dockerfile uses uv for speed but installs via a pip-compatible interface:

```dockerfile
FROM python:3.13-slim

# Copy uv for fast installs (optional - can use pip directly)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install from pyproject.toml
COPY pyproject.toml ./
RUN uv pip install --system --no-cache .

# OR for maximum reproducibility, use requirements.txt
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
```
### Generating requirements.txt from uv.lock

```bash
# Resolve pinned requirements from pyproject.toml
uv pip compile pyproject.toml -o requirements.txt

# Or export directly from the lock file
uv export --format requirements-txt > requirements.txt
```
## Workflow

1. **Add dependency**: Edit `pyproject.toml`
2. **Update lock**: Run `uv lock`
3. **Install locally**: Run `uv sync`
4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in Dockerfile
5. **Commit**: Both `pyproject.toml` and `uv.lock`
## Migration Path

1. Create `pyproject.toml` from existing `requirements.txt`
2. Run `uv lock` to generate `uv.lock`
3. Update Dockerfile to use pyproject.toml
4. Delete `requirements.txt` (or keep as generated artifact)
## Links

* [uv Documentation](https://docs.astral.sh/uv/)
* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/)
* [Astral (uv creators)](https://astral.sh/)
* Related: handler-base already uses uv in Dockerfile