diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index d554afa..727a9eb 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -60,15 +60,24 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ -│ AI SERVICES LAYER │ +│ GPU INFERENCE LAYER (KubeRay) │ ├─────────────────────────────────────────────────────────────────────────────┤ -│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ -│ │ Whisper │ │ XTTS │ │ vLLM │ │ Milvus │ │ BGE │ │Reranker │ │ -│ │ (STT) │ │ (TTS) │ │ (LLM) │ │ (RAG) │ │(Embed) │ │ (BGE) │ │ -│ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ │ -│ │ KServe │ │ KServe │ │ vLLM │ │ Helm │ │ KServe │ │ KServe │ │ -│ │ nvidia │ │ nvidia │ │ ROCm │ │ Minio │ │ rdna2 │ │ intel │ │ -│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ +│ RayService: ai-inference-serve-svc:8000 │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ Ray Serve (Unified Endpoint) │ │ +│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ +│ │ │ /whisper │ │ /tts │ │ /llm │ │/embeddings│ │/reranker │ │ │ +│ │ │ Whisper │ │ XTTS │ │ vLLM │ │ BGE-L │ │ BGE-Rnk │ │ │ +│ │ │ (0.5 GPU)│ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │ │ │ +│ │ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤ │ │ +│ │ │elminster │ │elminster │ │ khelben │ │ drizzt │ │ danilo │ │ │ +│ │ │RTX 2070 │ │RTX 2070 │ │Strix Halo│ │Radeon 680│ │Intel Arc │ │ │ +│ │ │ CUDA │ │ CUDA │ │ ROCm │ │ ROCm │ │ Intel │ │ │ +│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml │ +│ Milvus: Vector database for RAG (Helm, MinIO backend) │ └─────────────────────────────────────────────────────────────────────────────┘ │ ▼ @@ -279,6 +288,8 @@ Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafan | MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) | | Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) | | GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) | +| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) | +| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) | ## Related Documents diff --git a/CODING-CONVENTIONS.md b/CODING-CONVENTIONS.md index 30aac87..cc6ebf9 100644 --- a/CODING-CONVENTIONS.md +++ b/CODING-CONVENTIONS.md @@ -36,13 +36,19 @@ handler-base/ # Shared library for all handlers │ ├── health.py # K8s probes │ ├── telemetry.py # OpenTelemetry │ └── clients/ # Service clients +├── tests/ └── pyproject.toml chat-handler/ # Text chat service voice-assistant/ # Voice pipeline service -├── {name}.py # Standalone version -├── {name}_v2.py # Handler-base version (preferred) -└── Dockerfile.v2 +pipeline-bridge/ # Workflow engine bridge +├── {name}.py # Handler implementation (uses handler-base) +├── pyproject.toml # PEP 621 project metadata (see ADR-0012) +├── uv.lock # Deterministic lock file +├── tests/ +│ ├── conftest.py +│ └── test_{name}.py +└── Dockerfile argo/ # Argo WorkflowTemplates ├── {workflow-name}.yaml @@ 
-59,6 +65,29 @@ kuberay-images/ # GPU worker images ## Python Conventions +### Package Management (ADR-0012) + +Use **uv** for local development and **pip** in Docker for reproducibility: + +```bash +# Install uv (one-time) +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Create virtual environment and install +uv venv +source .venv/bin/activate +uv pip install -e ".[dev]" + +# Or use uv sync with lock file +uv sync + +# Update lock file after changing pyproject.toml +uv lock + +# Run tests +uv run pytest +``` + ### Project Structure ```python diff --git a/TECH-STACK.md b/TECH-STACK.md index 03e5fce..72ee526 100644 --- a/TECH-STACK.md +++ b/TECH-STACK.md @@ -31,22 +31,27 @@ ## AI/ML Layer -### Inference Engines +### GPU Inference (KubeRay RayService) -| Service | Framework | GPU | Model Type | -|---------|-----------|-----|------------| -| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models | -| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text | -| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech | -| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings | -| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking | +All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation: -### ML Serving +| Service | Model | GPU Node | GPU Type | Allocation | +|---------|-------|----------|----------|------------| +| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU | +| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU | +| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU | +| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU | +| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU | + +**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}` + +### ML Serving Stack | Component | Version | Purpose | |-----------|---------|---------| -| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework | +| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator | | [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints | +| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) | ### ML Workflows diff --git a/decisions/0007-use-kserve-for-inference.md b/decisions/0007-use-kserve-for-inference.md index da9598f..d3e5695 100644 --- a/decisions/0007-use-kserve-for-inference.md +++ b/decisions/0007-use-kserve-for-inference.md @@ -1,7 +1,7 @@ # Use KServe for ML Model Serving -* Status: accepted -* Date: 2025-12-15 +* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md) +* Date: 2025-12-15 (Updated: 2026-02-02) * Deciders: Billy Davies * Technical Story: Selecting model serving platform for inference services @@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference end Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management. 
+**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint. + +### Current Role of KServe + +KServe is retained for: +- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local` +- **Future flexibility**: Can be used for non-GPU models or canary deployments +- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI + ### Positive Consequences * Standardized V2 inference protocol @@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Ku ## Current Configuration +KServe-compatible ExternalName services route to the unified Ray Serve endpoint: + ```yaml -apiVersion: serving.kserve.io/v1beta1 -kind: InferenceService +# KServe-compatible service alias (services-ray-aliases.yaml) +apiVersion: v1 +kind: Service metadata: - name: whisper + name: whisper-predictor namespace: ai-ml + labels: + serving.kserve.io/inferenceservice: whisper spec: - predictor: - minReplicas: 1 - maxReplicas: 3 - containers: - - name: whisper - image: ghcr.io/org/whisper:latest - resources: - limits: - nvidia.com/gpu: 1 + type: ExternalName + externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local + ports: + - port: 8000 + targetPort: 8000 +--- +# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/... +# All traffic routes to Ray Serve, which handles GPU allocation ``` +For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md). + ## Links * [KServe](https://kserve.github.io) * [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/) +* [KubeRay](https://ray-project.github.io/kuberay/) * Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation +* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend diff --git a/decisions/0011-kuberay-unified-gpu-backend.md b/decisions/0011-kuberay-unified-gpu-backend.md new file mode 100644 index 0000000..3bc6535 --- /dev/null +++ b/decisions/0011-kuberay-unified-gpu-backend.md @@ -0,0 +1,146 @@ +# Use KubeRay as Unified GPU Backend + +* Status: accepted +* Date: 2026-02-02 +* Deciders: Billy Davies +* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster + +## Context and Problem Statement + +We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in: + +1. Complex scheduling across GPU types +2. No GPU sharing (each pod claimed entire GPU) +3. Multiple containers competing for GPU memory +4. Inconsistent service discovery patterns + +How do we efficiently utilize our GPU fleet while providing unified inference endpoints? 
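+
+For orientation, the mechanism weighed below is Ray Serve's fractional GPU accounting (`num_gpus` accepts fractions) combined with per-application route prefixes behind a single HTTP service. A minimal Python sketch of that pattern (class names and bodies are placeholders; the production configuration is declared in `kuberay/app/rayservice.yaml`):
+
+```python
+from ray import serve
+
+
+@serve.deployment(ray_actor_options={"num_gpus": 0.5})
+class Whisper:
+    async def __call__(self, request):
+        audio = await request.body()  # raw audio bytes from the HTTP request
+        return {"text": "..."}        # placeholder; the real deployment runs faster-whisper
+
+
+@serve.deployment(ray_actor_options={"num_gpus": 0.5})
+class TTS:
+    async def __call__(self, request):
+        text = (await request.json())["text"]
+        return {"audio": "..."}       # placeholder; the real deployment runs XTTS
+
+
+# Two applications behind one endpoint: Ray can schedule both replicas onto the
+# same physical GPU (0.5 + 0.5), and Serve routes requests by path prefix.
+serve.run(Whisper.bind(), name="whisper", route_prefix="/whisper")
+serve.run(TTS.bind(), name="tts", route_prefix="/tts")
+```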
+ +## Decision Drivers + +* Fractional GPU allocation (multiple models per GPU) +* Unified endpoint for all AI services +* Heterogeneous GPU support (CUDA, ROCm, Intel) +* Simplified service discovery +* GPU memory optimization +* Single point of observability + +## Considered Options + +* Standalone KServe InferenceServices per model +* NVIDIA MPS for GPU sharing +* KubeRay RayService with Ray Serve +* vLLM standalone deployment + +## Decision Outcome + +Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing. + +The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService. + +### Positive Consequences + +* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070 +* Single service endpoint: `ai-inference-serve-svc:8000/{model}` +* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker` +* GPU-aware scheduling via Ray's resource system +* Unified metrics and logging through Ray Dashboard +* Hot-reloading of models without restarting pods + +### Negative Consequences + +* Ray cluster overhead (head node, dashboard) +* Learning curve for Ray Serve configuration +* Custom container images per GPU architecture +* Less granular scaling (RayService vs per-model replicas) + +## Pros and Cons of the Options + +### Standalone KServe InferenceServices + +* Good, because simple per-model configuration +* Good, because independent scaling per model +* Good, because standard Kubernetes resources +* Bad, because no GPU sharing (1 GPU per pod) +* Bad, because multiple service endpoints +* Bad, because scheduling complexity across GPU types + +### NVIDIA MPS for GPU sharing + +* Good, because transparent GPU sharing +* Good, because works with existing containers +* Bad, because NVIDIA-only (no ROCm, no Intel) +* Bad, because limited memory isolation +* Bad, because complex setup per node + +### KubeRay RayService with Ray Serve + +* Good, because fractional GPU allocation +* Good, because unified endpoint +* Good, because multi-GPU-vendor support +* Good, because built-in autoscaling +* Good, because hot model reloading +* Bad, because Ray cluster overhead +* Bad, because custom Ray Serve deployment code + +### vLLM standalone deployment + +* Good, because optimized for LLM inference +* Good, because OpenAI-compatible API +* Bad, because LLM-only (not STT/TTS/Embeddings) +* Bad, because requires dedicated GPU + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ KubeRay RayService │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Service: ai-inference-serve-svc:8000 │ +│ │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ /llm │ │ /whisper │ │ /tts │ │ +│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │ +│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │ +│ │ khelben │ │ elminster │ │ elminster │ │ +│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │ +│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │ +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ +│ │ +│ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ /embeddings │ │ /reranker │ │ +│ │ BGE-Large │ │ BGE-Reranker │ │ +│ │ ─────────── │ │ ─────────── │ │ +│ │ drizzt │ │ danilo │ │ +│ │ Radeon 680M │ │ Intel Arc │ │ +│ │ (0.8 GPU) │ │ (0.8 GPU) │ │ +│ └─────────────────┘ └─────────────────┘ │ 
+└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ KServe Compatibility Layer │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ ExternalName Services (KServe-style naming): │ +│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │ +│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │ +│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │ +│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │ +│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Migration Notes + +1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept +2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml` +3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml` +4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}` + +## Links + +* [Ray Serve](https://docs.ray.io/en/latest/serve/) +* [KubeRay](https://ray-project.github.io/kuberay/) +* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) +* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy +* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer) diff --git a/decisions/0012-use-uv-for-python-development.md b/decisions/0012-use-uv-for-python-development.md new file mode 100644 index 0000000..7a991b2 --- /dev/null +++ b/decisions/0012-use-uv-for-python-development.md @@ -0,0 +1,195 @@ +# Use uv for Python Development, pip for Docker Builds + +* Status: accepted +* Date: 2026-02-02 +* Deciders: Billy Davies +* Technical Story: Standardizing Python package management across development and production + +## Context and Problem Statement + +Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds. + +## Decision Drivers + +* Fast local development iteration +* Reproducible production builds +* Modern Python packaging standards (PEP 517/518/621) +* Lock file support for deterministic installs +* Compatibility with existing CI/CD pipelines + +## Considered Options + +* pip only (traditional) +* Poetry +* PDM +* uv (by Astral) +* uv for development, pip for Docker + +## Decision Outcome + +Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images. 
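+
+In CI, this split means the committed `uv.lock` drives a deterministic install before tests run. A minimal sketch of the check (how it is wired into the existing pipeline is left open; this is an assumption, not a current config):
+
+```bash
+# Fail fast if uv.lock no longer matches pyproject.toml, then install and test
+uv sync --locked
+uv run pytest
+```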
+ +### Positive Consequences + +* 10-100x faster package installs during development +* `uv.lock` provides deterministic dependency resolution +* `pyproject.toml` is the modern Python standard (PEP 621) +* Docker builds remain simple with standard pip +* `uv pip compile` can generate `requirements.txt` from `pyproject.toml` +* No uv runtime dependency in production containers + +### Negative Consequences + +* Two tools to maintain (uv locally, pip in Docker) +* Team must install uv for local development +* Lock file must be kept in sync with pyproject.toml + +## Pros and Cons of the Options + +### pip only (traditional) + +* Good, because universal compatibility +* Good, because no additional tools +* Bad, because slow resolution and installation +* Bad, because no built-in lock file +* Bad, because `requirements.txt` lacks metadata + +### Poetry + +* Good, because mature ecosystem +* Good, because lock file support +* Good, because virtual environment management +* Bad, because slower than uv +* Bad, because non-standard `pyproject.toml` sections +* Bad, because complex dependency resolver + +### PDM + +* Good, because PEP 621 compliant +* Good, because lock file support +* Good, because fast resolver +* Bad, because less adoption than Poetry +* Bad, because still slower than uv + +### uv (by Astral) + +* Good, because 10-100x faster than pip +* Good, because drop-in pip replacement +* Good, because supports PEP 621 pyproject.toml +* Good, because uv.lock for deterministic builds +* Good, because from the creators of Ruff +* Bad, because newer tool (less mature) +* Bad, because requires installation + +### uv for development, pip for Docker (Chosen) + +* Good, because fast local development +* Good, because simple Docker builds +* Good, because no uv in production images +* Good, because pip compatibility maintained +* Bad, because two tools in workflow +* Bad, because must sync lock file + +## Implementation + +### Local Development Setup + +```bash +# Install uv (one-time) +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Create virtual environment and install dependencies +uv venv +source .venv/bin/activate +uv pip install -e ".[dev]" + +# Or use uv sync with lock file +uv sync +``` + +### Project Structure + +``` +my-handler/ +├── pyproject.toml # PEP 621 project metadata and dependencies +├── uv.lock # Deterministic lock file (committed) +├── requirements.txt # Generated from uv.lock for Docker (optional) +├── src/ +│ └── my_handler/ +└── tests/ +``` + +### pyproject.toml Example + +```toml +[project] +name = "my-handler" +version = "1.0.0" +requires-python = ">=3.11" +dependencies = [ + "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git", + "httpx>=0.27.0", +] + +[project.optional-dependencies] +dev = [ + "pytest>=8.0.0", + "pytest-asyncio>=0.23.0", + "ruff>=0.1.0", +] + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" +``` + +### Dockerfile Pattern + +The Dockerfile uses uv for speed but installs via pip-compatible interface: + +```dockerfile +FROM python:3.13-slim + +# Copy uv for fast installs (optional - can use pip directly) +COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv + +# Install from pyproject.toml +COPY pyproject.toml ./ +RUN uv pip install --system --no-cache . 
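+# NOTE: installing "." builds the handler package itself, so the source tree
+# must also be copied in first (e.g. COPY src/ ./src/) for the step above to
+# succeed; to install only the declared dependencies, an alternative is
+# uv pip install --system --no-cache -r pyproject.toml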
+ +# OR for maximum reproducibility, use requirements.txt +COPY requirements.txt ./ +RUN pip install --no-cache-dir -r requirements.txt +``` + +### Generating requirements.txt from uv.lock + +```bash +# Generate pinned requirements from lock file +uv pip compile pyproject.toml -o requirements.txt + +# Or export from lock +uv export --format requirements-txt > requirements.txt +``` + +## Workflow + +1. **Add dependency**: Edit `pyproject.toml` +2. **Update lock**: Run `uv lock` +3. **Install locally**: Run `uv sync` +4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in Dockerfile +5. **Commit**: Both `pyproject.toml` and `uv.lock` + +## Migration Path + +1. Create `pyproject.toml` from existing `requirements.txt` +2. Run `uv lock` to generate `uv.lock` +3. Update Dockerfile to use pyproject.toml +4. Delete `requirements.txt` (or keep as generated artifact) + +## Links + +* [uv Documentation](https://docs.astral.sh/uv/) +* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/) +* [Astral (uv creators)](https://astral.sh/) +* Related: handler-base already uses uv in Dockerfile