docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -60,15 +60,24 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardware
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        GPU INFERENCE LAYER (KubeRay)                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                  RayService: ai-inference-serve-svc:8000                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                     Ray Serve (Unified Endpoint)                    │   │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ │   │
│  │  │ /whisper │ │   /tts   │ │   /llm   │ │/embeddings│ │/reranker │ │   │
│  │  │ Whisper  │ │   XTTS   │ │   vLLM   │ │   BGE-L   │ │ BGE-Rnk  │ │   │
│  │  │ (0.5 GPU)│ │(0.5 GPU) │ │(0.95 GPU)│ │ (0.8 GPU) │ │(0.8 GPU) │ │   │
│  │  ├──────────┤ ├──────────┤ ├──────────┤ ├───────────┤ ├──────────┤ │   │
│  │  │elminster │ │elminster │ │ khelben  │ │  drizzt   │ │  danilo  │ │   │
│  │  │ RTX 2070 │ │ RTX 2070 │ │Strix Halo│ │Radeon 680 │ │Intel Arc │ │   │
│  │  │   CUDA   │ │   CUDA   │ │   ROCm   │ │   ROCm    │ │  Intel   │ │   │
│  │  └──────────┘ └──────────┘ └──────────┘ └───────────┘ └──────────┘ │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  KServe Aliases: {whisper,tts,llm,embeddings,reranker}-predictor.ai-ml     │
│  Milvus: Vector database for RAG (Helm, MinIO backend)                      │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
@@ -279,6 +288,8 @@ Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |

## Related Documents
@@ -36,13 +36,19 @@ handler-base/            # Shared library for all handlers
│   ├── health.py        # K8s probes
│   ├── telemetry.py     # OpenTelemetry
│   └── clients/         # Service clients
├── tests/
└── pyproject.toml

chat-handler/            # Text chat service
voice-assistant/         # Voice pipeline service
pipeline-bridge/         # Workflow engine bridge
├── {name}.py            # Handler implementation (uses handler-base)
├── pyproject.toml       # PEP 621 project metadata (see ADR-0012)
├── uv.lock              # Deterministic lock file
├── tests/
│   ├── conftest.py
│   └── test_{name}.py
└── Dockerfile

argo/                    # Argo WorkflowTemplates
├── {workflow-name}.yaml

@@ -59,6 +65,29 @@ kuberay-images/          # GPU worker images

## Python Conventions

### Package Management (ADR-0012)

Use **uv** for local development and **pip** in Docker for reproducibility:

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync

# Update lock file after changing pyproject.toml
uv lock

# Run tests
uv run pytest
```

### Project Structure

```python
@@ -31,22 +31,27 @@

## AI/ML Layer

### GPU Inference (KubeRay RayService)

All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:

| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |

**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
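As a quick sanity check, the fractional allocations in the table can be verified to fit on each node. This sketch is illustrative (node names and fractions are copied from the table above; the check itself is not repository code):

```python
# Fractional GPU allocations per node, as listed in the table above.
# Illustrative check: fractional requests on one GPU must not exceed 1.0.
allocations = {
    "khelben":   [0.95],       # /llm (vLLM)
    "elminster": [0.5, 0.5],   # /whisper + /tts share one RTX 2070
    "drizzt":    [0.8],        # /embeddings
    "danilo":    [0.8],        # /reranker
}

for node, fractions in allocations.items():
    total = sum(fractions)
    assert total <= 1.0, f"{node} oversubscribed: {total}"
    print(f"{node}: {total} GPU")
```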
### ML Serving Stack

| Component | Version | Purpose |
|-----------|---------|---------|
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |

### ML Workflows
@@ -1,7 +1,7 @@
# Use KServe for ML Model Serving

* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
* Date: 2025-12-15 (Updated: 2026-02-02)
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services

@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.

**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.

### Current Role of KServe

KServe is retained for:

- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
- **Future flexibility**: Can be used for non-GPU models or canary deployments
- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI

### Positive Consequences

* Standardized V2 inference protocol

@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService"

## Current Configuration

KServe-compatible ExternalName services route to the unified Ray Serve endpoint:

```yaml
# KServe-compatible service alias (services-ray-aliases.yaml)
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
  labels:
    serving.kserve.io/inferenceservice: whisper
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000
      targetPort: 8000
---
# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
# All traffic routes to Ray Serve, which handles GPU allocation
```

For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).

## Links

* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend
decisions/0011-kuberay-unified-gpu-backend.md (new file, 146 lines)
@@ -0,0 +1,146 @@
# Use KubeRay as Unified GPU Backend

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster

## Context and Problem Statement

We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:

1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed an entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns

How do we efficiently utilize our GPU fleet while providing unified inference endpoints?

## Decision Drivers

* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability

## Considered Options

* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment

## Decision Outcome

Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.

The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
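The decision above translates into a RayService manifest with fractional `num_gpus` requests per deployment. A minimal sketch of what the serve config could look like, under stated assumptions — the import paths and deployment names below are illustrative, and only two of the five services are shown; the real manifest lives in `kuberay/app/rayservice.yaml`:

```yaml
# Hypothetical excerpt of a RayService with fractional GPU requests.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app        # illustrative import path
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5                 # shares the RTX 2070 with /tts
      - name: tts
        route_prefix: /tts
        import_path: tts_app:app            # illustrative import path
        deployments:
          - name: XTTSDeployment
            ray_actor_options:
              num_gpus: 0.5
```

Ray treats `num_gpus` as a logical resource, which is what allows two deployments to co-locate on one physical GPU.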
### Positive Consequences

* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods

### Negative Consequences

* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)

## Pros and Cons of the Options

### Standalone KServe InferenceServices

* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types

### NVIDIA MPS for GPU sharing

* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node

### KubeRay RayService with Ray Serve

* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code

### vLLM standalone deployment

* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            KubeRay RayService                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  Service: ai-inference-serve-svc:8000                                       │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │      /llm       │  │    /whisper     │  │      /tts       │              │
│  │    vLLM 70B     │  │   Whisper v3    │  │      XTTS       │              │
│  │   ───────────   │  │   ───────────   │  │   ───────────   │              │
│  │     khelben     │  │    elminster    │  │    elminster    │              │
│  │   Strix Halo    │  │    RTX 2070     │  │    RTX 2070     │              │
│  │   (0.95 GPU)    │  │    (0.5 GPU)    │  │    (0.5 GPU)    │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐                                   │
│  │   /embeddings   │  │    /reranker    │                                   │
│  │    BGE-Large    │  │  BGE-Reranker   │                                   │
│  │   ───────────   │  │   ───────────   │                                   │
│  │     drizzt      │  │     danilo      │                                   │
│  │   Radeon 680M   │  │    Intel Arc    │                                   │
│  │    (0.8 GPU)    │  │    (0.8 GPU)    │                                   │
│  └─────────────────┘  └─────────────────┘                                   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        KServe Compatibility Layer                           │
├─────────────────────────────────────────────────────────────────────────────┤
│  ExternalName Services (KServe-style naming):                               │
│    • whisper-predictor.ai-ml    → ai-inference-serve-svc:8000               │
│    • tts-predictor.ai-ml        → ai-inference-serve-svc:8000               │
│    • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000               │
│    • reranker-predictor.ai-ml   → ai-inference-serve-svc:8000               │
│    • llm-predictor.ai-ml        → ai-inference-serve-svc:8000               │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Migration Notes

1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
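With every client targeting the single endpoint via path-based routes, client code reduces to URL construction plus an ordinary HTTP call. A hedged sketch — the helper function below is illustrative, not code from the repo; only the endpoint host and route names come from this ADR:

```python
# Illustrative helper: build URLs for the unified Ray Serve endpoint.
BASE = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"
ROUTES = {"llm", "whisper", "tts", "embeddings", "reranker"}

def inference_url(service: str) -> str:
    """Return the path-based route for a model service."""
    if service not in ROUTES:
        raise ValueError(f"unknown service: {service!r}")
    return f"{BASE}/{service}"

print(inference_url("whisper"))
```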
## Links

* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)
decisions/0012-use-uv-for-python-development.md (new file, 195 lines)
@@ -0,0 +1,195 @@
# Use uv for Python Development, pip for Docker Builds

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Standardizing Python package management across development and production

## Context and Problem Statement

Our Python projects use a mix of `requirements.txt` and `pyproject.toml` for dependency management. Local development with `pip` is slow, and we need a consistent approach across all repositories while maintaining reproducible Docker builds.

## Decision Drivers

* Fast local development iteration
* Reproducible production builds
* Modern Python packaging standards (PEP 517/518/621)
* Lock file support for deterministic installs
* Compatibility with existing CI/CD pipelines

## Considered Options

* pip only (traditional)
* Poetry
* PDM
* uv (by Astral)
* uv for development, pip for Docker

## Decision Outcome

Chosen option: "uv for development, pip for Docker", because uv provides extremely fast package resolution and installation for local development (10-100x faster than pip), while pip in Docker ensures maximum compatibility and reproducibility without requiring uv to be installed in production images.

### Positive Consequences

* 10-100x faster package installs during development
* `uv.lock` provides deterministic dependency resolution
* `pyproject.toml` is the modern Python standard (PEP 621)
* Docker builds remain simple with standard pip
* `uv pip compile` can generate `requirements.txt` from `pyproject.toml`
* No uv runtime dependency in production containers

### Negative Consequences

* Two tools to maintain (uv locally, pip in Docker)
* Team must install uv for local development
* Lock file must be kept in sync with pyproject.toml

## Pros and Cons of the Options

### pip only (traditional)

* Good, because universal compatibility
* Good, because no additional tools
* Bad, because slow resolution and installation
* Bad, because no built-in lock file
* Bad, because `requirements.txt` lacks metadata

### Poetry

* Good, because mature ecosystem
* Good, because lock file support
* Good, because virtual environment management
* Bad, because slower than uv
* Bad, because non-standard `pyproject.toml` sections
* Bad, because complex dependency resolver

### PDM

* Good, because PEP 621 compliant
* Good, because lock file support
* Good, because fast resolver
* Bad, because less adoption than Poetry
* Bad, because still slower than uv

### uv (by Astral)

* Good, because 10-100x faster than pip
* Good, because drop-in pip replacement
* Good, because supports PEP 621 pyproject.toml
* Good, because uv.lock for deterministic builds
* Good, because from the creators of Ruff
* Bad, because newer tool (less mature)
* Bad, because requires installation

### uv for development, pip for Docker (Chosen)

* Good, because fast local development
* Good, because simple Docker builds
* Good, because no uv in production images
* Good, because pip compatibility maintained
* Bad, because two tools in workflow
* Bad, because must sync lock file

## Implementation

### Local Development Setup

```bash
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or use uv sync with lock file
uv sync
```

### Project Structure

```
my-handler/
├── pyproject.toml       # PEP 621 project metadata and dependencies
├── uv.lock              # Deterministic lock file (committed)
├── requirements.txt     # Generated from uv.lock for Docker (optional)
├── src/
│   └── my_handler/
└── tests/
```

### pyproject.toml Example

```toml
[project]
name = "my-handler"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base @ git+https://git.daviestechlabs.io/daviestechlabs/handler-base.git",
    "httpx>=0.27.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```
### Dockerfile Pattern

The Dockerfile uses uv for speed but installs via the pip-compatible interface:

```dockerfile
FROM python:3.13-slim

# Copy uv for fast installs (optional - can use pip directly)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install from pyproject.toml
COPY pyproject.toml ./
RUN uv pip install --system --no-cache .

# OR for maximum reproducibility, use requirements.txt
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
```

### Generating requirements.txt from uv.lock

```bash
# Generate pinned requirements from lock file
uv pip compile pyproject.toml -o requirements.txt

# Or export from lock
uv export --format requirements-txt > requirements.txt
```

## Workflow

1. **Add dependency**: Edit `pyproject.toml`
2. **Update lock**: Run `uv lock`
3. **Install locally**: Run `uv sync`
4. **For Docker**: Optionally generate `requirements.txt` or use `uv pip install` in the Dockerfile
5. **Commit**: Both `pyproject.toml` and `uv.lock`

## Migration Path

1. Create `pyproject.toml` from the existing `requirements.txt`
2. Run `uv lock` to generate `uv.lock`
3. Update the Dockerfile to use pyproject.toml
4. Delete `requirements.txt` (or keep it as a generated artifact)

## Links

* [uv Documentation](https://docs.astral.sh/uv/)
* [PEP 621 - Project Metadata](https://peps.python.org/pep-0621/)
* [Astral (uv creators)](https://astral.sh/)
* Related: handler-base already uses uv in its Dockerfile