docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions

# Use KubeRay as Unified GPU Backend
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers
* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability
## Considered Options
* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
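The placement described above can be sketched as a plain-Python fragment (a hypothetical approximation of what `serveConfigV2` in `kuberay/app/rayservice.yaml` expresses — the dict shape is illustrative, not the real KubeRay schema; route prefixes, nodes, and GPU fractions are the ones listed in this ADR):

```python
# Hedged sketch: per-application GPU fractions, checked against a
# one-GPU-per-node budget. Values mirror the architecture diagram below.
applications = [
    {"route_prefix": "/llm",        "node": "khelben",   "num_gpus": 0.95},
    {"route_prefix": "/whisper",    "node": "elminster", "num_gpus": 0.5},
    {"route_prefix": "/tts",        "node": "elminster", "num_gpus": 0.5},
    {"route_prefix": "/embeddings", "node": "drizzt",    "num_gpus": 0.8},
    {"route_prefix": "/reranker",   "node": "danilo",    "num_gpus": 0.8},
]

def gpu_budget(apps):
    """Sum requested GPU fractions per node (each node exposes one GPU)."""
    totals = {}
    for app in apps:
        totals[app["node"]] = totals.get(app["node"], 0.0) + app["num_gpus"]
    return totals

budget = gpu_budget(applications)
# No node is oversubscribed; Whisper + TTS exactly pack elminster's RTX 2070.
assert all(total <= 1.0 for total in budget.values())
```

This is the kind of invariant worth asserting in CI before a RayService update, since an oversubscribed node would leave a deployment unschedulable.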
### Positive Consequences
* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods
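How the sharing above falls out of Ray's resource system can be illustrated with a toy first-fit packer (a deliberate simplification of Ray's actual scheduler; GPU names are illustrative):

```python
def first_fit(requests, gpus):
    """Toy first-fit: place each fractional GPU request on the first GPU
    with enough remaining capacity. Ray's real scheduler is more involved,
    but the fractional-accounting idea is the same."""
    free = {gpu: 1.0 for gpu in gpus}
    placement = {}
    for name, frac in requests:
        for gpu in gpus:
            if free[gpu] >= frac:
                free[gpu] -= frac
                placement[name] = gpu
                break
        else:
            raise RuntimeError(f"no GPU can fit {name} ({frac})")
    return placement

# Requests mirror the ADR's allocations.
placement = first_fit(
    [("vllm", 0.95), ("whisper", 0.5), ("tts", 0.5)],
    ["strix-halo", "rtx-2070"],
)
# vLLM claims most of strix-halo, so Whisper and TTS co-locate on rtx-2070.
```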
### Negative Consequences
* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options
### Standalone KServe InferenceServices
* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types
### NVIDIA MPS for GPU sharing
* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node
### KubeRay RayService with Ray Serve
* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code
### vLLM standalone deployment
* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
```
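The two layers of indirection in the diagram can be sketched in plain Python (an illustration only: in the cluster the hostname rewrite is done by the ExternalName Services and the path routing by Ray Serve, not by client code; hostnames and paths are the ones listed above):

```python
UNIFIED_HOST = "ai-inference-serve-svc:8000"

# Legacy KServe-style predictor hostnames that the ExternalName Services
# alias to the unified RayService endpoint.
LEGACY_HOSTS = [
    "whisper-predictor.ai-ml",
    "tts-predictor.ai-ml",
    "embeddings-predictor.ai-ml",
    "reranker-predictor.ai-ml",
    "llm-predictor.ai-ml",
]

def resolve_host(host: str) -> str:
    """ExternalName layer: any legacy predictor host aliases the unified svc."""
    return UNIFIED_HOST if host in LEGACY_HOSTS else host

def route(host: str, model: str) -> str:
    """Ray Serve layer: the request path selects the deployment."""
    return f"http://{resolve_host(host)}/{model}"

# Old-style and new-style clients end up at the same URL:
assert route("whisper-predictor.ai-ml", "whisper") == route(UNIFIED_HOST, "whisper")
```

This is why the migration needed no client rewrites: legacy hostnames keep resolving, and only the path determines which model serves the request.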
## Migration Notes
1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
## Links
* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now a compatibility layer)