docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions

# Use KubeRay as Unified GPU Backend
* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers
* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability
## Considered Options
* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
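The placement described above can be sketched as a plain-Python fragment (a hypothetical approximation of what `serveConfigV2` in `kuberay/app/rayservice.yaml` expresses — the dict shape is illustrative, not the real KubeRay schema; route prefixes, nodes, and GPU fractions are the ones listed in this ADR):

```python
# Hedged sketch: per-application GPU fractions, checked against a
# one-GPU-per-node budget. Values mirror the architecture diagram below.
applications = [
    {"route_prefix": "/llm",        "node": "khelben",   "num_gpus": 0.95},
    {"route_prefix": "/whisper",    "node": "elminster", "num_gpus": 0.5},
    {"route_prefix": "/tts",        "node": "elminster", "num_gpus": 0.5},
    {"route_prefix": "/embeddings", "node": "drizzt",    "num_gpus": 0.8},
    {"route_prefix": "/reranker",   "node": "danilo",    "num_gpus": 0.8},
]

def gpu_budget(apps):
    """Sum requested GPU fractions per node (each node exposes one GPU)."""
    totals = {}
    for app in apps:
        totals[app["node"]] = totals.get(app["node"], 0.0) + app["num_gpus"]
    return totals

budget = gpu_budget(applications)
# No node is oversubscribed; Whisper + TTS exactly pack elminster's RTX 2070.
assert all(total <= 1.0 for total in budget.values())
```

This is the kind of invariant worth asserting in CI before a RayService update, since an oversubscribed node would leave a deployment unschedulable.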
### Positive Consequences
* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods
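How the sharing above falls out of Ray's resource system can be illustrated with a toy first-fit packer (a deliberate simplification of Ray's actual scheduler; GPU names are illustrative):

```python
def first_fit(requests, gpus):
    """Toy first-fit: place each fractional GPU request on the first GPU
    with enough remaining capacity. Ray's real scheduler is more involved,
    but the fractional-accounting idea is the same."""
    free = {gpu: 1.0 for gpu in gpus}
    placement = {}
    for name, frac in requests:
        for gpu in gpus:
            if free[gpu] >= frac:
                free[gpu] -= frac
                placement[name] = gpu
                break
        else:
            raise RuntimeError(f"no GPU can fit {name} ({frac})")
    return placement

# Requests mirror the ADR's allocations.
placement = first_fit(
    [("vllm", 0.95), ("whisper", 0.5), ("tts", 0.5)],
    ["strix-halo", "rtx-2070"],
)
# vLLM claims most of strix-halo, so Whisper and TTS co-locate on rtx-2070.
```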
### Negative Consequences
* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options
### Standalone KServe InferenceServices
* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types
### NVIDIA MPS for GPU sharing
* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node
### KubeRay RayService with Ray Serve
* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code
### vLLM standalone deployment
* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
```
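The two layers of indirection in the diagram can be sketched in plain Python (an illustration only: in the cluster the hostname rewrite is done by the ExternalName Services and the path routing by Ray Serve, not by client code; hostnames and paths are the ones listed above):

```python
UNIFIED_HOST = "ai-inference-serve-svc:8000"

# Legacy KServe-style predictor hostnames that the ExternalName Services
# alias to the unified RayService endpoint.
LEGACY_HOSTS = [
    "whisper-predictor.ai-ml",
    "tts-predictor.ai-ml",
    "embeddings-predictor.ai-ml",
    "reranker-predictor.ai-ml",
    "llm-predictor.ai-ml",
]

def resolve_host(host: str) -> str:
    """ExternalName layer: any legacy predictor host aliases the unified svc."""
    return UNIFIED_HOST if host in LEGACY_HOSTS else host

def route(host: str, model: str) -> str:
    """Ray Serve layer: the request path selects the deployment."""
    return f"http://{resolve_host(host)}/{model}"

# Old-style and new-style clients end up at the same URL:
assert route("whisper-predictor.ai-ml", "whisper") == route(UNIFIED_HOST, "whisper")
```

This is why the migration needed no client rewrites: legacy hostnames keep resolving, and only the path determines which model serves the request.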
## Migration Notes
1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
## Links
* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now a compatibility layer)