docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
# Use KubeRay as Unified GPU Backend

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
## Context and Problem Statement

We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:

1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed an entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns

How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers

* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability
## Considered Options

* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment
## Decision Outcome

Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.

The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
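To make the decision concrete, the fractional allocation and path-based routing can be expressed in the RayService's `serveConfigV2`. The fragment below is a minimal sketch, not the production manifest: the `import_path` modules and deployment class names are hypothetical placeholders, and only two of the five applications are shown.

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app        # hypothetical module
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5                 # half of the RTX 2070
      - name: tts
        route_prefix: /tts
        import_path: tts_app:app            # hypothetical module
        deployments:
          - name: TTSDeployment
            ray_actor_options:
              num_gpus: 0.5                 # the other half
```

Note that `num_gpus` is a logical share in Ray's resource accounting, so 0.5 + 0.5 packs both deployments onto one physical GPU; it does not enforce memory isolation, so the two models must genuinely fit in VRAM together.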
### Positive Consequences

* Fractional GPU: Whisper (0.5) + TTS (0.5) share the RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through the Ray Dashboard
* Hot-reloading of models without restarting pods
### Negative Consequences

* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)
## Pros and Cons of the Options

### Standalone KServe InferenceServices

* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types
### NVIDIA MPS for GPU sharing

* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node
### KubeRay RayService with Ray Serve

* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code
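The built-in autoscaling credited above is configured per deployment inside `serveConfigV2`. A hedged sketch of what that looks like (the deployment name is illustrative, and the request-target field has been renamed across Ray Serve versions, so check the version in use):

```yaml
deployments:
  - name: EmbeddingsDeployment      # illustrative name
    autoscaling_config:
      min_replicas: 1
      max_replicas: 2
      target_ongoing_requests: 8    # add a replica when per-replica load exceeds this
```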
### vLLM standalone deployment

* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires a dedicated GPU
## Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                          KubeRay RayService                          │
├──────────────────────────────────────────────────────────────────────┤
│  Service: ai-inference-serve-svc:8000                                │
│                                                                      │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐           │
│  │ /llm          │   │ /whisper      │   │ /tts          │           │
│  │ vLLM 70B      │   │ Whisper v3    │   │ XTTS          │           │
│  │ ───────────   │   │ ───────────   │   │ ───────────   │           │
│  │ khelben       │   │ elminster     │   │ elminster     │           │
│  │ Strix Halo    │   │ RTX 2070      │   │ RTX 2070      │           │
│  │ (0.95 GPU)    │   │ (0.5 GPU)     │   │ (0.5 GPU)     │           │
│  └───────────────┘   └───────────────┘   └───────────────┘           │
│                                                                      │
│  ┌───────────────┐   ┌───────────────┐                               │
│  │ /embeddings   │   │ /reranker     │                               │
│  │ BGE-Large     │   │ BGE-Reranker  │                               │
│  │ ───────────   │   │ ───────────   │                               │
│  │ drizzt        │   │ danilo        │                               │
│  │ Radeon 680M   │   │ Intel Arc     │                               │
│  │ (0.8 GPU)     │   │ (0.8 GPU)     │                               │
│  └───────────────┘   └───────────────┘                               │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      KServe Compatibility Layer                      │
├──────────────────────────────────────────────────────────────────────┤
│  ExternalName Services (KServe-style naming):                        │
│    • whisper-predictor.ai-ml    → ai-inference-serve-svc:8000        │
│    • tts-predictor.ai-ml        → ai-inference-serve-svc:8000        │
│    • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000        │
│    • reranker-predictor.ai-ml   → ai-inference-serve-svc:8000        │
│    • llm-predictor.ai-ml        → ai-inference-serve-svc:8000        │
└──────────────────────────────────────────────────────────────────────┘
```
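Each alias in the compatibility layer is a plain ExternalName Service. A sketch of one of them, assuming the `ai-ml` namespace used throughout this ADR (note that an ExternalName Service cannot carry a port, so clients still append `:8000` themselves):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
spec:
  type: ExternalName
  # Resolves the KServe-style name to the single Ray Serve endpoint.
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
```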
## Migration Notes

1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`
## Links

* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now an abstraction layer)