# Use KubeRay as Unified GPU Backend

* Status: accepted
* Date: 2026-02-02
* Deciders: Billy Davies
* Technical Story: Consolidating GPU inference workloads onto a single Ray cluster

## Context and Problem Statement

We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:

1. Complex scheduling across GPU types
2. No GPU sharing (each pod claimed an entire GPU)
3. Multiple containers competing for GPU memory
4. Inconsistent service discovery patterns

How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
## Decision Drivers

* Fractional GPU allocation (multiple models per GPU)
* Unified endpoint for all AI services
* Heterogeneous GPU support (CUDA, ROCm, Intel)
* Simplified service discovery
* GPU memory optimization
* Single point of observability

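The fractional-allocation driver is worth making concrete. Ray treats `num_gpus` as a divisible resource and places deployments onto devices with enough remaining capacity. The sketch below is a toy first-fit model of that accounting, not Ray's actual scheduler (which also weighs memory, locality, and node labels); the GPU names and fractions mirror the allocations described later in this ADR.

```python
# Toy model of fractional GPU packing, illustrating why 0.5 + 0.5
# deployments can share one physical GPU. Not Ray's scheduler code.

def pack(deployments, gpus):
    """Greedily assign each (name, gpu_fraction) pair to the first GPU
    with enough remaining capacity. Returns {gpu: [deployment names]}."""
    remaining = {gpu: 1.0 for gpu in gpus}
    placement = {gpu: [] for gpu in gpus}
    for name, fraction in deployments:
        for gpu in gpus:
            if remaining[gpu] + 1e-9 >= fraction:
                remaining[gpu] -= fraction
                placement[gpu].append(name)
                break
        else:
            raise RuntimeError(f"no GPU can fit {name} ({fraction})")
    return placement

# Fractions from this ADR's allocation plan.
deployments = [
    ("llm", 0.95),      # vLLM on Strix Halo
    ("whisper", 0.5),   # shares the RTX 2070...
    ("tts", 0.5),       # ...with TTS
    ("embeddings", 0.8),
    ("reranker", 0.8),
]
gpus = ["strix-halo", "rtx-2070", "radeon-680m", "intel-arc"]
print(pack(deployments, gpus))
```

With these numbers, first-fit lands Whisper and TTS on the same device, which is exactly the sharing the standalone-KServe design could not express.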
## Considered Options

* Standalone KServe InferenceServices per model
* NVIDIA MPS for GPU sharing
* KubeRay RayService with Ray Serve
* vLLM standalone deployment

## Decision Outcome

Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.

The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.

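The fractional allocations are declared in the RayService manifest. A trimmed sketch of what `kuberay/app/rayservice.yaml` could look like — `serveConfigV2`, `route_prefix`, and `ray_actor_options.num_gpus` are real KubeRay/Ray Serve fields, but the application names, import paths, and deployment names here are illustrative assumptions:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app   # assumed module path
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5            # half of the RTX 2070
      - name: tts
        route_prefix: /tts
        import_path: tts_app:app       # assumed module path
        deployments:
          - name: TTSDeployment
            ray_actor_options:
              num_gpus: 0.5            # the other half
```

Each application gets its own `route_prefix` under the single Serve service, which is what produces the `/whisper`, `/tts`, etc. paths listed below.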
### Positive Consequences

* Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
* Single service endpoint: `ai-inference-serve-svc:8000/{model}`
* Path-based routing: `/whisper`, `/tts`, `/llm`, `/embeddings`, `/reranker`
* GPU-aware scheduling via Ray's resource system
* Unified metrics and logging through Ray Dashboard
* Hot-reloading of models without restarting pods

### Negative Consequences

* Ray cluster overhead (head node, dashboard)
* Learning curve for Ray Serve configuration
* Custom container images per GPU architecture
* Less granular scaling (RayService vs per-model replicas)

## Pros and Cons of the Options
### Standalone KServe InferenceServices

* Good, because simple per-model configuration
* Good, because independent scaling per model
* Good, because standard Kubernetes resources
* Bad, because no GPU sharing (1 GPU per pod)
* Bad, because multiple service endpoints
* Bad, because scheduling complexity across GPU types

### NVIDIA MPS for GPU sharing

* Good, because transparent GPU sharing
* Good, because works with existing containers
* Bad, because NVIDIA-only (no ROCm, no Intel)
* Bad, because limited memory isolation
* Bad, because complex setup per node

### KubeRay RayService with Ray Serve

* Good, because fractional GPU allocation
* Good, because unified endpoint
* Good, because multi-GPU-vendor support
* Good, because built-in autoscaling
* Good, because hot model reloading
* Bad, because Ray cluster overhead
* Bad, because custom Ray Serve deployment code

### vLLM standalone deployment

* Good, because optimized for LLM inference
* Good, because OpenAI-compatible API
* Bad, because LLM-only (not STT/TTS/Embeddings)
* Bad, because requires dedicated GPU

## Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                         KubeRay RayService                         │
├────────────────────────────────────────────────────────────────────┤
│  Service: ai-inference-serve-svc:8000                              │
│                                                                    │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐         │
│  │ /llm          │   │ /whisper      │   │ /tts          │         │
│  │ vLLM 70B      │   │ Whisper v3    │   │ XTTS          │         │
│  │ ───────────   │   │ ───────────   │   │ ───────────   │         │
│  │ khelben       │   │ elminster     │   │ elminster     │         │
│  │ Strix Halo    │   │ RTX 2070      │   │ RTX 2070      │         │
│  │ (0.95 GPU)    │   │ (0.5 GPU)     │   │ (0.5 GPU)     │         │
│  └───────────────┘   └───────────────┘   └───────────────┘         │
│                                                                    │
│  ┌───────────────┐   ┌───────────────┐                             │
│  │ /embeddings   │   │ /reranker     │                             │
│  │ BGE-Large     │   │ BGE-Reranker  │                             │
│  │ ───────────   │   │ ───────────   │                             │
│  │ drizzt        │   │ danilo        │                             │
│  │ Radeon 680M   │   │ Intel Arc     │                             │
│  │ (0.8 GPU)     │   │ (0.8 GPU)     │                             │
│  └───────────────┘   └───────────────┘                             │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                     KServe Compatibility Layer                     │
├────────────────────────────────────────────────────────────────────┤
│  ExternalName Services (KServe-style naming):                      │
│    • whisper-predictor.ai-ml    → ai-inference-serve-svc:8000      │
│    • tts-predictor.ai-ml        → ai-inference-serve-svc:8000      │
│    • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000      │
│    • reranker-predictor.ai-ml   → ai-inference-serve-svc:8000      │
│    • llm-predictor.ai-ml        → ai-inference-serve-svc:8000      │
└────────────────────────────────────────────────────────────────────┘
```
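The compatibility layer in the diagram is built from plain Kubernetes ExternalName Services. A sketch of one such alias, following the pattern in `kuberay/app/services-ray-aliases.yaml` — `Service`, `type: ExternalName`, and `externalName` are standard Kubernetes fields, but the exact metadata and the assumption that the Ray service lives in the same `ai-ml` namespace are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
spec:
  type: ExternalName
  # DNS CNAME to the unified Ray Serve service (assumed same namespace).
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
  ports:
    - port: 8000   # informational for ExternalName; clients still dial :8000
```

Because ExternalName resolution is a DNS CNAME, existing clients that dial `whisper-predictor.ai-ml` keep working without code changes while traffic actually lands on the Ray Serve endpoint.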
## Migration Notes

1. **Removed**: `kubernetes/apps/ai-ml/llm-inference/` - llama.cpp proof-of-concept
2. **Added**: Ray Serve deployments in `kuberay/app/rayservice.yaml`
3. **Added**: KServe-compatible ExternalName services in `kuberay/app/services-ray-aliases.yaml`
4. **Updated**: All clients now use `ai-inference-serve-svc:8000/{model}`

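For item 4, the client-side change is mechanical: the per-model predictor hosts collapse into one host with a path per model. A hypothetical helper illustrating the rewrite (the URL shapes follow this ADR; no real HTTP call is made, and the old-endpoint form is an assumption about the prior KServe naming):

```python
# Hypothetical illustration of the client endpoint migration.
# Old: one KServe predictor service per model.
# New: one Ray Serve service with path-based routing.

MODELS = ["llm", "whisper", "tts", "embeddings", "reranker"]

def old_url(model: str) -> str:
    # KServe-style per-model endpoint (pre-migration, assumed form)
    return f"http://{model}-predictor.ai-ml"

def new_url(model: str) -> str:
    # Unified Ray Serve endpoint with path-based routing
    return f"http://ai-inference-serve-svc:8000/{model}"

for m in MODELS:
    print(f"{old_url(m)} -> {new_url(m)}")
```

Thanks to the ExternalName aliases above, unmigrated clients on the old names continue to resolve; the rewrite is an optimization, not a hard cutover.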
## Links

* [Ray Serve](https://docs.ray.io/en/latest/serve/)
* [KubeRay](https://ray-project.github.io/kuberay/)
* [vLLM on Ray Serve](https://docs.vllm.ai/en/latest/serving/distributed_serving.html)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - Multi-GPU strategy
* Related: [ADR-0007](0007-use-kserve-for-inference.md) - KServe for inference (now abstraction layer)