Use KubeRay as Unified GPU Backend
- Status: accepted
- Date: 2026-02-02
- Deciders: Billy Davies
- Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
- Complex scheduling across GPU types
- No GPU sharing (each pod claimed entire GPU)
- Multiple containers competing for GPU memory
- Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
Decision Drivers
- Fractional GPU allocation (multiple models per GPU)
- Unified endpoint for all AI services
- Heterogeneous GPU support (CUDA, ROCm, Intel)
- Simplified service discovery
- GPU memory optimization
- Single point of observability
Considered Options
- Standalone KServe InferenceServices per model
- NVIDIA MPS for GPU sharing
- KubeRay RayService with Ray Serve
- vLLM standalone deployment
Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
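The routing and fractional allocation both live in the RayService's serveConfigV2. A minimal sketch of what that configuration might look like (application names, module paths, and image details here are illustrative, not the actual repository values):

```yaml
# Illustrative serveConfigV2 fragment: each application gets a route prefix,
# and ray_actor_options.num_gpus expresses a fractional share of one GPU.
serveConfigV2: |
  applications:
    - name: whisper
      route_prefix: /whisper
      import_path: stt.app:deployment        # hypothetical module path
      deployments:
        - name: WhisperServer
          ray_actor_options:
            num_gpus: 0.5                    # half of the RTX 2070
    - name: tts
      route_prefix: /tts
      import_path: tts.app:deployment        # hypothetical module path
      deployments:
        - name: TTSServer
          ray_actor_options:
            num_gpus: 0.5                    # packs onto the same GPU
```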
Positive Consequences
- Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
- Single service endpoint: ai-inference-serve-svc:8000/{model}
- Path-based routing: /whisper, /tts, /llm, /embeddings, /reranker
- GPU-aware scheduling via Ray's resource system
- Unified metrics and logging through Ray Dashboard
- Hot-reloading of models without restarting pods
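Ray's fractional GPUs are logical accounting rather than hardware partitioning: the scheduler packs actors onto a GPU as long as their fractional requests do not exceed its capacity, and memory isolation is left to the models themselves. A pure-Python sketch of that bookkeeping, using the placements from this ADR (function and variable names are ours, not Ray's):

```python
# Sketch of Ray-style logical GPU accounting: an actor fits on a GPU while the
# sum of fractional requests on it stays <= 1.0. Illustrative only -- Ray does
# not enforce GPU-memory isolation between co-located actors.
from typing import Dict, List, Tuple

def pack(gpus: Dict[str, float], requests: List[Tuple[str, float]]) -> Dict[str, str]:
    """First-fit packing of (model, num_gpus) requests onto named GPUs."""
    free = dict(gpus)                      # remaining logical capacity per GPU
    placement: Dict[str, str] = {}
    for model, need in requests:
        for gpu, cap in free.items():
            if cap + 1e-9 >= need:         # tolerate float rounding
                free[gpu] = cap - need
                placement[model] = gpu
                break
        else:
            raise RuntimeError(f"no GPU has {need} free for {model}")
    return placement

# The fleet from this ADR, as logical capacities:
fleet = {"strix-halo": 1.0, "rtx-2070": 1.0, "radeon-680m": 1.0, "intel-arc": 1.0}
wanted = [("llm", 0.95), ("whisper", 0.5), ("tts", 0.5),
          ("embeddings", 0.8), ("reranker", 0.8)]
print(pack(fleet, wanted))
```

First-fit reproduces the layout in the architecture diagram: the LLM takes most of the Strix Halo, and Whisper plus TTS land together on the RTX 2070.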
Negative Consequences
- Ray cluster overhead (head node, dashboard)
- Learning curve for Ray Serve configuration
- Custom container images per GPU architecture
- Less granular scaling (RayService vs per-model replicas)
Pros and Cons of the Options
Standalone KServe InferenceServices
- Good, because simple per-model configuration
- Good, because independent scaling per model
- Good, because standard Kubernetes resources
- Bad, because no GPU sharing (1 GPU per pod)
- Bad, because multiple service endpoints
- Bad, because scheduling complexity across GPU types
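For contrast, a standalone InferenceService must request the GPU as a Kubernetes extended resource, which only accepts integer values; this is what forced the one-pod-per-GPU layout. A hedged sketch (image and names are hypothetical):

```yaml
# Illustrative standalone InferenceService: nvidia.com/gpu is integer-only,
# so each predictor pod claims an entire GPU even for a small model.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    containers:
      - name: kserve-container
        image: example.com/whisper:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: "1"             # no fractional values allowed
```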
NVIDIA MPS for GPU sharing
- Good, because transparent GPU sharing
- Good, because works with existing containers
- Bad, because NVIDIA-only (no ROCm, no Intel)
- Bad, because limited memory isolation
- Bad, because complex setup per node
KubeRay RayService with Ray Serve
- Good, because fractional GPU allocation
- Good, because unified endpoint
- Good, because multi-GPU-vendor support
- Good, because built-in autoscaling
- Good, because hot model reloading
- Bad, because Ray cluster overhead
- Bad, because custom Ray Serve deployment code
vLLM standalone deployment
- Good, because optimized for LLM inference
- Good, because OpenAI-compatible API
- Bad, because LLM-only (not STT/TTS/Embeddings)
- Bad, because requires dedicated GPU
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
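One of the alias Services behind the compatibility layer might look like the following sketch (only the whisper alias shown; names follow the diagram above):

```yaml
# KServe-style alias: whisper-predictor.ai-ml resolves to the Ray Serve
# service, so existing clients keep their old hostnames.
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
```

Note that ExternalName only rewrites DNS; it cannot remap ports, so clients must still target port 8000 as in the ai-inference-serve-svc:8000 convention.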
Migration Notes
- Removed: kubernetes/apps/ai-ml/llm-inference/ (llama.cpp proof-of-concept)
- Added: Ray Serve deployments in kuberay/app/rayservice.yaml
- Added: KServe-compatible ExternalName services in kuberay/app/services-ray-aliases.yaml
- Updated: all clients now use ai-inference-serve-svc:8000/{model}
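Since every model now hangs off the same service, clients can derive endpoints mechanically from the model name. A small helper along these lines (the function name is ours, not from the repository):

```python
# Builds the unified-endpoint URL for a model, following the
# ai-inference-serve-svc:8000/{model} convention in this ADR.
MODELS = {"whisper", "tts", "llm", "embeddings", "reranker"}

def endpoint_for(model: str, base: str = "http://ai-inference-serve-svc:8000") -> str:
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    return f"{base}/{model}"

print(endpoint_for("whisper"))
# -> http://ai-inference-serve-svc:8000/whisper
```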
Links
- Ray Serve
- KubeRay
- vLLM on Ray Serve
- Related: ADR-0005 - Multi-GPU strategy
- Related: ADR-0007 - KServe for inference (now abstraction layer)