Use KubeRay as Unified GPU Backend
- Status: accepted
- Date: 2026-02-02
- Deciders: Billy Davies
- Technical Story: Consolidating GPU inference workloads onto a single Ray cluster
Context and Problem Statement
We have multiple AI inference services (LLM, STT, TTS, Embeddings, Reranker) running on a heterogeneous GPU fleet (AMD Strix Halo, NVIDIA RTX 2070, AMD 680M iGPU, Intel Arc). Initially, each service was deployed as a standalone KServe InferenceService, including a llama.cpp proof-of-concept for LLM inference. This resulted in:
- Complex scheduling across GPU types
- No GPU sharing (each pod claimed entire GPU)
- Multiple containers competing for GPU memory
- Inconsistent service discovery patterns
How do we efficiently utilize our GPU fleet while providing unified inference endpoints?
Decision Drivers
- Fractional GPU allocation (multiple models per GPU)
- Unified endpoint for all AI services
- Heterogeneous GPU support (CUDA, ROCm, Intel)
- Simplified service discovery
- GPU memory optimization
- Single point of observability
Considered Options
- Standalone KServe InferenceServices per model
- NVIDIA MPS for GPU sharing
- KubeRay RayService with Ray Serve
- vLLM standalone deployment
Decision Outcome
Chosen option: "KubeRay RayService with Ray Serve", because it provides native fractional GPU allocation, supports all GPU types, and unifies all inference services behind a single endpoint with path-based routing.
The llama.cpp proof-of-concept has been deprecated and removed. vLLM now runs as a Ray Serve deployment within the RayService.
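The routing and fractional allocation both live in the RayService's serveConfigV2. A minimal sketch of what that configuration might look like (application names, module paths, and image details here are illustrative, not the actual repository values):

```yaml
# Illustrative serveConfigV2 fragment: each application gets a route prefix,
# and ray_actor_options.num_gpus expresses a fractional share of one GPU.
serveConfigV2: |
  applications:
    - name: whisper
      route_prefix: /whisper
      import_path: stt.app:deployment        # hypothetical module path
      deployments:
        - name: WhisperServer
          ray_actor_options:
            num_gpus: 0.5                    # half of the RTX 2070
    - name: tts
      route_prefix: /tts
      import_path: tts.app:deployment        # hypothetical module path
      deployments:
        - name: TTSServer
          ray_actor_options:
            num_gpus: 0.5                    # packs onto the same GPU
```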
Positive Consequences
- Fractional GPU: Whisper (0.5) + TTS (0.5) share RTX 2070
- Single service endpoint: ai-inference-serve-svc:8000/{model}
- Path-based routing: /whisper, /tts, /llm, /embeddings, /reranker
- GPU-aware scheduling via Ray's resource system
- Unified metrics and logging through Ray Dashboard
- Hot-reloading of models without restarting pods
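Ray's fractional GPUs are logical accounting rather than hardware partitioning: the scheduler packs actors onto a GPU as long as their fractional requests do not exceed its capacity, and memory isolation is left to the models themselves. A pure-Python sketch of that bookkeeping, using the placements from this ADR (function and variable names are ours, not Ray's):

```python
# Sketch of Ray-style logical GPU accounting: an actor fits on a GPU while the
# sum of fractional requests on it stays <= 1.0. Illustrative only -- Ray does
# not enforce GPU-memory isolation between co-located actors.
from typing import Dict, List, Tuple

def pack(gpus: Dict[str, float], requests: List[Tuple[str, float]]) -> Dict[str, str]:
    """First-fit packing of (model, num_gpus) requests onto named GPUs."""
    free = dict(gpus)                      # remaining logical capacity per GPU
    placement: Dict[str, str] = {}
    for model, need in requests:
        for gpu, cap in free.items():
            if cap + 1e-9 >= need:         # tolerate float rounding
                free[gpu] = cap - need
                placement[model] = gpu
                break
        else:
            raise RuntimeError(f"no GPU has {need} free for {model}")
    return placement

# The fleet from this ADR, as logical capacities:
fleet = {"strix-halo": 1.0, "rtx-2070": 1.0, "radeon-680m": 1.0, "intel-arc": 1.0}
wanted = [("llm", 0.95), ("whisper", 0.5), ("tts", 0.5),
          ("embeddings", 0.8), ("reranker", 0.8)]
print(pack(fleet, wanted))
```

First-fit reproduces the layout in the architecture diagram: the LLM takes most of the Strix Halo, and Whisper plus TTS land together on the RTX 2070.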
Negative Consequences
- Ray cluster overhead (head node, dashboard)
- Learning curve for Ray Serve configuration
- Custom container images per GPU architecture
- Less granular scaling (RayService vs per-model replicas)
Pros and Cons of the Options
Standalone KServe InferenceServices
- Good, because simple per-model configuration
- Good, because independent scaling per model
- Good, because standard Kubernetes resources
- Bad, because no GPU sharing (1 GPU per pod)
- Bad, because multiple service endpoints
- Bad, because scheduling complexity across GPU types
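For contrast, a standalone InferenceService must request the GPU as a Kubernetes extended resource, which only accepts integer values; this is what forced the one-pod-per-GPU layout. A hedged sketch (image and names are hypothetical):

```yaml
# Illustrative standalone InferenceService: nvidia.com/gpu is integer-only,
# so each predictor pod claims an entire GPU even for a small model.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    containers:
      - name: kserve-container
        image: example.com/whisper:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: "1"             # no fractional values allowed
```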
NVIDIA MPS for GPU sharing
- Good, because transparent GPU sharing
- Good, because works with existing containers
- Bad, because NVIDIA-only (no ROCm, no Intel)
- Bad, because limited memory isolation
- Bad, because complex setup per node
KubeRay RayService with Ray Serve
- Good, because fractional GPU allocation
- Good, because unified endpoint
- Good, because multi-GPU-vendor support
- Good, because built-in autoscaling
- Good, because hot model reloading
- Bad, because Ray cluster overhead
- Bad, because custom Ray Serve deployment code
vLLM standalone deployment
- Good, because optimized for LLM inference
- Good, because OpenAI-compatible API
- Bad, because LLM-only (not STT/TTS/Embeddings)
- Bad, because requires dedicated GPU
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ KubeRay RayService │
├─────────────────────────────────────────────────────────────────────────────┤
│ Service: ai-inference-serve-svc:8000 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /llm │ │ /whisper │ │ /tts │ │
│ │ vLLM 70B │ │ Whisper v3 │ │ XTTS │ │
│ │ ─────────── │ │ ─────────── │ │ ─────────── │ │
│ │ khelben │ │ elminster │ │ elminster │ │
│ │ Strix Halo │ │ RTX 2070 │ │ RTX 2070 │ │
│ │ (0.95 GPU) │ │ (0.5 GPU) │ │ (0.5 GPU) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /embeddings │ │ /reranker │ │
│ │ BGE-Large │ │ BGE-Reranker │ │
│ │ ─────────── │ │ ─────────── │ │
│ │ drizzt │ │ danilo │ │
│ │ Radeon 680M │ │ Intel Arc │ │
│ │ (0.8 GPU) │ │ (0.8 GPU) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ KServe Compatibility Layer │
├─────────────────────────────────────────────────────────────────────────────┤
│ ExternalName Services (KServe-style naming): │
│ • whisper-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • tts-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • embeddings-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • reranker-predictor.ai-ml → ai-inference-serve-svc:8000 │
│ • llm-predictor.ai-ml → ai-inference-serve-svc:8000 │
└─────────────────────────────────────────────────────────────────────────────┘
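One of the alias Services behind the compatibility layer might look like the following sketch (only the whisper alias shown; names follow the diagram above):

```yaml
# KServe-style alias: whisper-predictor.ai-ml resolves to the Ray Serve
# service, so existing clients keep their old hostnames.
apiVersion: v1
kind: Service
metadata:
  name: whisper-predictor
  namespace: ai-ml
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
```

Note that ExternalName only rewrites DNS; it cannot remap ports, so clients must still target port 8000 as in the ai-inference-serve-svc:8000 convention.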
Migration Notes
- Removed: kubernetes/apps/ai-ml/llm-inference/ (llama.cpp proof-of-concept)
- Added: Ray Serve deployments in kuberay/app/rayservice.yaml
- Added: KServe-compatible ExternalName services in kuberay/app/services-ray-aliases.yaml
- Updated: all clients now use ai-inference-serve-svc:8000/{model}
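Since every model now hangs off the same service, clients can derive endpoints mechanically from the model name. A small helper along these lines (the function name is ours, not from the repository):

```python
# Builds the unified-endpoint URL for a model, following the
# ai-inference-serve-svc:8000/{model} convention in this ADR.
MODELS = {"whisper", "tts", "llm", "embeddings", "reranker"}

def endpoint_for(model: str, base: str = "http://ai-inference-serve-svc:8000") -> str:
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    return f"{base}/{model}"

print(endpoint_for("whisper"))
# -> http://ai-inference-serve-svc:8000/whisper
```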
Links
- Ray Serve
- KubeRay
- vLLM on Ray Serve
- Related: ADR-0005 - Multi-GPU strategy
- Related: ADR-0007 - KServe for inference (now abstraction layer)