docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions


@@ -1,7 +1,7 @@
 # Use KServe for ML Model Serving
-* Status: accepted
-* Date: 2025-12-15
+* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
+* Date: 2025-12-15 (Updated: 2026-02-02)
 * Deciders: Billy Davies
 * Technical Story: Selecting model serving platform for inference services
@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints
 Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
+
+**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.
+
+### Current Role of KServe
+
+KServe is retained for:
+
+- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
+- **Future flexibility**: can be used for non-GPU models or canary deployments
+- **Kubeflow integration**: KServe InferenceServices appear in the Kubeflow UI
 ### Positive Consequences
 * Standardized V2 inference protocol
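The value of the naming convention is that consumers never see the backend swap: anything that called the KServe predictor URL keeps working against the alias. A minimal sketch of a hypothetical consumer, with the workload name and image invented purely for illustration:

```yaml
# Hypothetical consumer pinned to the KServe-style alias, not to Ray Serve directly.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcription-api        # invented name, for illustration only
  namespace: ai-ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: transcription-api
  template:
    metadata:
      labels:
        app: transcription-api
    spec:
      containers:
      - name: api
        image: ghcr.io/org/transcription-api:latest   # invented image
        env:
        # Same URL before and after the migration to Ray Serve:
        - name: WHISPER_URL
          value: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper
```

If the backend moves again, only the ExternalName target changes; consumers like this stay untouched.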
@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
 ## Current Configuration
+KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
+
 ```yaml
-apiVersion: serving.kserve.io/v1beta1
-kind: InferenceService
+# KServe-compatible service alias (services-ray-aliases.yaml)
+apiVersion: v1
+kind: Service
 metadata:
-  name: whisper
+  name: whisper-predictor
   namespace: ai-ml
-  labels:
-    serving.kserve.io/inferenceservice: whisper
 spec:
-  predictor:
-    minReplicas: 1
-    maxReplicas: 3
-    containers:
-    - name: whisper
-      image: ghcr.io/org/whisper:latest
-      resources:
-        limits:
-          nvidia.com/gpu: 1
+  type: ExternalName
+  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
+  ports:
+  - port: 8000
+    targetPort: 8000
+---
+# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
+# All traffic routes to Ray Serve, which handles GPU allocation
 ```
+
+For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).
 ## Links
 * [KServe](https://kserve.github.io)
 * [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
+* [KubeRay](https://ray-project.github.io/kuberay/)
 * Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
+* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend
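For orientation, here is a minimal sketch of the RayService that the `ai-inference-serve-svc` target implies; the authoritative spec lives in ADR-0011, and the application names, import paths, and image tag below are assumptions. KubeRay exposes Ray Serve for a RayService named `ai-inference` as `ai-inference-serve-svc` on port 8000, which is exactly what the ExternalName aliases point at:

```yaml
# Sketch only -- see ADR-0011 for the real configuration.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference             # KubeRay derives the ai-inference-serve-svc Service from this name
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
    - name: whisper
      route_prefix: /whisper     # matches the /whisper path in the alias usage note
      import_path: whisper_app:app        # assumed import path
    # xtts, bge, and vllm applications would follow the same pattern
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # image tag assumed
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0   # image tag assumed
            resources:
              limits:
                nvidia.com/gpu: 1
```

Because the alias layer sits in front, this RayService could be renamed or restructured with a one-line change to each ExternalName target.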