docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -1,7 +1,7 @@
 # Use KServe for ML Model Serving
 
-* Status: accepted
-* Date: 2025-12-15
+* Status: superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md)
+* Date: 2025-12-15 (Updated: 2026-02-02)
 * Deciders: Billy Davies
 * Technical Story: Selecting model serving platform for inference services
 
@@ -30,6 +30,15 @@ We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference end
 
 Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
 
+**UPDATE (2026-02-02)**: While KServe remains installed, all GPU inference now runs on **KubeRay RayService with Ray Serve** (see [ADR-0011](0011-kuberay-unified-gpu-backend.md)). KServe now serves as an **abstraction layer** via ExternalName services that provide KServe-compatible naming (`{model}-predictor.ai-ml`) while routing to the unified Ray Serve endpoint.
+
+### Current Role of KServe
+
+KServe is retained for:
+- **Service naming convention**: `{model}-predictor.ai-ml.svc.cluster.local`
+- **Future flexibility**: Can be used for non-GPU models or canary deployments
+- **Kubeflow integration**: KServe InferenceServices appear in Kubeflow UI
+
 ### Positive Consequences
 
 * Standardized V2 inference protocol
@@ -90,26 +99,34 @@ Chosen option: "KServe InferenceService", because it provides a standardized, Ku
 
 ## Current Configuration
 
+KServe-compatible ExternalName services route to the unified Ray Serve endpoint:
+
 ```yaml
-apiVersion: serving.kserve.io/v1beta1
-kind: InferenceService
+# KServe-compatible service alias (services-ray-aliases.yaml)
+apiVersion: v1
+kind: Service
 metadata:
-  name: whisper
+  name: whisper-predictor
   namespace: ai-ml
+  labels:
+    serving.kserve.io/inferenceservice: whisper
 spec:
-  predictor:
-    minReplicas: 1
-    maxReplicas: 3
-    containers:
-      - name: whisper
-        image: ghcr.io/org/whisper:latest
-        resources:
-          limits:
-            nvidia.com/gpu: 1
+  type: ExternalName
+  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
+  ports:
+    - port: 8000
+      targetPort: 8000
+---
+# Usage: http://whisper-predictor.ai-ml.svc.cluster.local:8000/whisper/...
+# All traffic routes to Ray Serve, which handles GPU allocation
 ```
 
+For the actual Ray Serve configuration, see [ADR-0011](0011-kuberay-unified-gpu-backend.md).
+
 ## Links
 
 * [KServe](https://kserve.github.io)
 * [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
+* [KubeRay](https://ray-project.github.io/kuberay/)
 * Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation
+* Superseded by: [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay unified backend
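
A note on the naming convention in this change: clients reach each model through `{model}-predictor.ai-ml.svc.cluster.local` on port 8000, and the Whisper usage comment shows a `/whisper/` path prefix. The Python sketch below shows how a client might derive that base URL from a model name. The helper name `predictor_base_url` is hypothetical, and it assumes every model uses its own name as the route prefix, which the change only confirms for Whisper. The rest of the request path is left to the caller, since the change does not specify it.

```python
import re

# Values taken from the alias Service in this change.
NAMESPACE = "ai-ml"
PORT = 8000

# Kubernetes Service names must be valid DNS labels (lowercase alphanumerics
# and '-', at most 63 characters).
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def predictor_base_url(model: str, namespace: str = NAMESPACE, port: int = PORT) -> str:
    """Build the KServe-style base URL for a model (hypothetical helper).

    Assumption: the Ray Serve route prefix equals the model name, as in the
    Whisper example; other models may be routed differently.
    """
    service = f"{model}-predictor"
    if not _DNS_LABEL.fullmatch(model) or len(service) > 63:
        raise ValueError(f"invalid model name for a Service alias: {model!r}")
    return f"http://{service}.{namespace}.svc.cluster.local:{port}/{model}"


if __name__ == "__main__":
    print(predictor_base_url("whisper"))
```

Because the alias is an ExternalName Service, this hostname resolves through a DNS CNAME to the shared Ray Serve service, so callers keep the per-model name even though a single backend answers.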