docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -31,22 +31,27 @@
## AI/ML Layer
### GPU Inference (KubeRay RayService)

| Service | Backend | GPU | Model Type |
|---------|---------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |

All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:
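A KubeRay `RayService` manifest for this layout might look roughly like the sketch below. This is illustrative only: the Serve application names, `import_path` modules, and container image are assumptions, not the cluster's actual manifest; only two of the five services are shown.

```yaml
# Illustrative sketch: a RayService exposing multiple Serve applications
# with fractional GPU requests. Import paths and image are hypothetical.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /llm
        import_path: llm_app:deployment        # hypothetical module
        deployments:
          - name: VLLMDeployment
            ray_actor_options:
              num_gpus: 0.95                   # fractional GPU share
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:deployment    # hypothetical module
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.53.0     # Ray Serve version from the stack table
```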
### ML Serving

| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |

**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
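Because `/whisper` and `/tts` share a single RTX 2070, fractional requests per node must sum to at most 1.0 GPU. A small Python sketch (node names and shares taken from the table; the URL helper is illustrative) that checks the packing and builds per-service URLs:

```python
# Fractional GPU shares from the ML Serving table, grouped by node.
# Ray Serve packs replicas onto one physical GPU as long as the
# requested fractions sum to at most 1.0.
ALLOCATIONS = {
    "khelben":   {"/llm": 0.95},
    "elminster": {"/whisper": 0.5, "/tts": 0.5},
    "drizzt":    {"/embeddings": 0.8},
    "danilo":    {"/reranker": 0.8},
}

BASE = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"


def gpu_load(node: str) -> float:
    """Sum of fractional GPU requests scheduled on a node's single GPU."""
    total = sum(ALLOCATIONS[node].values())
    if total > 1.0:
        raise ValueError(f"{node} oversubscribed: {total:.2f} > 1.0 GPU")
    return total


def url(service: str) -> str:
    """Full in-cluster URL for a Ray Serve route prefix, e.g. '/llm'."""
    return BASE + service


for node in ALLOCATIONS:
    print(f"{node}: {gpu_load(node):.2f} GPU requested")
```

The check mirrors what the Ray scheduler enforces at placement time: a replica whose fraction would push a GPU past 1.0 stays pending.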
### ML Serving Stack

| Component | Version | Purpose |
|-----------|---------|---------|
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |
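The "ExternalName aliases" in the KServe row are plain Kubernetes Services of type `ExternalName` that give callers a stable per-model name resolving to the shared Ray Serve endpoint. A minimal sketch, where the alias name is a hypothetical example:

```yaml
# Hypothetical alias: clients resolve a stable service name that
# CNAMEs to the shared Ray Serve service in the ai-ml namespace.
apiVersion: v1
kind: Service
metadata:
  name: llm-inference          # illustrative alias name
  namespace: ai-ml
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
```

This keeps client configuration unchanged if the backing Ray Serve deployment is renamed or moved.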
### ML Workflows