docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs

2026-02-02 07:10:47 -05:00
parent b6f7605fab
commit 598875c5a9
6 changed files with 438 additions and 35 deletions


@@ -31,22 +31,27 @@
## AI/ML Layer
### Inference Engines
### GPU Inference (KubeRay RayService)
| Service | Framework | GPU | Model Type |
|---------|-----------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |
All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:
### ML Serving
| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |
**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
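The unified endpoint above is plain HTTP, so any in-cluster client can reach it with the standard library. A minimal sketch follows; only the base URL and route prefixes come from the table, while the JSON payload shape (`{"texts": [...]}` etc.) is an assumption and depends on each deployment's request schema:

```python
import json
import urllib.request

# Cluster-internal DNS name of the unified Ray Serve endpoint
# (from the table above; reachable only from inside the cluster).
BASE_URL = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"

def endpoint(service: str) -> str:
    """Build the full URL for one of the Ray Serve route prefixes."""
    return f"{BASE_URL}/{service.strip('/')}"

def post_json(service: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON payload to a service route and decode the JSON reply."""
    req = urllib.request.Request(
        endpoint(service),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (only works in-cluster; payload shape is hypothetical):
# reply = post_json("embeddings", {"texts": ["hello world"]})
```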
### ML Serving Stack
| Component | Version | Purpose |
|-----------|---------|---------|
| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |
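The fractional allocations in the ML Serving table map to `ray_actor_options.num_gpus` in the RayService's `serveConfigV2`. A sketch of one such deployment is below; the resource name, `import_path`, and deployment name are illustrative, while the namespace, route prefix, and 0.5-GPU share come from the tables above:

```yaml
# Illustrative KubeRay RayService fragment: fractional GPU allocation.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference          # hypothetical resource name
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:app   # hypothetical module:app
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5   # shares elminster's RTX 2070 with /tts
```

Because both `/whisper` and `/tts` request 0.5 GPU, Ray schedules them onto the same physical RTX 2070; fractional values are a scheduling hint, not hardware partitioning, so the two deployments must fit in the card's 8 GB together.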
### ML Workflows