docs: add ADR-0011 (KubeRay), ADR-0012 (uv), update architecture docs
@@ -31,22 +31,27 @@
## AI/ML Layer
### GPU Inference (KubeRay RayService)

| Service | Backend | GPU | Model Type |
|---------|---------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |

All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation:
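A KubeRay `RayService` manifest for this layout might look roughly like the sketch below. This is illustrative only: the Serve application names, `import_path` modules, and container image are assumptions, not the cluster's actual manifest; only two of the five services are shown.

```yaml
# Illustrative sketch: a RayService exposing multiple Serve applications
# with fractional GPU requests. Import paths and image are hypothetical.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ai-inference
  namespace: ai-ml
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /llm
        import_path: llm_app:deployment        # hypothetical module
        deployments:
          - name: VLLMDeployment
            ray_actor_options:
              num_gpus: 0.95                   # fractional GPU share
      - name: whisper
        route_prefix: /whisper
        import_path: whisper_app:deployment    # hypothetical module
        deployments:
          - name: WhisperDeployment
            ray_actor_options:
              num_gpus: 0.5
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.53.0     # Ray Serve version from the stack table
```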
### ML Serving

| Service | Model | GPU Node | GPU Type | Allocation |
|---------|-------|----------|----------|------------|
| `/llm` | [vLLM](https://vllm.ai) (Llama 3.1 70B) | khelben | AMD Strix Halo 64GB | 0.95 GPU |
| `/whisper` | [faster-whisper](https://github.com/guillaumekln/faster-whisper) v3 | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/tts` | [XTTS](https://github.com/coqui-ai/TTS) | elminster | NVIDIA RTX 2070 8GB | 0.5 GPU |
| `/embeddings` | [BGE-Large](https://huggingface.co/BAAI/bge-large-en-v1.5) | drizzt | AMD Radeon 680M 12GB | 0.8 GPU |
| `/reranker` | [BGE-Reranker](https://huggingface.co/BAAI/bge-reranker-large) | danilo | Intel Arc 16GB | 0.8 GPU |

**Endpoint**: `ai-inference-serve-svc.ai-ml.svc.cluster.local:8000/{service}`
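Because `/whisper` and `/tts` share a single RTX 2070, fractional requests per node must sum to at most 1.0 GPU. A small Python sketch (node names and shares taken from the table; the URL helper is illustrative) that checks the packing and builds per-service URLs:

```python
# Fractional GPU shares from the ML Serving table, grouped by node.
# Ray Serve packs replicas onto one physical GPU as long as the
# requested fractions sum to at most 1.0.
ALLOCATIONS = {
    "khelben":   {"/llm": 0.95},
    "elminster": {"/whisper": 0.5, "/tts": 0.5},
    "drizzt":    {"/embeddings": 0.8},
    "danilo":    {"/reranker": 0.8},
}

BASE = "http://ai-inference-serve-svc.ai-ml.svc.cluster.local:8000"


def gpu_load(node: str) -> float:
    """Sum of fractional GPU requests scheduled on a node's single GPU."""
    total = sum(ALLOCATIONS[node].values())
    if total > 1.0:
        raise ValueError(f"{node} oversubscribed: {total:.2f} > 1.0 GPU")
    return total


def url(service: str) -> str:
    """Full in-cluster URL for a Ray Serve route prefix, e.g. '/llm'."""
    return BASE + service


for node in ALLOCATIONS:
    print(f"{node}: {gpu_load(node):.2f} GPU requested")
```

The check mirrors what the Ray scheduler enforces at placement time: a replica whose fraction would push a GPU past 1.0 stays pending.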
### ML Serving Stack

| Component | Version | Purpose |
|-----------|---------|---------|
| [KubeRay](https://ray-project.github.io/kuberay/) | 1.4+ | Ray cluster operator |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
| [KServe](https://kserve.github.io) | v0.12+ | Abstraction layer (ExternalName aliases) |
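The "ExternalName aliases" in the KServe row are plain Kubernetes Services of type `ExternalName` that give callers a stable per-model name resolving to the shared Ray Serve endpoint. A minimal sketch, where the alias name is a hypothetical example:

```yaml
# Hypothetical alias: clients resolve a stable service name that
# CNAMEs to the shared Ray Serve service in the ai-ml namespace.
apiVersion: v1
kind: Service
metadata:
  name: llm-inference          # illustrative alias name
  namespace: ai-ml
spec:
  type: ExternalName
  externalName: ai-inference-serve-svc.ai-ml.svc.cluster.local
```

This keeps client configuration unchanged if the backing Ray Serve deployment is renamed or moved.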
### ML Workflows