# ADR-0019: Python Module Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.) | **Yes** |

### Current Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                           PLATFORM LAYERS                           │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines   │  KServe (visibility)  │  MLflow (registry)  │
│  [Orchestration]      │  [InferenceServices]  │  [Models/Metrics]   │
├─────────────────────────────────────────────────────────────────────┤
│                             RAY CLUSTER                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ Ray Serve Applications (GPU inference)                         │ │
│  │  ├─ /llm        → VLLMDeployment (khelben, 0.95 GPU)           │ │
│  │  ├─ /whisper    → WhisperDeployment (elminster, 0.5 GPU)       │ │
│  │  ├─ /tts        → TTSDeployment (elminster, 0.5 GPU)           │ │
│  │  ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)       │ │
│  │  └─ /reranker   → RerankerDeployment (danilo, 0.8 GPU)         │ │
│  ├────────────────────────────────────────────────────────────────┤ │
│  │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│  │  ├─ /chat  → ChatHandler (head node, 0 GPU)                    │ │
│  │  └─ /voice → VoiceHandler (head node, 0 GPU)                   │ │
│  └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)     │
└─────────────────────────────────────────────────────────────────────┘
```

The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs.
They should run inside the Ray cluster to:

1. Use Ray's internal calling (faster than HTTP)
2. Share observability (Ray Dashboard)
3. Leverage Ray's scheduling for resource management

## Decision

**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env`
to install Python packages from Gitea's package registry at deployment time.

### Why Ray Serve (not standalone containers)?

1. **Unified Platform**: Everything runs in Ray - inference AND orchestration
2. **Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP)
3. **Resource Sharing**: The Ray head node has spare CPU/memory for orchestration
4. **Single Observability**: Ray Dashboard shows all applications
5. **Simpler Ops**: One RayService to manage, not multiple Deployments

### Why runtime_env with pip (not baked into images)?

1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService
2. **Decoupled Releases**: Handlers update independently of worker images
3. **Smaller Images**: Worker images only need inference dependencies
4. **MLflow Integration**: Can version handlers as MLflow models if needed

## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build package
        run: uv build

      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```
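
`uv build` needs packaging metadata in each handler repo. A minimal `pyproject.toml` sketch — the name, version, dependencies, and build backend shown here are illustrative, not the actual repo contents:

```toml
# pyproject.toml (minimal sketch - fields are illustrative)
[project]
name = "chat-handler"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base>=0.1.0",
    "ray[serve]",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```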

### Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```
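
One cheap guardrail on this config: handler deployments must keep `num_gpus: 0`, or Ray will try to place them on GPU workers instead of the head node. A stdlib-only sketch of a CI check, where the inline `config` dict stands in for the parsed `serveConfigV2` (e.g. the output of `yaml.safe_load`):

```python
# Sketch: guard that handler apps request zero GPUs so they stay
# schedulable on the head node. `config` stands in for the parsed
# serveConfigV2 (e.g. from yaml.safe_load).
config = {
    "applications": [
        {
            "name": "chat-handler",
            "deployments": [
                {"name": "ChatDeployment",
                 "ray_actor_options": {"num_cpus": 0.5, "num_gpus": 0}},
            ],
        },
        {
            "name": "voice-assistant",
            "deployments": [
                {"name": "VoiceDeployment",
                 "ray_actor_options": {"num_cpus": 1, "num_gpus": 0}},
            ],
        },
    ]
}

def gpu_requesting(config: dict) -> list[str]:
    """Return names of deployments that (accidentally) request GPUs."""
    return [
        dep["name"]
        for app in config["applications"]
        for dep in app.get("deployments", [])
        if dep.get("ray_actor_options", {}).get("num_gpus", 0) != 0
    ]

assert gpu_requesting(config) == []
```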

### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments that can
optionally still subscribe to NATS:

```python
# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```
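
Because the core logic lives in a plain async method, it can be unit-tested without a Ray cluster. A sketch with stub clients — the stubs and their return values are illustrative stand-ins, not the real `handler_base` APIs:

```python
# Sketch: exercise the four-step chat pipeline with stub clients,
# no Ray cluster required. All stub return values are made up.
import asyncio

class StubEmbeddings:
    async def embed(self, text):
        return [0.1, 0.2, 0.3]

class StubMilvus:
    async def search(self, embedding, top_k=10):
        return [f"doc-{i}" for i in range(top_k)]

class StubReranker:
    async def rerank(self, query, results):
        return list(reversed(results))  # pretend best-last becomes best-first

class StubLLM:
    async def generate(self, query, context):
        return f"answer using {len(context)} sources"

async def process_chat(data, embeddings, milvus, reranker, llm):
    # Same four steps as ChatHandler.process_chat, with injected clients
    query = data["query"]
    embedding = await embeddings.embed(query)
    results = await milvus.search(embedding, top_k=10)
    reranked = await reranker.rerank(query, results)
    response = await llm.generate(query, context=reranked[:5])
    return {"response": response, "sources": reranked[:5]}

result = asyncio.run(process_chat(
    {"query": "hi"}, StubEmbeddings(), StubMilvus(), StubReranker(), StubLLM()
))
assert result["response"] == "answer using 5 sources"
```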

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings",
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        # HTTP fallback for external callers
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
            return resp.json()["data"][0]["embedding"]
```
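
Stripped of the Ray and httpx specifics, the client above is just a preference order over two transports. A pure-Python sketch of that pattern, with placeholder callables standing in for the Ray handle and the HTTP call:

```python
# Sketch of the "Ray handle first, HTTP fallback" selection pattern.
# via_handle/via_http are placeholders for the real transports.
import asyncio

class FallbackClient:
    def __init__(self, handle=None, http_call=None):
        self._handle = handle        # set when running inside Ray
        self._http_call = http_call  # used by external callers

    async def embed(self, text):
        if self._handle is not None:
            return await self._handle(text)  # fast internal path
        return await self._http_call(text)   # HTTP fallback

async def via_handle(text):
    return ("ray", text)

async def via_http(text):
    return ("http", text)

inside = asyncio.run(FallbackClient(handle=via_handle, http_call=via_http).embed("q"))
outside = asyncio.run(FallbackClient(http_call=via_http).embed("q"))
assert inside == ("ray", "q") and outside == ("http", "q")
```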

### Phase 5: NATS Bridge (Optional)

If we still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as a Ray actor, subscribes to NATS
import json

import nats
import ray
from ray import serve


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # msg.data is bytes; process_chat expects a dict
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        # Mirrors handle_chat (the voice handler's entrypoint name is analogous)
        result = await self.voice_handle.process_voice.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                     │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘
```
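
Step 3 is a one-line edit in homelab-k8s2; if we ever script it, the bump reduces to a regex substitution. A sketch — the `package>=X.Y.Z` pin format is an assumption about how the `runtime_env` entries are written:

```python
# Sketch: bump one package pin inside rayservice.yaml text.
# Assumes pins are written as "package>=X.Y.Z".
import re

def bump_pin(yaml_text: str, package: str, new_version: str) -> str:
    pattern = rf"({re.escape(package)}>=)\d+\.\d+\.\d+"
    return re.sub(pattern, rf"\g<1>{new_version}", yaml_text)

snippet = "pip:\n  - handler-base>=0.1.0\n  - chat-handler>=0.1.0\n"
bumped = bump_pin(snippet, "handler-base", "0.2.0")
assert "handler-base>=0.2.0" in bumped
assert "chat-handler>=0.1.0" in bumped  # other pins untouched
```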

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**
- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

**Rejected because:**
- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images

## Consequences

### Positive
- Single platform: Everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler

### Negative
- Handlers share Ray head node resources
- Need to manage Gitea PyPI authentication for runtime_env
- Slightly more complex RayService configuration

### Neutral
- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines

## References

- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md)
- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html)
- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/)
- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)