chore: Consolidate ADRs into decisions/ directory
- Added ADR-0016: Affine email verification strategy
- Moved ADRs 0019-0024 from docs/adr/ to decisions/
- Renamed to consistent format (removed ADR- prefix)
New file: decisions/0019-handler-deployment-strategy.md (365 lines)
# ADR-0019: Python Module Deployment Strategy

## Status

Accepted

## Date

2026-02-02
## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.) | **Yes** |
### Current Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         PLATFORM LAYERS                             │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines  │  KServe (visibility)  │  MLflow (registry)   │
│  [Orchestration]     │  [InferenceServices]  │  [Models/Metrics]    │
├─────────────────────────────────────────────────────────────────────┤
│                          RAY CLUSTER                                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ Ray Serve Applications (GPU inference)                         │ │
│  │  ├─ /llm        → VLLMDeployment       (khelben, 0.95 GPU)    │ │
│  │  ├─ /whisper    → WhisperDeployment    (elminster, 0.5 GPU)   │ │
│  │  ├─ /tts        → TTSDeployment        (elminster, 0.5 GPU)   │ │
│  │  ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)      │ │
│  │  └─ /reranker   → RerankerDeployment   (danilo, 0.8 GPU)      │ │
│  ├────────────────────────────────────────────────────────────────┤ │
│  │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│  │  ├─ /chat       → ChatHandler          (head node, 0 GPU)     │ │
│  │  └─ /voice      → VoiceHandler         (head node, 0 GPU)     │ │
│  └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)     │
└─────────────────────────────────────────────────────────────────────┘
```

The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs.
They should run inside the Ray cluster to:

1. Use Ray's internal calling (faster than HTTP)
2. Share observability (Ray Dashboard)
3. Leverage Ray's scheduling for resource management
## Decision

**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env`
to install Python packages from Gitea's package registry at deployment time.

### Why Ray Serve (not standalone containers)?

1. **Unified Platform**: Everything runs in Ray - inference AND orchestration
2. **Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP)
3. **Resource Sharing**: The Ray head node has spare CPU/memory for orchestration
4. **Single Observability**: Ray Dashboard shows all applications
5. **Simpler Ops**: One RayService to manage, not multiple Deployments
### Why runtime_env with pip (not baked into images)?

1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService
2. **Decoupled Releases**: Handlers update independently of worker images
3. **Smaller Images**: Worker images only need inference dependencies
4. **MLflow Integration**: Can version handlers as MLflow models if needed
## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:
```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build package
        run: uv build

      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```
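
For local development against the same registry, packages can be resolved via pip's extra-index mechanism. A minimal sketch - the helper name and the choice of `--extra-index-url` are illustrative; the registry URL is the one from the CI config above:

```python
# Illustrative helper: build a pip invocation that resolves handler
# packages from the Gitea PyPI registry (same URL as in the CI config).
GITEA_SIMPLE_INDEX = (
    "https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/"
)

def pip_install_command(package: str, min_version: str) -> list[str]:
    """Return argv for installing `package` from the Gitea index;
    transitive dependencies still come from public PyPI."""
    return [
        "pip", "install",
        "--extra-index-url", GITEA_SIMPLE_INDEX,
        f"{package}>={min_version}",
    ]

print(" ".join(pip_install_command("handler-base", "0.1.0")))
```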

### Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```
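
Since every CPU handler application follows the same shape, the entries could also be generated programmatically when templating the RayService. A dependency-free sketch - the function and parameter names are hypothetical:

```python
# Hypothetical generator for one CPU handler entry in serveConfigV2.
# Mirrors the YAML structure above; names and parameters are illustrative.
PYPI_INDEX = "https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/"

def handler_application(name: str, route: str, import_path: str,
                        deployment: str, packages: list[str],
                        env_vars: dict[str, str],
                        num_cpus: float = 0.5, replicas: int = 2) -> dict:
    """Build a Ray Serve application dict for a GPU-free handler."""
    return {
        "name": name,
        "route_prefix": route,
        "import_path": import_path,
        "runtime_env": {
            "pip": packages,
            "pip_find_links": [PYPI_INDEX],
            "env_vars": env_vars,
        },
        "deployments": [{
            "name": deployment,
            "num_replicas": replicas,
            # Handlers are orchestration-only: never request a GPU
            "ray_actor_options": {"num_cpus": num_cpus, "num_gpus": 0},
        }],
    }

app = handler_application(
    "chat-handler", "/chat", "chat_handler:app", "ChatDeployment",
    ["handler-base>=0.1.0", "chat-handler>=0.1.0"],
    {"NATS_URL": "nats://nats.ai-ml.svc.cluster.local:4222"},
)
```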
### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments that can
still optionally subscribe to NATS:

```python
# chat_handler.py (refactored)
from ray import serve

from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)."""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called via HTTP or NATS."""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```
### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings",
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
```
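
The handle-or-HTTP choice is easy to unit-test in isolation. A dependency-free model of the same pattern, where `FallbackClient` and the fake callables are stand-ins for the real client, Ray handle, and httpx call:

```python
# Dependency-free model of the fallback pattern: prefer the internal
# handle when one is available, otherwise use an HTTP callable.
# Nothing here imports ray or httpx; the callables are stand-ins.
from typing import Callable, Optional

class FallbackClient:
    def __init__(self, handle: Optional[Callable] = None,
                 http_call: Optional[Callable] = None):
        self._handle = handle
        self._http_call = http_call

    def embed(self, text: str) -> list[float]:
        if self._handle is not None:
            return self._handle(text)      # fast internal path
        return self._http_call(text)       # HTTP fallback

# Internal path wins when a handle is present
internal = FallbackClient(handle=lambda t: [0.0] * len(t))
print(internal.embed("hi"))  # → [0.0, 0.0]

# Without a handle, the HTTP callable is used
external = FallbackClient(http_call=lambda t: [1.0])
print(external.embed("hi"))  # → [1.0]
```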
### Phase 5: NATS Bridge (Optional)

If NATS integration is still wanted, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as a Ray actor, subscribes to NATS
import json

import ray
from ray import serve
import nats


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        payload = json.loads(msg.data)  # NATS delivers raw bytes
        result = await self.chat_handle.process_chat.remote(payload)
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
```
## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│  Developer pushes to handler repo                                  │
├────────────────────────────────────────────────────────────────────┤
│  1. Gitea Actions: lint → test                                     │
│  2. On tag: build wheel → publish to Gitea PyPI                    │
├────────────────────────────────────────────────────────────────────┤
│  3. Update RayService version in homelab-k8s2                      │
│     (bump handler-base>=0.2.0 in runtime_env)                      │
├────────────────────────────────────────────────────────────────────┤
│  4. Flux detects change → applies RayService                       │
│  5. Ray downloads new packages → restarts deployments              │
└────────────────────────────────────────────────────────────────────┘
```
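
Step 3 of the flow is a one-line change in git, but if it is automated (e.g. by a pipeline), it amounts to rewriting one pin in the runtime_env pip list. A sketch - the helper name is hypothetical:

```python
# Hypothetical helper for step 3: bump one package pin inside a
# RayService runtime_env pip list before committing it to homelab-k8s2.
def bump_requirement(pip_list: list[str], package: str, new_min: str) -> list[str]:
    """Replace `package>=x` with `package>=new_min`; leave other pins alone."""
    out = []
    for req in pip_list:
        name = req.split(">=")[0]
        out.append(f"{package}>={new_min}" if name == package else req)
    return out

pins = ["handler-base>=0.1.0", "chat-handler>=0.1.0"]
print(bump_requirement(pins, "handler-base", "0.2.0"))
# → ['handler-base>=0.2.0', 'chat-handler>=0.1.0']
```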
## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**

- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead on every inference call
- Separate observability stack
- Against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

**Rejected because:**

- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images
## Consequences

### Positive

- Single platform: everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in the Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler

### Negative

- Handlers share Ray head node resources
- Gitea PyPI authentication must be managed for runtime_env
- Slightly more complex RayService configuration

### Neutral

- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines
## References

- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md)
- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html)
- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/)
- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)