# ADR-0019: Python Module Deployment Strategy
## Status

Accepted

## Date

2026-02-02
## Context
We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:
| Repo | Purpose | Needs GPU? |
|---|---|---|
| handler-base | Shared library (NATS, clients, telemetry) | No |
| chat-handler | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| voice-assistant | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| pipeline-bridge | Kubeflow ↔ NATS integration | No |
| kuberay-images/ray-serve/ | Inference deployments (Whisper, TTS, LLM, etc.) | Yes |
### Current Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│ PLATFORM LAYERS │
├─────────────────────────────────────────────────────────────────────┤
│ Kubeflow Pipelines │ KServe (visibility) │ MLflow (registry) │
│ [Orchestration] │ [InferenceServices] │ [Models/Metrics] │
├─────────────────────────────────────────────────────────────────────┤
│ RAY CLUSTER │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Ray Serve Applications (GPU inference) │ │
│ │ ├─ /llm → VLLMDeployment (khelben, 0.95 GPU) │ │
│ │ ├─ /whisper → WhisperDeployment (elminster, 0.5 GPU) │ │
│ │ ├─ /tts → TTSDeployment (elminster, 0.5 GPU) │ │
│ │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU) │ │
│ │ └─ /reranker → RerankerDeployment (danilo, 0.8 GPU) │ │
│ ├────────────────────────────────────────────────────────────────┤ │
│ │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│ │ ├─ /chat → ChatHandler (head node, 0 GPU) │ │
│ │ └─ /voice → VoiceHandler (head node, 0 GPU) │ │
│ └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ RayJob (batch/training) │ NATS (events) │ Milvus (vectors) │
└─────────────────────────────────────────────────────────────────────┘
```
The key insight is that handlers ARE Ray Serve applications - they just don't need GPUs. They should run inside the Ray cluster to:
- Use Ray's internal calling (faster than HTTP)
- Share observability (Ray Dashboard)
- Leverage Ray's scheduling for resource management
## Decision
Deploy handlers as Ray Serve applications inside the Ray cluster, using runtime_env
to install Python packages from Gitea's package registry at deployment time.
### Why Ray Serve (not standalone containers)?
- Unified Platform: Everything runs in Ray - inference AND orchestration
- Internal Calls: Handlers can call inference deployments via Ray handles (no HTTP)
- Resource Sharing: Ray head node has spare CPU/memory for orchestration
- Single Observability: Ray Dashboard shows all applications
- Simpler Ops: One RayService to manage, not multiple Deployments
### Why runtime_env with pip (not baked into images)?
- Faster Iteration: Change handler code → push to PyPI → redeploy RayService
- Decoupled Releases: Handlers update independently of worker images
- Smaller Images: Worker images only need inference dependencies
- MLflow Integration: Can version handlers as MLflow models if needed
## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI
Each handler repo publishes to Gitea's built-in package registry on release:
```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job
  test:
    # ... existing test job
  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Build package
        run: uv build
      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```
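Once packages are in the registry, developers can also resolve them for local work. One way (assuming a recent uv that supports named indexes; the index name below is arbitrary) is an extra index entry in the consumer repo's pyproject.toml:

```toml
# pyproject.toml (consumer repo) - resolve handler packages from Gitea
# in addition to public PyPI
[[tool.uv.index]]
name = "gitea"
url = "https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/"
```

With that in place, `uv add handler-base` picks up internal releases the same way the cluster's runtime_env does.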
### Phase 2: Update RayService with Handler Applications
Add handler applications to the existing RayService:
```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```
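Once applied, the `/chat` route is plain HTTP and can be exercised with a JSON POST. A stdlib sketch of the request shape ChatHandler expects (the head-service hostname below is an assumption; substitute your Serve endpoint):

```python
import json
from urllib.request import Request

# Payload shape ChatHandler.__call__ expects: a JSON object with a "query" key
payload = {"query": "What is our GPU allocation per node?"}

# Serve routes are plain HTTP; host/port assumed for a typical Ray head service
req = Request(
    "http://raycluster-head-svc.ai-ml.svc.cluster.local:8000/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
```

Passing `req` to `urllib.request.urlopen` (or using curl with the same body) should return the `{"response": ..., "sources": ...}` dict produced in Phase 3.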
### Phase 3: Refactor Handlers for Ray Serve
Convert handlers from standalone NATS subscribers to Ray Serve deployments that can also optionally subscribe to NATS:
```python
# chat_handler.py (refactored)
from ray import serve

from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()
        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```
### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)
Update handler-base clients to use Ray handles when running inside Ray:
```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None
        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings",
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        # HTTP fallback for external callers
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
            return resp.json()["data"][0]["embedding"]
```
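The handle-first/HTTP-fallback decision can be isolated and unit-tested without a Ray cluster. A minimal sketch with stand-in callables (names hypothetical; not handler_base code):

```python
import asyncio


class FallbackClient:
    """Prefer a fast internal path when available, else fall back to HTTP."""

    def __init__(self, handle=None, http_call=None):
        self._handle = handle        # stand-in for a Ray deployment handle
        self._http_call = http_call  # stand-in for the httpx fallback

    async def embed(self, text: str) -> list[float]:
        if self._handle is not None:
            return await self._handle(text)   # internal path
        return await self._http_call(text)    # external path


async def via_handle(text):
    return [0.1, 0.2]  # pretend internal Ray call


async def via_http(text):
    return [0.3, 0.4]  # pretend HTTP round-trip


async def main():
    internal = FallbackClient(handle=via_handle, http_call=via_http)
    external = FallbackClient(handle=None, http_call=via_http)
    return await internal.embed("hi"), await external.embed("hi")

internal_result, external_result = asyncio.run(main())
```

The same client class serves both in-cluster callers (handle set, no HTTP hop) and external callers (handle absent), which is the property Phase 4 relies on.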
### Phase 5: NATS Bridge (Optional)
If you still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:
```python
# pipeline_bridge.py - runs as a Ray actor, subscribes to NATS
import json

import nats
import ray
from ray import serve


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")
        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes; decode to the dict the handler expects
        data = json.loads(msg.data)
        result = await self.chat_handle.process_chat.remote(data)
        if msg.reply:
            # Re-encode the result dict as bytes for the NATS reply
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    # handle_voice (not shown) mirrors handle_chat, forwarding to self.voice_handle
```
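A producer on the NATS side is expected to send JSON-encoded bytes. A stdlib sketch of the round trip the bridge performs (subject names from the subscriptions above; payload and result shapes assumed from ChatHandler in Phase 3):

```python
import json

# What a producer publishes on "ai.chat.request": JSON bytes with a "query" field
request_bytes = json.dumps({"query": "summarize the on-call runbook"}).encode()

# Bridge side: decode to the dict ChatHandler.process_chat expects
data = json.loads(request_bytes)

# The handler's result dict is re-encoded as bytes for the NATS reply subject
result = {"response": "...", "sources": []}
reply_bytes = json.dumps(result).encode()
```

Agreeing on this JSON-over-bytes contract up front keeps the bridge a thin shim: it owns transport and encoding, while all pipeline logic stays in the Serve deployments.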
## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                     │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘
```
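Step 3 is typically a one- or two-line change in the RayService manifest (version numbers below are illustrative):

```yaml
# rayservice.yaml - bump handler versions to roll out a new release
runtime_env:
  pip:
    - handler-base>=0.2.0   # was 0.1.0; Flux applies, Ray reinstalls on restart
    - chat-handler>=0.2.0
```

Because the package versions live in Git, every handler rollout is an auditable commit, and rollback is a revert.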
## Alternatives Considered

### Standalone Container Deployments
Run handlers as separate Kubernetes Deployments outside Ray.
Rejected because:
- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Against the "Ray as unified compute" philosophy
### Bake Handlers into Worker Images
Pre-install handler code in ray-worker images.
Rejected because:
- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images
## Consequences

### Positive
- Single platform: Everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler
### Negative
- Handlers share Ray head node resources
- Need to manage Gitea PyPI authentication for runtime_env
- Slightly more complex RayService configuration
### Neutral
- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines