# ADR-0019: Python Module Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.) | **Yes** |

### Current Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                          PLATFORM LAYERS                           │
├────────────────────────────────────────────────────────────────────┤
│ Kubeflow Pipelines      │ KServe (visibility)  │ MLflow (registry) │
│ [Orchestration]         │ [InferenceServices]  │ [Models/Metrics]  │
├────────────────────────────────────────────────────────────────────┤
│                           RAY CLUSTER                              │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Ray Serve Applications (GPU inference)                         │ │
│ │ ├─ /llm        → VLLMDeployment (khelben, 0.95 GPU)            │ │
│ │ ├─ /whisper    → WhisperDeployment (elminster, 0.5 GPU)        │ │
│ │ ├─ /tts        → TTSDeployment (elminster, 0.5 GPU)            │ │
│ │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)        │ │
│ │ └─ /reranker   → RerankerDeployment (danilo, 0.8 GPU)          │ │
│ ├────────────────────────────────────────────────────────────────┤ │
│ │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│ │ ├─ /chat  → ChatHandler (head node, 0 GPU)                     │ │
│ │ └─ /voice → VoiceHandler (head node, 0 GPU)                    │ │
│ └────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────┤
│ RayJob (batch/training)  │ NATS (events)      │ Milvus (vectors)   │
└────────────────────────────────────────────────────────────────────┘
```

The key insight is that
**handlers ARE Ray Serve applications** - they just don't need GPUs. They should run inside the Ray cluster to:

1. Use Ray's internal calling (faster than HTTP)
2. Share observability (Ray Dashboard)
3. Leverage Ray's scheduling for resource management

## Decision

**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env` to install Python packages from Gitea's package registry at deployment time.

### Why Ray Serve (not standalone containers)?

1. **Unified Platform**: Everything runs in Ray - inference AND orchestration
2. **Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP)
3. **Resource Sharing**: The Ray head node has spare CPU/memory for orchestration
4. **Single Observability**: The Ray Dashboard shows all applications
5. **Simpler Ops**: One RayService to manage, not multiple Deployments

### Why runtime_env with pip (not baked into images)?

1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService
2. **Decoupled Releases**: Handlers update independently of worker images
3. **Smaller Images**: Worker images only need inference dependencies
4. **MLflow Integration**: Can version handlers as MLflow models if needed

## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint: # ... existing lint job
  test: # ...
  # existing test job
  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Build package
        run: uv build
      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```

### Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - run on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            # Index options pass through to the generated requirements file
            - --extra-index-url=https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - --extra-index-url=https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```

### Phase 3: Refactor Handlers for Ray Serve
Convert handlers from standalone NATS subscribers to Ray Serve deployments that can also optionally subscribe to NATS:

```python
# chat_handler.py (refactored)
from ray import serve
from starlette.requests import Request

from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()
        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request: Request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment", app_name="embeddings"
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        # HTTP fallback for external callers
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
            return resp.json()["data"][0]["embedding"]
```

### Phase 5: NATS Bridge (Optional)

If we still want NATS integration, add a separate NATS bridge that forwards requests to Ray Serve:

```python
# pipeline_bridge.py - runs as a Ray actor and subscribes to NATS
import json

import nats
import ray
from ray import serve


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")
        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes; handlers expect dicts
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                     │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘
```

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**

- Duplicates infrastructure (separate scaling, health checks, etc.)
- Adds HTTP overhead to every inference call
- Requires a separate observability stack
- Goes against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.
**Rejected because:**

- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images

## Consequences

### Positive

- Single platform: everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in the Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler

### Negative

- Handlers share Ray head node resources
- Need to manage Gitea PyPI authentication for runtime_env
- Slightly more complex RayService configuration

### Neutral

- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines

## References

- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md)
- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html)
- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/)
- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)
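## Appendix: Smoke-Testing Handler Logic Locally

A side benefit of the Phase 3 refactor is testability: `process_chat` is plain async Python, so the RAG pipeline can be exercised without a Ray cluster by injecting stubs in place of the handler-base clients. A minimal sketch, assuming only the client method signatures shown above (the stub classes and the `ChatLogic` wrapper are illustrative, not part of handler-base):

```python
import asyncio

# Illustrative stand-ins for the handler-base clients; the real classes
# call the Ray Serve inference deployments.
class StubEmbeddings:
    async def embed(self, text):
        return [0.0] * 8  # fixed-size dummy embedding

class StubMilvus:
    async def search(self, embedding, top_k=10):
        return [f"doc-{i}" for i in range(top_k)]

class StubReranker:
    async def rerank(self, query, results):
        return list(reversed(results))  # pretend the last hit is most relevant

class StubLLM:
    async def generate(self, query, context):
        return f"answer to {query!r} using {len(context)} sources"

class ChatLogic:
    """Same pipeline as ChatHandler.process_chat, with injected clients."""

    def __init__(self, embeddings, milvus, reranker, llm):
        self.embeddings, self.milvus = embeddings, milvus
        self.reranker, self.llm = reranker, llm

    async def process_chat(self, data):
        query = data["query"]
        embedding = await self.embeddings.embed(query)
        results = await self.milvus.search(embedding, top_k=10)
        reranked = await self.reranker.rerank(query, results)
        response = await self.llm.generate(query, context=reranked[:5])
        return {"response": response, "sources": reranked[:5]}

result = asyncio.run(
    ChatLogic(StubEmbeddings(), StubMilvus(), StubReranker(), StubLLM())
    .process_chat({"query": "hello"})
)
print(result)
```

The same stubs cover the voice pipeline's RAG middle section; only the STT/TTS ends differ.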