From 37b18dad08f0c73f48be455fa894298968624871 Mon Sep 17 00:00:00 2001
From: "Billy D."
Date: Mon, 2 Feb 2026 09:14:33 -0500
Subject: [PATCH] accept: ADR-0019 handler deployment strategy

---
 .../ADR-0019-handler-deployment-strategy.md | 365 ++++++++++++++++++
 1 file changed, 365 insertions(+)
 create mode 100644 docs/adr/ADR-0019-handler-deployment-strategy.md

diff --git a/docs/adr/ADR-0019-handler-deployment-strategy.md b/docs/adr/ADR-0019-handler-deployment-strategy.md
new file mode 100644
index 0000000..e783b69
--- /dev/null
+++ b/docs/adr/ADR-0019-handler-deployment-strategy.md
@@ -0,0 +1,365 @@

# ADR-0019: Handler Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have several Python repositories for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.)
| **Yes** | + +### Current Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ PLATFORM LAYERS │ +├─────────────────────────────────────────────────────────────────────┤ +│ Kubeflow Pipelines │ KServe (visibility) │ MLflow (registry) │ +│ [Orchestration] │ [InferenceServices] │ [Models/Metrics] │ +├─────────────────────────────────────────────────────────────────────┤ +│ RAY CLUSTER │ +│ ┌────────────────────────────────────────────────────────────────┐ │ +│ │ Ray Serve Applications (GPU inference) │ │ +│ │ ├─ /llm → VLLMDeployment (khelben, 0.95 GPU) │ │ +│ │ ├─ /whisper → WhisperDeployment (elminster, 0.5 GPU) │ │ +│ │ ├─ /tts → TTSDeployment (elminster, 0.5 GPU) │ │ +│ │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU) │ │ +│ │ └─ /reranker → RerankerDeployment (danilo, 0.8 GPU) │ │ +│ ├────────────────────────────────────────────────────────────────┤ │ +│ │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │ +│ │ ├─ /chat → ChatHandler (head node, 0 GPU) │ │ +│ │ └─ /voice → VoiceHandler (head node, 0 GPU) │ │ +│ └────────────────────────────────────────────────────────────────┘ │ +├─────────────────────────────────────────────────────────────────────┤ +│ RayJob (batch/training) │ NATS (events) │ Milvus (vectors) │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs. +They should run inside the Ray cluster to: +1. Use Ray's internal calling (faster than HTTP) +2. Share observability (Ray Dashboard) +3. Leverage Ray's scheduling for resource management + +## Decision + +**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env` +to install Python packages from Gitea's package registry at deployment time. + +### Why Ray Serve (not standalone containers)? + +1. **Unified Platform**: Everything runs in Ray - inference AND orchestration +2. 
**Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP) +3. **Resource Sharing**: Ray head node has spare CPU/memory for orchestration +4. **Single Observability**: Ray Dashboard shows all applications +5. **Simpler Ops**: One RayService to manage, not multiple Deployments + +### Why runtime_env with pip (not baked into images)? + +1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService +2. **Decoupled Releases**: Handlers update independently of worker images +3. **Smaller Images**: Worker images only need inference dependencies +4. **MLflow Integration**: Can version handlers as MLflow models if needed + +## Implementation Plan + +### Phase 1: Publish Packages to Gitea PyPI + +Each handler repo publishes to Gitea's built-in package registry on release: + +```yaml +# .gitea/workflows/ci.yml +name: CI + +on: + push: + branches: [main] + tags: ['v*'] + pull_request: + branches: [main] + +jobs: + lint: + # ... existing lint job + + test: + # ... existing test job + + publish: + runs-on: ubuntu-latest + needs: [lint, test] + if: startsWith(github.ref, 'refs/tags/v') + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install uv + uses: astral-sh/setup-uv@v5 + + - name: Build package + run: uv build + + - name: Publish to Gitea PyPI + env: + UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi + UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }} + run: uv publish +``` + +### Phase 2: Update RayService with Handler Applications + +Add handler applications to the existing RayService: + +```yaml +# rayservice.yaml additions +spec: + serveConfigV2: | + applications: + # ... existing GPU inference applications ... 
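      #
      # (Hypothetical sketch of the shape of one existing GPU entry, using
      # the EmbeddingsDeployment from the diagram above; the import_path
      # and app name here are assumptions, not the real config:)
      #
      # - name: embeddings
      #   route_prefix: /embeddings
      #   import_path: embeddings_app:app
      #   deployments:
      #     - name: EmbeddingsDeployment
      #       ray_actor_options:
      #         num_gpus: 0.8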

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            # requirements.txt-style option line; runtime_env has no
            # pip_find_links key, so point pip at Gitea's index here
            - --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```

### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments that can
still subscribe to NATS when needed:

```python
# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient

@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0}
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings =
EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5]
        }

# Ray Serve app binding
app = ChatHandler.bind()
```

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve

class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get handle to embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings"
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
```

### Phase 5: NATS Bridge (Optional)

If you still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as Ray actor,
subscribes to NATS
import json

import ray
from ray import serve
import nats

@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes: decode the request before calling the handler
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            # ...and encode the response dict back to bytes for the reply
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        ...  # analogous to handle_chat, forwarding to self.voice_handle
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test │
│ 2. On tag: build wheel → publish to Gitea PyPI │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2 │
│ (bump handler-base>=0.2.0 in runtime_env) │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService │
│ 5. Ray downloads new packages → restarts deployments │
└────────────────────────────────────────────────────────────────────┘
```

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**
- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Works against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.
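For concreteness, this would have meant something like the following worker-image
addition (a hypothetical sketch: the base image tag and pinned versions are
illustrative, not our actual configuration):

```dockerfile
# kuberay-images/ray-worker/Dockerfile (hypothetical sketch)
FROM rayproject/ray:2.9.0-py311

# Handlers baked in at build time: every handler release would now
# force an image rebuild and a rolling restart of the GPU workers
RUN pip install \
    --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/ \
    handler-base==0.1.0 \
    chat-handler==0.1.0 \
    voice-assistant==0.1.0
```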
+ +**Rejected because:** +- Couples handler releases to image rebuilds +- Slower iteration cycle +- Larger images + +## Consequences + +### Positive +- Single platform: Everything runs in Ray +- Fast internal calls via Ray handles +- Unified observability in Ray Dashboard +- Clean abstraction layers: Kubeflow → KServe → Ray → GPU +- Handlers scale with Ray's autoscaler + +### Negative +- Handlers share Ray head node resources +- Need to manage Gitea PyPI authentication for runtime_env +- Slightly more complex RayService configuration + +### Neutral +- MLflow can track handler "models" if we want versioned deployments +- Kubeflow can trigger handler updates via pipelines + +## References + +- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md) +- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html) +- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/) +- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)