From 37b18dad08f0c73f48be455fa894298968624871 Mon Sep 17 00:00:00 2001
From: "Billy D."
Date: Mon, 2 Feb 2026 09:14:33 -0500
Subject: [PATCH] accept: ADR-0019 handler deployment strategy

---
 .../ADR-0019-handler-deployment-strategy.md | 365 ++++++++++++++++++
 1 file changed, 365 insertions(+)
 create mode 100644 docs/adr/ADR-0019-handler-deployment-strategy.md

diff --git a/docs/adr/ADR-0019-handler-deployment-strategy.md b/docs/adr/ADR-0019-handler-deployment-strategy.md
new file mode 100644
index 0000000..e783b69
--- /dev/null
+++ b/docs/adr/ADR-0019-handler-deployment-strategy.md
@@ -0,0 +1,365 @@

# ADR-0019: Handler Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have several Python repositories for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.)
| **Yes** | + +### Current Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ PLATFORM LAYERS │ +├─────────────────────────────────────────────────────────────────────┤ +│ Kubeflow Pipelines │ KServe (visibility) │ MLflow (registry) │ +│ [Orchestration] │ [InferenceServices] │ [Models/Metrics] │ +├─────────────────────────────────────────────────────────────────────┤ +│ RAY CLUSTER │ +│ ┌────────────────────────────────────────────────────────────────┐ │ +│ │ Ray Serve Applications (GPU inference) │ │ +│ │ ├─ /llm → VLLMDeployment (khelben, 0.95 GPU) │ │ +│ │ ├─ /whisper → WhisperDeployment (elminster, 0.5 GPU) │ │ +│ │ ├─ /tts → TTSDeployment (elminster, 0.5 GPU) │ │ +│ │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU) │ │ +│ │ └─ /reranker → RerankerDeployment (danilo, 0.8 GPU) │ │ +│ ├────────────────────────────────────────────────────────────────┤ │ +│ │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │ +│ │ ├─ /chat → ChatHandler (head node, 0 GPU) │ │ +│ │ └─ /voice → VoiceHandler (head node, 0 GPU) │ │ +│ └────────────────────────────────────────────────────────────────┘ │ +├─────────────────────────────────────────────────────────────────────┤ +│ RayJob (batch/training) │ NATS (events) │ Milvus (vectors) │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs. +They should run inside the Ray cluster to: +1. Use Ray's internal calling (faster than HTTP) +2. Share observability (Ray Dashboard) +3. Leverage Ray's scheduling for resource management + +## Decision + +**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env` +to install Python packages from Gitea's package registry at deployment time. + +### Why Ray Serve (not standalone containers)? + +1. **Unified Platform**: Everything runs in Ray - inference AND orchestration +2. 
**Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP) +3. **Resource Sharing**: Ray head node has spare CPU/memory for orchestration +4. **Single Observability**: Ray Dashboard shows all applications +5. **Simpler Ops**: One RayService to manage, not multiple Deployments + +### Why runtime_env with pip (not baked into images)? + +1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService +2. **Decoupled Releases**: Handlers update independently of worker images +3. **Smaller Images**: Worker images only need inference dependencies +4. **MLflow Integration**: Can version handlers as MLflow models if needed + +## Implementation Plan + +### Phase 1: Publish Packages to Gitea PyPI + +Each handler repo publishes to Gitea's built-in package registry on release: + +```yaml +# .gitea/workflows/ci.yml +name: CI + +on: + push: + branches: [main] + tags: ['v*'] + pull_request: + branches: [main] + +jobs: + lint: + # ... existing lint job + + test: + # ... existing test job + + publish: + runs-on: ubuntu-latest + needs: [lint, test] + if: startsWith(github.ref, 'refs/tags/v') + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install uv + uses: astral-sh/setup-uv@v5 + + - name: Build package + run: uv build + + - name: Publish to Gitea PyPI + env: + UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi + UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }} + run: uv publish +``` + +### Phase 2: Update RayService with Handler Applications + +Add handler applications to the existing RayService: + +```yaml +# rayservice.yaml additions +spec: + serveConfigV2: | + applications: + # ... existing GPU inference applications ... 
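      #
      # (Hypothetical sketch of the shape of one existing GPU entry, using
      # the EmbeddingsDeployment from the diagram above; the import_path
      # and app name here are assumptions, not the real config:)
      #
      # - name: embeddings
      #   route_prefix: /embeddings
      #   import_path: embeddings_app:app
      #   deployments:
      #     - name: EmbeddingsDeployment
      #       ray_actor_options:
      #         num_gpus: 0.8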

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            # requirements.txt-style option line; runtime_env has no
            # pip_find_links key, so point pip at Gitea's index here
            - --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```

### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments that can
still subscribe to NATS when needed:

```python
# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient

@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0}
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings =
EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5]
        }

# Ray Serve app binding
app = ChatHandler.bind()
```

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve

class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get handle to embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings"
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
```

### Phase 5: NATS Bridge (Optional)

If you still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as Ray actor,
subscribes to NATS
import json

import ray
from ray import serve
import nats

@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes: decode the request before calling the handler
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            # ...and encode the response dict back to bytes for the reply
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        ...  # analogous to handle_chat, forwarding to self.voice_handle
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test │
│ 2. On tag: build wheel → publish to Gitea PyPI │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2 │
│ (bump handler-base>=0.2.0 in runtime_env) │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService │
│ 5. Ray downloads new packages → restarts deployments │
└────────────────────────────────────────────────────────────────────┘
```

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**
- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Works against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.
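For concreteness, this would have meant something like the following worker-image
addition (a hypothetical sketch: the base image tag and pinned versions are
illustrative, not our actual configuration):

```dockerfile
# kuberay-images/ray-worker/Dockerfile (hypothetical sketch)
FROM rayproject/ray:2.9.0-py311

# Handlers baked in at build time: every handler release would now
# force an image rebuild and a rolling restart of the GPU workers
RUN pip install \
    --extra-index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/ \
    handler-base==0.1.0 \
    chat-handler==0.1.0 \
    voice-assistant==0.1.0
```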
+ +**Rejected because:** +- Couples handler releases to image rebuilds +- Slower iteration cycle +- Larger images + +## Consequences + +### Positive +- Single platform: Everything runs in Ray +- Fast internal calls via Ray handles +- Unified observability in Ray Dashboard +- Clean abstraction layers: Kubeflow → KServe → Ray → GPU +- Handlers scale with Ray's autoscaler + +### Negative +- Handlers share Ray head node resources +- Need to manage Gitea PyPI authentication for runtime_env +- Slightly more complex RayService configuration + +### Neutral +- MLflow can track handler "models" if we want versioned deployments +- Kubeflow can trigger handler updates via pipelines + +## References + +- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md) +- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html) +- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/) +- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)