ADR-0019: Python Module Deployment Strategy

Status

Accepted

Date

2026-02-02

Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

Repo                      | Purpose                                         | Needs GPU?
--------------------------|-------------------------------------------------|-------------------------
handler-base              | Shared library (NATS, clients, telemetry)       | No
chat-handler              | Text chat → RAG → LLM pipeline                  | No (calls GPU endpoints)
voice-assistant           | Audio → STT → RAG → LLM → TTS pipeline          | No (calls GPU endpoints)
pipeline-bridge           | Kubeflow ↔ NATS integration                     | No
kuberay-images/ray-serve/ | Inference deployments (Whisper, TTS, LLM, etc.) | Yes

Current Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PLATFORM LAYERS                               │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines  │  KServe (visibility)  │  MLflow (registry)  │
│  [Orchestration]     │  [InferenceServices]  │  [Models/Metrics]   │
├─────────────────────────────────────────────────────────────────────┤
│                         RAY CLUSTER                                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ Ray Serve Applications (GPU inference)                         │ │
│  │ ├─ /llm        → VLLMDeployment      (khelben, 0.95 GPU)      │ │
│  │ ├─ /whisper    → WhisperDeployment   (elminster, 0.5 GPU)     │ │
│  │ ├─ /tts        → TTSDeployment       (elminster, 0.5 GPU)     │ │
│  │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)       │ │
│  │ └─ /reranker   → RerankerDeployment  (danilo, 0.8 GPU)        │ │
│  ├────────────────────────────────────────────────────────────────┤ │
│  │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│  │ ├─ /chat       → ChatHandler         (head node, 0 GPU)       │ │
│  │ └─ /voice      → VoiceHandler        (head node, 0 GPU)       │ │
│  └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)    │
└─────────────────────────────────────────────────────────────────────┘

The key insight is that handlers ARE Ray Serve applications - they just don't need GPUs. They should run inside the Ray cluster to:

  1. Use Ray's internal calling (faster than HTTP)
  2. Share observability (Ray Dashboard)
  3. Leverage Ray's scheduling for resource management

Decision

Deploy handlers as Ray Serve applications inside the Ray cluster, using runtime_env to install Python packages from Gitea's package registry at deployment time.

Why Ray Serve (not standalone containers)?

  1. Unified Platform: Everything runs in Ray - inference AND orchestration
  2. Internal Calls: Handlers can call inference deployments via Ray handles (no HTTP)
  3. Resource Sharing: Ray head node has spare CPU/memory for orchestration
  4. Single Observability: Ray Dashboard shows all applications
  5. Simpler Ops: One RayService to manage, not multiple Deployments

Why runtime_env with pip (not baked into images)?

  1. Faster Iteration: Change handler code → publish to Gitea PyPI → redeploy RayService
  2. Decoupled Releases: Handlers update independently of worker images
  3. Smaller Images: Worker images only need inference dependencies
  4. MLflow Integration: Can version handlers as MLflow models if needed

Implementation Plan

Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install uv
        uses: astral-sh/setup-uv@v5
        
      - name: Build package
        run: uv build
        
      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish

Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline  
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20

Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments whose core logic can be invoked over HTTP or, optionally, from NATS:

# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient

@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0}
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()
        
        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]
        
        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)
        
        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)
        
        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)
        
        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])
        
        return {
            "response": response,
            "sources": reranked[:5]
        }

# Ray Serve app binding
app = ChatHandler.bind()
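
The four-step pipeline above can be exercised without a Ray cluster by substituting stub clients. Everything in this sketch (the stub classes and their return values) is hypothetical; it only mirrors the embed → search → rerank → generate call order of process_chat:

```python
import asyncio

# Hypothetical stand-ins for the handler-base clients; each returns just
# enough structure to trace the pipeline order.
class StubEmbeddings:
    async def embed(self, text):
        return [0.1, 0.2, 0.3]

class StubMilvus:
    async def search(self, embedding, top_k=10):
        return [f"doc-{i}" for i in range(top_k)]

class StubReranker:
    async def rerank(self, query, results):
        return list(reversed(results))  # pretend reranking reorders the hits

class StubLLM:
    async def generate(self, query, context):
        return f"answer to {query!r} using {len(context)} sources"

async def process_chat(data):
    # Same flow as ChatHandler.process_chat, with the stubs wired in
    query = data["query"]
    embedding = await StubEmbeddings().embed(query)
    results = await StubMilvus().search(embedding, top_k=10)
    reranked = await StubReranker().rerank(query, results)
    response = await StubLLM().generate(query, context=reranked[:5])
    return {"response": response, "sources": reranked[:5]}

result = asyncio.run(process_chat({"query": "what is an ADR?"}))
print(result["sources"])  # the top five reranked hits
```

This kind of stub-driven test keeps the RAG logic verifiable in CI without GPUs or a running cluster.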

Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve

class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None
        
        # If running inside Ray, get handle to embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment", 
                    app_name="embeddings"
                )
            except Exception:
                pass  # Fall back to HTTP
    
    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
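
The handle-or-HTTP fallback can be unit-tested by injecting fakes for both paths. This sketch is hypothetical: plain async callables stand in for the Ray deployment handle and the httpx call:

```python
import asyncio

class FallbackEmbeddings:
    """Prefers a fast in-process path, falls back to a remote path (sketch)."""
    def __init__(self, handle=None, http_call=None):
        self._handle = handle        # stands in for a Ray deployment handle
        self._http_call = http_call  # stands in for the httpx POST

    async def embed(self, text: str) -> list[float]:
        if self._handle is not None:
            return await self._handle(text)
        return await self._http_call(text)

async def via_handle(text):
    return [1.0]  # pretend this is the internal Ray call

async def via_http(text):
    return [2.0]  # pretend this is the HTTP fallback

inside_ray = FallbackEmbeddings(handle=via_handle, http_call=via_http)
outside_ray = FallbackEmbeddings(http_call=via_http)
print(asyncio.run(inside_ray.embed("hi")), asyncio.run(outside_ray.embed("hi")))
```

Injecting both paths keeps the selection logic testable without initializing Ray or standing up an HTTP endpoint.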

Phase 5: NATS Bridge (Optional)

If NATS integration is still needed, add a separate NATS bridge actor that forwards messages to Ray Serve:

# pipeline_bridge.py - runs as Ray actor, subscribes to NATS
import json

import ray
from ray import serve
import nats

@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes: decode the JSON request, encode the reply
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        # Mirrors handle_chat; the process_voice method name is illustrative
        result = await self.voice_handle.process_voice.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
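
Because NATS delivers raw bytes, the bridge must deserialize each request and serialize each reply at the boundary. A minimal round-trip sketch (the payload fields are hypothetical):

```python
import json

def decode_request(payload: bytes) -> dict:
    # NATS hands the subscriber raw bytes; parse them as JSON
    return json.loads(payload)

def encode_reply(result: dict) -> bytes:
    # Replies must be serialized back to bytes before publishing
    return json.dumps(result).encode()

# Simulated round trip for an ai.chat.request message
wire = json.dumps({"query": "hello"}).encode()
request = decode_request(wire)
reply = encode_reply({"response": f"echo: {request['query']}"})
print(reply)
```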

CI/CD Flow

┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                    │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘

Alternatives Considered

Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

Rejected because:

  • Duplicates infrastructure (separate scaling, health checks, etc.)
  • HTTP overhead for every inference call
  • Separate observability stack
  • Against the "Ray as unified compute" philosophy

Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

Rejected because:

  • Couples handler releases to image rebuilds
  • Slower iteration cycle
  • Larger images

Consequences

Positive

  • Single platform: Everything runs in Ray
  • Fast internal calls via Ray handles
  • Unified observability in Ray Dashboard
  • Clean abstraction layers: Kubeflow → KServe → Ray → GPU
  • Handlers scale with Ray's autoscaler

Negative

  • Handlers share Ray head node resources
  • Need to manage Gitea PyPI authentication for runtime_env
  • Slightly more complex RayService configuration

Neutral

  • MLflow can track handler "models" if we want versioned deployments
  • Kubeflow can trigger handler updates via pipelines
