# ADR-0019: Python Module Deployment Strategy

## Status

Accepted

## Date

2026-02-02

## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

| Repo | Purpose | Needs GPU? |
|------|---------|------------|
| `handler-base` | Shared library (NATS, clients, telemetry) | No |
| `chat-handler` | Text chat → RAG → LLM pipeline | No (calls GPU endpoints) |
| `voice-assistant` | Audio → STT → RAG → LLM → TTS pipeline | No (calls GPU endpoints) |
| `pipeline-bridge` | Kubeflow ↔ NATS integration | No |
| `kuberay-images/ray-serve/` | Inference deployments (Whisper, TTS, LLM, etc.) | **Yes** |

### Current Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                           PLATFORM LAYERS                           │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines   │  KServe (visibility)  │  MLflow (registry)  │
│  [Orchestration]      │  [InferenceServices]  │  [Models/Metrics]   │
├─────────────────────────────────────────────────────────────────────┤
│                             RAY CLUSTER                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ Ray Serve Applications (GPU inference)                         │ │
│  │  ├─ /llm        → VLLMDeployment (khelben, 0.95 GPU)           │ │
│  │  ├─ /whisper    → WhisperDeployment (elminster, 0.5 GPU)       │ │
│  │  ├─ /tts        → TTSDeployment (elminster, 0.5 GPU)           │ │
│  │  ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)       │ │
│  │  └─ /reranker   → RerankerDeployment (danilo, 0.8 GPU)         │ │
│  ├────────────────────────────────────────────────────────────────┤ │
│  │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│  │  ├─ /chat  → ChatHandler (head node, 0 GPU)                    │ │
│  │  └─ /voice → VoiceHandler (head node, 0 GPU)                   │ │
│  └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)     │
└─────────────────────────────────────────────────────────────────────┘
```

The key insight is that **handlers ARE Ray Serve applications** - they just don't need GPUs.
They should run inside the Ray cluster to:

1. Use Ray's internal calling (faster than HTTP)
2. Share observability (Ray Dashboard)
3. Leverage Ray's scheduling for resource management

## Decision

**Deploy handlers as Ray Serve applications inside the Ray cluster**, using `runtime_env`
to install Python packages from Gitea's package registry at deployment time.

### Why Ray Serve (not standalone containers)?

1. **Unified Platform**: Everything runs in Ray - inference AND orchestration
2. **Internal Calls**: Handlers can call inference deployments via Ray handles (no HTTP)
3. **Resource Sharing**: The Ray head node has spare CPU/memory for orchestration
4. **Single Observability**: Ray Dashboard shows all applications
5. **Simpler Ops**: One RayService to manage, not multiple Deployments

### Why runtime_env with pip (not baked into images)?

1. **Faster Iteration**: Change handler code → push to PyPI → redeploy RayService
2. **Decoupled Releases**: Handlers update independently of worker images
3. **Smaller Images**: Worker images only need inference dependencies
4. **MLflow Integration**: Can version handlers as MLflow models if needed

## Implementation Plan

### Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build package
        run: uv build

      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish
```
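
`uv build` needs packaging metadata in each handler repo. A minimal `pyproject.toml` sketch — the name, version, dependencies, and build backend shown here are illustrative, not the actual repo contents:

```toml
# pyproject.toml (minimal sketch - fields are illustrative)
[project]
name = "chat-handler"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "handler-base>=0.1.0",
    "ray[serve]",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```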

### Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

```yaml
# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20
```
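
One cheap guardrail on this config: handler deployments must keep `num_gpus: 0`, or Ray will try to place them on GPU workers instead of the head node. A stdlib-only sketch of a CI check, where the inline `config` dict stands in for the parsed `serveConfigV2` (e.g. the output of `yaml.safe_load`):

```python
# Sketch: guard that handler apps request zero GPUs so they stay
# schedulable on the head node. `config` stands in for the parsed
# serveConfigV2 (e.g. from yaml.safe_load).
config = {
    "applications": [
        {
            "name": "chat-handler",
            "deployments": [
                {"name": "ChatDeployment",
                 "ray_actor_options": {"num_cpus": 0.5, "num_gpus": 0}},
            ],
        },
        {
            "name": "voice-assistant",
            "deployments": [
                {"name": "VoiceDeployment",
                 "ray_actor_options": {"num_cpus": 1, "num_gpus": 0}},
            ],
        },
    ]
}

def gpu_requesting(config: dict) -> list[str]:
    """Return names of deployments that (accidentally) request GPUs."""
    return [
        dep["name"]
        for app in config["applications"]
        for dep in app.get("deployments", [])
        if dep.get("ray_actor_options", {}).get("num_gpus", 0) != 0
    ]

assert gpu_requesting(config) == []
```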

### Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments that can
optionally still subscribe to NATS:

```python
# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient


@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0},
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()

        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]

        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)

        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)

        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)

        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])

        return {
            "response": response,
            "sources": reranked[:5],
        }


# Ray Serve app binding
app = ChatHandler.bind()
```
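
Because the core logic lives in a plain async method, it can be unit-tested without a Ray cluster. A sketch with stub clients — the stubs and their return values are illustrative stand-ins, not the real `handler_base` APIs:

```python
# Sketch: exercise the four-step chat pipeline with stub clients,
# no Ray cluster required. All stub return values are made up.
import asyncio

class StubEmbeddings:
    async def embed(self, text):
        return [0.1, 0.2, 0.3]

class StubMilvus:
    async def search(self, embedding, top_k=10):
        return [f"doc-{i}" for i in range(top_k)]

class StubReranker:
    async def rerank(self, query, results):
        return list(reversed(results))  # pretend best-last becomes best-first

class StubLLM:
    async def generate(self, query, context):
        return f"answer using {len(context)} sources"

async def process_chat(data, embeddings, milvus, reranker, llm):
    # Same four steps as ChatHandler.process_chat, with injected clients
    query = data["query"]
    embedding = await embeddings.embed(query)
    results = await milvus.search(embedding, top_k=10)
    reranked = await reranker.rerank(query, results)
    response = await llm.generate(query, context=reranked[:5])
    return {"response": response, "sources": reranked[:5]}

result = asyncio.run(process_chat(
    {"query": "hi"}, StubEmbeddings(), StubMilvus(), StubReranker(), StubLLM()
))
assert result["response"] == "answer using 5 sources"
```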

### Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

```python
# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve


class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None

        # If running inside Ray, get a handle to the embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment",
                    app_name="embeddings",
                )
            except Exception:
                pass  # Fall back to HTTP

    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        # HTTP fallback for external callers
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
            return resp.json()["data"][0]["embedding"]
```
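
Stripped of the Ray and httpx specifics, the client above is just a preference order over two transports. A pure-Python sketch of that pattern, with placeholder callables standing in for the Ray handle and the HTTP call:

```python
# Sketch of the "Ray handle first, HTTP fallback" selection pattern.
# via_handle/via_http are placeholders for the real transports.
import asyncio

class FallbackClient:
    def __init__(self, handle=None, http_call=None):
        self._handle = handle        # set when running inside Ray
        self._http_call = http_call  # used by external callers

    async def embed(self, text):
        if self._handle is not None:
            return await self._handle(text)  # fast internal path
        return await self._http_call(text)   # HTTP fallback

async def via_handle(text):
    return ("ray", text)

async def via_http(text):
    return ("http", text)

inside = asyncio.run(FallbackClient(handle=via_handle, http_call=via_http).embed("q"))
outside = asyncio.run(FallbackClient(http_call=via_http).embed("q"))
assert inside == ("ray", "q") and outside == ("http", "q")
```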

### Phase 5: NATS Bridge (Optional)

If we still want NATS integration, add a separate NATS bridge that forwards to Ray Serve:

```python
# pipeline_bridge.py - runs as a Ray actor, subscribes to NATS
import json

import nats
import ray
from ray import serve


@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # msg.data is bytes; process_chat expects a dict
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        # Mirrors handle_chat (the voice handler's entrypoint name is analogous)
        result = await self.voice_handle.process_voice.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
```

## CI/CD Flow

```
┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                     │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘
```
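
Step 3 is a one-line edit in homelab-k8s2; if we ever script it, the bump reduces to a regex substitution. A sketch — the `package>=X.Y.Z` pin format is an assumption about how the `runtime_env` entries are written:

```python
# Sketch: bump one package pin inside rayservice.yaml text.
# Assumes pins are written as "package>=X.Y.Z".
import re

def bump_pin(yaml_text: str, package: str, new_version: str) -> str:
    pattern = rf"({re.escape(package)}>=)\d+\.\d+\.\d+"
    return re.sub(pattern, rf"\g<1>{new_version}", yaml_text)

snippet = "pip:\n  - handler-base>=0.1.0\n  - chat-handler>=0.1.0\n"
bumped = bump_pin(snippet, "handler-base", "0.2.0")
assert "handler-base>=0.2.0" in bumped
assert "chat-handler>=0.1.0" in bumped  # other pins untouched
```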

## Alternatives Considered

### Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

**Rejected because:**
- Duplicates infrastructure (separate scaling, health checks, etc.)
- HTTP overhead for every inference call
- Separate observability stack
- Against the "Ray as unified compute" philosophy

### Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

**Rejected because:**
- Couples handler releases to image rebuilds
- Slower iteration cycle
- Larger images

## Consequences

### Positive
- Single platform: Everything runs in Ray
- Fast internal calls via Ray handles
- Unified observability in Ray Dashboard
- Clean abstraction layers: Kubeflow → KServe → Ray → GPU
- Handlers scale with Ray's autoscaler

### Negative
- Handlers share Ray head node resources
- Need to manage Gitea PyPI authentication for runtime_env
- Slightly more complex RayService configuration

### Neutral
- MLflow can track handler "models" if we want versioned deployments
- Kubeflow can trigger handler updates via pipelines

## References

- [ray-kserve-integration.md](../../homelab-k8s2/docs/ray-kserve-integration.md)
- [Ray Serve runtime_env docs](https://docs.ray.io/en/latest/serve/production-guide/config.html)
- [Gitea Package Registry](https://docs.gitea.io/en-us/packages/pypi/)
- [ADR-0012: Ray Cluster Architecture](ADR-0012-ray-cluster-unified.md)