ADR-0019: Python Module Deployment Strategy

Status

Accepted

Date

2026-02-02

Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:

Repo                      | Purpose                                         | Needs GPU?
--------------------------|-------------------------------------------------|-------------------------
handler-base              | Shared library (NATS, clients, telemetry)       | No
chat-handler              | Text chat → RAG → LLM pipeline                  | No (calls GPU endpoints)
voice-assistant           | Audio → STT → RAG → LLM → TTS pipeline          | No (calls GPU endpoints)
pipeline-bridge           | Kubeflow ↔ NATS integration                     | No
kuberay-images/ray-serve/ | Inference deployments (Whisper, TTS, LLM, etc.) | Yes

Current Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PLATFORM LAYERS                               │
├─────────────────────────────────────────────────────────────────────┤
│  Kubeflow Pipelines  │  KServe (visibility)  │  MLflow (registry)  │
│  [Orchestration]     │  [InferenceServices]  │  [Models/Metrics]   │
├─────────────────────────────────────────────────────────────────────┤
│                         RAY CLUSTER                                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │ Ray Serve Applications (GPU inference)                         │ │
│  │ ├─ /llm        → VLLMDeployment      (khelben, 0.95 GPU)      │ │
│  │ ├─ /whisper    → WhisperDeployment   (elminster, 0.5 GPU)     │ │
│  │ ├─ /tts        → TTSDeployment       (elminster, 0.5 GPU)     │ │
│  │ ├─ /embeddings → EmbeddingsDeployment (drizzt, 0.8 GPU)       │ │
│  │ └─ /reranker   → RerankerDeployment  (danilo, 0.8 GPU)        │ │
│  ├────────────────────────────────────────────────────────────────┤ │
│  │ Ray Serve Applications (CPU orchestration) ← WHERE HANDLERS GO │ │
│  │ ├─ /chat       → ChatHandler         (head node, 0 GPU)       │ │
│  │ └─ /voice      → VoiceHandler        (head node, 0 GPU)       │ │
│  └────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│  RayJob (batch/training)  │  NATS (events)  │  Milvus (vectors)    │
└─────────────────────────────────────────────────────────────────────┘

The key insight is that handlers ARE Ray Serve applications - they just don't need GPUs. They should run inside the Ray cluster to:

  1. Use Ray's internal calling (faster than HTTP)
  2. Share observability (Ray Dashboard)
  3. Leverage Ray's scheduling for resource management

Decision

Deploy handlers as Ray Serve applications inside the Ray cluster, using runtime_env to install Python packages from Gitea's package registry at deployment time.

Why Ray Serve (not standalone containers)?

  1. Unified Platform: Everything runs in Ray - inference AND orchestration
  2. Internal Calls: Handlers can call inference deployments via Ray handles (no HTTP)
  3. Resource Sharing: Ray head node has spare CPU/memory for orchestration
  4. Single Observability: Ray Dashboard shows all applications
  5. Simpler Ops: One RayService to manage, not multiple Deployments

Why runtime_env with pip (not baked into images)?

  1. Faster Iteration: Change handler code → publish to Gitea PyPI → redeploy RayService
  2. Decoupled Releases: Handlers update independently of worker images
  3. Smaller Images: Worker images only need inference dependencies
  4. MLflow Integration: Can version handlers as MLflow models if needed

Implementation Plan

Phase 1: Publish Packages to Gitea PyPI

Each handler repo publishes to Gitea's built-in package registry on release:

# .gitea/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  lint:
    # ... existing lint job

  test:
    # ... existing test job

  publish:
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install uv
        uses: astral-sh/setup-uv@v5
        
      - name: Build package
        run: uv build
        
      - name: Publish to Gitea PyPI
        env:
          UV_PUBLISH_URL: https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi
          UV_PUBLISH_TOKEN: ${{ secrets.GITEA_TOKEN }}
        run: uv publish

Phase 2: Update RayService with Handler Applications

Add handler applications to the existing RayService:

# rayservice.yaml additions
spec:
  serveConfigV2: |
    applications:
      # ... existing GPU inference applications ...

      # ============================================
      # HANDLERS (CPU - runs on head node)
      # ============================================

      # Chat Handler - RAG + LLM pipeline
      - name: chat-handler
        route_prefix: /chat
        import_path: chat_handler:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - chat-handler>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
            OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector.monitoring.svc.cluster.local:4317"
        deployments:
          - name: ChatDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 0.5
              num_gpus: 0  # No GPU needed
            max_ongoing_requests: 50

      # Voice Assistant - STT → RAG → LLM → TTS pipeline  
      - name: voice-assistant
        route_prefix: /voice
        import_path: voice_assistant:app
        runtime_env:
          pip:
            - handler-base>=0.1.0
            - voice-assistant>=0.1.0
          pip_find_links:
            - https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
          env_vars:
            NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
            MILVUS_HOST: "milvus.ai-ml.svc.cluster.local"
        deployments:
          - name: VoiceDeployment
            num_replicas: 2
            ray_actor_options:
              num_cpus: 1
              num_gpus: 0
            max_ongoing_requests: 20

Phase 3: Refactor Handlers for Ray Serve

Convert handlers from standalone NATS subscribers into Ray Serve deployments whose core logic can be invoked over HTTP or, optionally, from NATS:

# chat_handler.py (refactored)
from ray import serve
from handler_base import Settings
from handler_base.clients import EmbeddingsClient, LLMClient, RerankerClient, MilvusClient

@serve.deployment(
    name="ChatDeployment",
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0}
)
class ChatHandler:
    def __init__(self):
        self.settings = Settings()
        
        # Initialize clients - these can use Ray handles for internal calls
        self.embeddings = EmbeddingsClient()
        self.llm = LLMClient()
        self.reranker = RerankerClient()
        self.milvus = MilvusClient()

    async def __call__(self, request) -> dict:
        """Handle HTTP requests (from Gradio, etc.)"""
        data = await request.json()
        return await self.process_chat(data)

    async def process_chat(self, data: dict) -> dict:
        """Core chat logic - called by HTTP or NATS"""
        query = data["query"]
        
        # 1. Generate embeddings
        embedding = await self.embeddings.embed(query)
        
        # 2. Vector search
        results = await self.milvus.search(embedding, top_k=10)
        
        # 3. Rerank
        reranked = await self.reranker.rerank(query, results)
        
        # 4. Generate response
        response = await self.llm.generate(query, context=reranked[:5])
        
        return {
            "response": response,
            "sources": reranked[:5]
        }

# Ray Serve app binding
app = ChatHandler.bind()
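
The four-step pipeline above can be exercised without a Ray cluster by substituting stub clients. Everything in this sketch (the stub classes and their return values) is hypothetical; it only mirrors the embed → search → rerank → generate call order of process_chat:

```python
import asyncio

# Hypothetical stand-ins for the handler-base clients; each returns just
# enough structure to trace the pipeline order.
class StubEmbeddings:
    async def embed(self, text):
        return [0.1, 0.2, 0.3]

class StubMilvus:
    async def search(self, embedding, top_k=10):
        return [f"doc-{i}" for i in range(top_k)]

class StubReranker:
    async def rerank(self, query, results):
        return list(reversed(results))  # pretend reranking reorders the hits

class StubLLM:
    async def generate(self, query, context):
        return f"answer to {query!r} using {len(context)} sources"

async def process_chat(data):
    # Same flow as ChatHandler.process_chat, with the stubs wired in
    query = data["query"]
    embedding = await StubEmbeddings().embed(query)
    results = await StubMilvus().search(embedding, top_k=10)
    reranked = await StubReranker().rerank(query, results)
    response = await StubLLM().generate(query, context=reranked[:5])
    return {"response": response, "sources": reranked[:5]}

result = asyncio.run(process_chat({"query": "what is an ADR?"}))
print(result["sources"])  # the top five reranked hits
```

This kind of stub-driven test keeps the RAG logic verifiable in CI without GPUs or a running cluster.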

Phase 4: Use Ray Handles for Internal Calls (Optional Optimization)

Update handler-base clients to use Ray handles when running inside Ray:

# handler_base/clients/embeddings.py
import httpx
import ray
from ray import serve

class EmbeddingsClient:
    def __init__(self, url: str | None = None):
        self.url = url
        self._handle = None
        
        # If running inside Ray, get handle to embeddings deployment
        if ray.is_initialized():
            try:
                self._handle = serve.get_deployment_handle(
                    "EmbeddingsDeployment", 
                    app_name="embeddings"
                )
            except Exception:
                pass  # Fall back to HTTP
    
    async def embed(self, text: str) -> list[float]:
        if self._handle:
            # Fast internal Ray call
            return await self._handle.embed.remote(text)
        else:
            # HTTP fallback for external callers
            async with httpx.AsyncClient() as client:
                resp = await client.post(f"{self.url}/v1/embeddings", json={"input": text})
                return resp.json()["data"][0]["embedding"]
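
The handle-or-HTTP fallback can be unit-tested by injecting fakes for both paths. This sketch is hypothetical: plain async callables stand in for the Ray deployment handle and the httpx call:

```python
import asyncio

class FallbackEmbeddings:
    """Prefers a fast in-process path, falls back to a remote path (sketch)."""
    def __init__(self, handle=None, http_call=None):
        self._handle = handle        # stands in for a Ray deployment handle
        self._http_call = http_call  # stands in for the httpx POST

    async def embed(self, text: str) -> list[float]:
        if self._handle is not None:
            return await self._handle(text)
        return await self._http_call(text)

async def via_handle(text):
    return [1.0]  # pretend this is the internal Ray call

async def via_http(text):
    return [2.0]  # pretend this is the HTTP fallback

inside_ray = FallbackEmbeddings(handle=via_handle, http_call=via_http)
outside_ray = FallbackEmbeddings(http_call=via_http)
print(asyncio.run(inside_ray.embed("hi")), asyncio.run(outside_ray.embed("hi")))
```

Injecting both paths keeps the selection logic testable without initializing Ray or standing up an HTTP endpoint.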

Phase 5: NATS Bridge (Optional)

If NATS integration is still needed, add a separate NATS bridge actor that forwards messages to Ray Serve:

# pipeline_bridge.py - runs as Ray actor, subscribes to NATS
import json

import ray
from ray import serve
import nats

@ray.remote
class NATSBridge:
    def __init__(self):
        self.nc = None
        self.chat_handle = serve.get_deployment_handle("ChatDeployment", "chat-handler")
        self.voice_handle = serve.get_deployment_handle("VoiceDeployment", "voice-assistant")

    async def start(self):
        self.nc = await nats.connect("nats://nats.ai-ml.svc.cluster.local:4222")

        await self.nc.subscribe("ai.chat.request", cb=self.handle_chat)
        await self.nc.subscribe("voice.request", cb=self.handle_voice)

    async def handle_chat(self, msg):
        # NATS payloads are bytes: decode the JSON request, encode the reply
        result = await self.chat_handle.process_chat.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())

    async def handle_voice(self, msg):
        # Mirrors handle_chat; the process_voice method name is illustrative
        result = await self.voice_handle.process_voice.remote(json.loads(msg.data))
        if msg.reply:
            await self.nc.publish(msg.reply, json.dumps(result).encode())
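
Because NATS delivers raw bytes, the bridge must deserialize each request and serialize each reply at the boundary. A minimal round-trip sketch (the payload fields are hypothetical):

```python
import json

def decode_request(payload: bytes) -> dict:
    # NATS hands the subscriber raw bytes; parse them as JSON
    return json.loads(payload)

def encode_reply(result: dict) -> bytes:
    # Replies must be serialized back to bytes before publishing
    return json.dumps(result).encode()

# Simulated round trip for an ai.chat.request message
wire = json.dumps({"query": "hello"}).encode()
request = decode_request(wire)
reply = encode_reply({"response": f"echo: {request['query']}"})
print(reply)
```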

CI/CD Flow

┌────────────────────────────────────────────────────────────────────┐
│ Developer pushes to handler repo                                   │
├────────────────────────────────────────────────────────────────────┤
│ 1. Gitea Actions: lint → test                                      │
│ 2. On tag: build wheel → publish to Gitea PyPI                    │
├────────────────────────────────────────────────────────────────────┤
│ 3. Update RayService version in homelab-k8s2                       │
│    (bump handler-base>=0.2.0 in runtime_env)                       │
├────────────────────────────────────────────────────────────────────┤
│ 4. Flux detects change → applies RayService                        │
│ 5. Ray downloads new packages → restarts deployments               │
└────────────────────────────────────────────────────────────────────┘

Alternatives Considered

Standalone Container Deployments

Run handlers as separate Kubernetes Deployments outside Ray.

Rejected because:

  • Duplicates infrastructure (separate scaling, health checks, etc.)
  • HTTP overhead for every inference call
  • Separate observability stack
  • Against the "Ray as unified compute" philosophy

Bake Handlers into Worker Images

Pre-install handler code in ray-worker images.

Rejected because:

  • Couples handler releases to image rebuilds
  • Slower iteration cycle
  • Larger images

Consequences

Positive

  • Single platform: Everything runs in Ray
  • Fast internal calls via Ray handles
  • Unified observability in Ray Dashboard
  • Clean abstraction layers: Kubeflow → KServe → Ray → GPU
  • Handlers scale with Ray's autoscaler

Negative

  • Handlers share Ray head node resources
  • Need to manage Gitea PyPI authentication for runtime_env
  • Slightly more complex RayService configuration

Neutral

  • MLflow can track handler "models" if we want versioned deployments
  • Kubeflow can trigger handler updates via pipelines
