# Refactor NATS Handler Services from Python to Go
* Status: proposed
* Date: 2026-02-19
* Deciders: Billy
* Technical Story: Reduce container image sizes and resource consumption for non-ML handler services by rewriting them in Go
## Context and Problem Statement
The AI pipeline's non-inference services — `chat-handler`, `voice-assistant`, `pipeline-bridge`, `tts-module`, and the HTTP-forwarding variant of `stt-module` — are Python applications built on the `handler-base` shared library. None of these services perform local ML inference; they orchestrate calls to external Ray Serve endpoints over HTTP and route messages via NATS with MessagePack encoding.
Despite doing only lightweight I/O orchestration, each service inherits the full Python runtime and its dependency tree through `handler-base` (which pulls in `numpy`, `pymilvus`, `redis`, `httpx`, `pydantic`, `opentelemetry-*`, `mlflow`, and `psycopg2-binary`). This results in container images of **500–700 MB each**, with the five services totalling **~3 GB** of registry storage, for workloads that are fundamentally HTTP/NATS glue code.
The homelab already has two production Go services (`companions-frontend` and `ntfy-discord`) that prove the NATS + MessagePack + OpenTelemetry pattern works well in Go with images under 30 MB.
How do we reduce the image footprint and resource consumption of the non-ML handler services without disrupting the ML inference layer?
## Decision Drivers
* Container images for glue services are 500–700 MB despite doing no ML work
* Go produces static binaries yielding images of ~15–30 MB (scratch/distroless base)
* Go services start in milliseconds vs. seconds for Python, improving pod scheduling
* Go's memory footprint is ~10× lower for equivalent I/O-bound workloads
* The NATS + msgpack + OTel pattern is already proven in `companions-frontend`
* Go has first-class Kubernetes client support (`client-go`) — relevant for `pipeline-bridge`
* ML inference services (Ray Serve, kuberay-images) must remain Python — only orchestration moves
* Five services share a common base (`handler-base`) — a single Go module replaces it for all
## Considered Options
1. **Rewrite handler services in Go with a shared Go module**
2. **Optimise Python images (multi-stage builds, slim deps, compiled wheels)**
3. **Keep current Python stack unchanged**
## Decision Outcome
Chosen option: **Option 1 — Rewrite handler services in Go**, because the services are pure I/O orchestration with no ML dependencies, the Go pattern is already proven in-cluster, and the image + resource savings are an order of magnitude improvement that Python optimisation cannot match.
### Positive Consequences
* Five container images shrink from ~3 GB total to ~100–150 MB total
* Sub-second cold start enables faster rollouts and autoscaling via KEDA
* Lower memory footprint frees cluster resources for ML workloads
* Eliminates Python runtime CVE surface area from non-ML services
* Single `handler-go` module provides shared NATS, health, OTel, and client code
* `pipeline-bridge` gains `client-go` — the canonical Kubernetes client library
* Go's type system catches message schema drift at compile time
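
As an illustration of the compile-time point, here is a minimal sketch using hypothetical `ChatRequest` and `ChatResponse` types. The field names and the `msgpack` struct tags are assumptions for the example, not the actual message schema, and no encoding library is imported.

```go
package main

import "fmt"

// ChatRequest is a hypothetical inbound message. The msgpack tags are
// plain struct-tag strings here; an encoder would read them at runtime.
type ChatRequest struct {
	RequestID string `msgpack:"request_id"`
	UserID    string `msgpack:"user_id"`
	Message   string `msgpack:"message"`
}

// ChatResponse is a hypothetical outbound message.
type ChatResponse struct {
	RequestID string `msgpack:"request_id"`
	Text      string `msgpack:"text"`
}

// Reply builds a response that echoes the request ID. If RequestID were
// renamed or removed from either struct, this function would stop
// compiling, as would every other handler that references the field.
func Reply(req ChatRequest, text string) ChatResponse {
	return ChatResponse{RequestID: req.RequestID, Text: text}
}

func main() {
	resp := Reply(ChatRequest{RequestID: "abc", Message: "hi"}, "hello")
	fmt.Printf("%+v\n", resp)
}
```

The check only covers Go code that shares these type definitions. A mismatch between the tag strings and what a Python producer actually sends is still a runtime concern and would need contract tests.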
### Negative Consequences
* One-time rewrite effort across five services
* Team must maintain Go **and** Python codebases (Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs)
* `handler-go` needs feature parity with `handler-base` for the orchestration subset (NATS client, health server, OTel, HTTP clients, Milvus client)
* Audio handling in `stt-module` (VAD) requires a Go webrtcvad binding or equivalent
## Pros and Cons of the Options
### Option 1 — Rewrite in Go
* Good, because images shrink from ~600 MB → ~20 MB per service
* Good, because memory usage drops from ~150 MB → ~15 MB per service
* Good, because startup time drops from ~3 s → <100 ms
* Good, because Go has mature libraries for every dependency (nats.go, client-go, otel-go, milvus-sdk-go)
* Good, because two existing Go services in the cluster prove the pattern
* Bad, because one-time engineering effort to rewrite five services
* Bad, because two language ecosystems to maintain
### Option 2 — Optimise Python images
* Good, because no rewrite needed
* Good, because multi-stage builds and dependency trimming can reduce images by 30–50%
* Bad, because Python runtime + interpreter overhead remains (~200 MB floor)
* Bad, because memory and startup improvements are marginal
* Bad, because `handler-base` dependency tree is difficult to slim without breaking shared code
### Option 3 — Keep current stack
* Good, because zero effort
* Bad, because images remain 500–700 MB for glue code
* Bad, because resource waste reduces headroom for ML workloads
* Bad, because slow cold starts limit KEDA autoscaling effectiveness
## Implementation Plan
### Phase 1: `handler-go` Shared Module
Create `git.daviestechlabs.io/daviestechlabs/handler-go` as a Go module with:
| Package | Purpose | Python Equivalent |
|---------|---------|-------------------|
| `nats/` | NATS/JetStream client with msgpack encoding | `handler_base.nats_client` |
| `health/` | HTTP health + readiness server | `handler_base.health` |
| `telemetry/` | OTel traces + metrics setup | `handler_base.telemetry` |
| `config/` | Env-based configuration (struct tags) | `handler_base.config` (pydantic-settings) |
| `clients/` | HTTP clients for LLM, embeddings, reranker, STT, TTS | `handler_base.clients` |
| `milvus/` | Milvus vector search client | `pymilvus` wrapper in handler_base |
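
One way the `config/` row could look is sketched below: a small reflection-based loader that fills a struct from environment variables using struct tags. The `Config` fields, the `env` and `default` tag names, and the `Load` signature are all assumptions for illustration; an existing library could be used instead.

```go
package main

import (
	"fmt"
	"os"
	"reflect"
	"strconv"
	"time"
)

// Config is a hypothetical settings struct; the field names, env keys,
// and defaults are illustrative only.
type Config struct {
	NATSURL     string        `env:"NATS_URL" default:"nats://localhost:4222"`
	HealthPort  int           `env:"HEALTH_PORT" default:"8080"`
	HTTPTimeout time.Duration `env:"HTTP_TIMEOUT" default:"30s"`
}

var durationType = reflect.TypeOf(time.Duration(0))

// Load fills the struct pointed to by dst. For each field with an `env`
// tag it asks lookup for a value and falls back to the `default` tag.
// Passing os.LookupEnv reads the real environment; tests can pass a stub.
func Load(dst any, lookup func(string) (string, bool)) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Pointer || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("config: need pointer to struct, got %T", dst)
	}
	v = v.Elem()
	for i := 0; i < v.NumField(); i++ {
		sf := v.Type().Field(i)
		key := sf.Tag.Get("env")
		if key == "" {
			continue // untagged fields are left untouched
		}
		raw, ok := lookup(key)
		if !ok {
			raw = sf.Tag.Get("default")
		}
		field := v.Field(i)
		switch {
		case sf.Type == durationType: // checked first: Duration's kind is Int64
			d, err := time.ParseDuration(raw)
			if err != nil {
				return fmt.Errorf("config: %s: %w", key, err)
			}
			field.SetInt(int64(d))
		case sf.Type.Kind() == reflect.String:
			field.SetString(raw)
		case sf.Type.Kind() == reflect.Int:
			n, err := strconv.Atoi(raw)
			if err != nil {
				return fmt.Errorf("config: %s: %w", key, err)
			}
			field.SetInt(int64(n))
		default:
			return fmt.Errorf("config: %s: unsupported type %s", key, sf.Type)
		}
	}
	return nil
}

func main() {
	var cfg Config
	if err := Load(&cfg, os.LookupEnv); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Taking the lookup function as a parameter keeps the loader free of global state, so a service can fail fast on a malformed value at startup and unit tests do not need to mutate the process environment.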
Reference implementations: `companions-frontend/internal/` (NATS, msgpack, OTel), `ntfy-discord/internal/` (health, config, metrics).
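
For the `health/` package, a minimal sketch using only the standard library might look like the following. The `Health` type, the `/healthz` and `/readyz` paths, and the `probe` helper are assumptions for the example and are not taken from `ntfy-discord` or `handler-base`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health tracks readiness and serves liveness/readiness endpoints.
// The zero value is usable and starts out "not ready".
type Health struct {
	ready atomic.Bool
}

// SetReady flips the readiness flag, e.g. once the NATS connection is up.
func (h *Health) SetReady(v bool) { h.ready.Store(v) }

// Handler returns an http.Handler with two hypothetical probe paths.
func (h *Health) Handler() http.Handler {
	mux := http.NewServeMux()
	// Liveness: the process is running and able to serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: report 503 until dependencies are connected.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !h.ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the
// status code, without opening a network socket.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &Health{}
	handler := h.Handler()
	fmt.Println("before ready:", probe(handler, "/readyz"))
	h.SetReady(true)
	fmt.Println("after ready: ", probe(handler, "/readyz"))
}
```

In a real service the handler would be passed to `http.ListenAndServe` on a dedicated port, and `SetReady(true)` would be called after the NATS subscription is established.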
### Phase 2: Service Ports (in order of complexity)
| Order | Service | Rationale |
|-------|---------|-----------|
| 1 | `pipeline-bridge` | Simplest — NATS + HTTP + k8s API calls. Validates `handler-go` module. |
| 2 | `tts-module` | Tiny NATS ↔ HTTP bridge to external Coqui API |
| 3 | `chat-handler` | Core text pipeline — NATS + Milvus + HTTP calls |
| 4 | `voice-assistant` | Same pattern as chat-handler with audio base64 handling |
| 5 | `stt-module` (streaming) | Requires Go VAD bindings for the HTTP-forwarding variant |
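
The HTTP half of the bridge pattern shared by these services can be sketched with the standard library alone. In the example below, `Forward`, `fakeUpstream`, `roundTrip`, and the `tts.invalid` URL are hypothetical names for illustration; the NATS subscription that would supply the payload and publish the reply is left out.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Forward POSTs a payload to an upstream endpoint and returns the response
// body. A non-200 status is reported as an error so the caller can decide
// whether to retry or publish an error reply.
func Forward(ctx context.Context, client *http.Client, url string, payload []byte) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/octet-stream")
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("call upstream: %w", err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read response: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("upstream returned %s", resp.Status)
	}
	return body, nil
}

// fakeUpstream is an in-process stand-in for the external API. It echoes
// the request body with an "audio:" prefix and never touches the network.
type fakeUpstream struct{}

func (fakeUpstream) RoundTrip(r *http.Request) (*http.Response, error) {
	in, err := io.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	return &http.Response{
		StatusCode: http.StatusOK,
		Status:     "200 OK",
		Header:     make(http.Header),
		Body:       io.NopCloser(bytes.NewReader(append([]byte("audio:"), in...))),
		Request:    r,
	}, nil
}

// roundTrip runs Forward against the fake upstream with a 5-second deadline.
func roundTrip(payload string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	client := &http.Client{Transport: fakeUpstream{}}
	out, err := Forward(ctx, client, "http://tts.invalid/synthesize", []byte(payload))
	return string(out), err
}

func main() {
	out, err := roundTrip("hello")
	fmt.Println(out, err)
}
```

Passing the `*http.Client` in, instead of using a package-level default, lets each service set its own timeouts and lets tests swap the transport as shown.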
### Phase 3: Cleanup
* Archive Python versions of ported services
* Update Flux manifests for new Go images
* Update CI pipelines (Gitea Actions) for Go build/test/lint
* Update CODING-CONVENTIONS.md with Go section
### What Stays in Python
| Repository | Reason |
|------------|--------|
| `ray-serve` | PyTorch, vLLM, sentence-transformers — core ML inference |
| `kuberay-images` | GPU runtime Docker images (ROCm, CUDA, IPEX) |
| `gradio-ui` | Gradio is Python-only; dev/testing tool, not production |
| `kubeflow/` | Kubeflow Pipelines SDK is Python-only |
| `mlflow/` | MLflow SDK integration (tracking + model registry) |
| `stt-module` (local Whisper variant) | PyTorch + openai-whisper on GPU |
| `spark-analytics-jobs` | PySpark (being replaced by Flink anyway) |
## Links
* Related: [ADR-0003](0003-use-nats-for-messaging.md) — NATS as messaging backbone
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) — MessagePack binary encoding
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend
* Related: [ADR-0013](0013-gitea-actions-for-ci.md) — Gitea Actions CI
* Related: [ADR-0014](0014-docker-build-best-practices.md) — Docker build best practices
* Related: [ADR-0019](0019-handler-deployment-strategy.md) — Handler deployment strategy
* Related: [ADR-0024](0024-ray-repository-structure.md) — Ray repository structure
* Related: [ADR-0046](0046-companions-frontend-architecture.md) — Companions frontend (Go reference)
* Related: [ADR-0051](0051-keda-event-driven-autoscaling.md) — KEDA autoscaling