From e57d998d9a0f7ac01873f2a0e676cefe7e930ac3 Mon Sep 17 00:00:00 2001
From: "Billy D."
Date: Thu, 19 Feb 2026 07:14:36 -0500
Subject: [PATCH] docs(adr): add ADR-0061 Go handler refactor

---
 ARCHITECTURE.md                       |   1 +
 decisions/0061-go-handler-refactor.md | 139 ++++++++++++++++++++++++++
 2 files changed, 140 insertions(+)
 create mode 100644 decisions/0061-go-handler-refactor.md

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index af67679..c50a303 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -317,6 +317,7 @@ Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafan
 | GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
 | KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
 | KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |
+| Go handler refactor | Slim images for non-ML services | [ADR-0061](decisions/0061-go-handler-refactor.md) |
 
 ## Related Documents
 
diff --git a/decisions/0061-go-handler-refactor.md b/decisions/0061-go-handler-refactor.md
new file mode 100644
index 0000000..3871e56
--- /dev/null
+++ b/decisions/0061-go-handler-refactor.md
@@ -0,0 +1,139 @@
+# Refactor NATS Handler Services from Python to Go
+
+* Status: proposed
+* Date: 2026-02-19
+* Deciders: Billy
+* Technical Story: Reduce container image sizes and resource consumption for non-ML handler services by rewriting them in Go
+
+## Context and Problem Statement
+
+The AI pipeline's non-inference services — `chat-handler`, `voice-assistant`, `pipeline-bridge`, `tts-module`, and the HTTP-forwarding variant of `stt-module` — are Python applications built on the `handler-base` shared library. None of these services perform local ML inference; they orchestrate calls to external Ray Serve endpoints over HTTP and route messages via NATS with MessagePack encoding.
+
+Despite doing only lightweight I/O orchestration, each service inherits the full Python runtime and its dependency tree through `handler-base` (which pulls in `numpy`, `pymilvus`, `redis`, `httpx`, `pydantic`, `opentelemetry-*`, `mlflow`, and `psycopg2-binary`). This results in container images of **500–700 MB each** — five services totalling **~3 GB** of registry storage — for workloads that are fundamentally HTTP/NATS glue code.
+
+The homelab already has two production Go services (`companions-frontend` and `ntfy-discord`) that prove the NATS + MessagePack + OpenTelemetry pattern works well in Go with images under 30 MB.
+
+How do we reduce the image footprint and resource consumption of the non-ML handler services without disrupting the ML inference layer?
+
+## Decision Drivers
+
+* Container images for glue services are 500–700 MB despite doing no ML work
+* Go produces static binaries yielding images of ~15–30 MB (scratch/distroless base)
+* Go services start in milliseconds vs. seconds for Python, improving pod scheduling
+* Go's memory footprint is ~10× lower for equivalent I/O-bound workloads
+* The NATS + msgpack + OTel pattern is already proven in `companions-frontend`
+* Go has first-class Kubernetes client support (`client-go`) — relevant for `pipeline-bridge`
+* ML inference services (Ray Serve, kuberay-images) must remain Python — only orchestration moves
+* Five services share a common base (`handler-base`) — a single Go module replaces it for all
+
+## Considered Options
+
+1. **Rewrite handler services in Go with a shared Go module**
+2. **Optimise Python images (multi-stage builds, slim deps, compiled wheels)**
+3. **Keep current Python stack unchanged**
+
+## Decision Outcome
+
+Chosen option: **Option 1 — Rewrite handler services in Go**, because the services are pure I/O orchestration with no ML dependencies, the Go pattern is already proven in-cluster, and the image + resource savings are an order of magnitude improvement that Python optimisation cannot match.
+
+### Positive Consequences
+
+* Five container images shrink from ~3 GB total to ~100–150 MB total
+* Sub-second cold start enables faster rollouts and autoscaling via KEDA
+* Lower memory footprint frees cluster resources for ML workloads
+* Eliminates Python runtime CVE surface area from non-ML services
+* Single `handler-go` module provides shared NATS, health, OTel, and client code
+* `pipeline-bridge` gains `client-go` — the canonical Kubernetes client library
+* Go's type system catches message schema drift at compile time
+
+### Negative Consequences
+
+* One-time rewrite effort across five services
+* Team must maintain Go **and** Python codebases (Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs)
+* `handler-go` needs feature parity with `handler-base` for the orchestration subset (NATS client, health server, OTel, HTTP clients, Milvus client)
+* Audio handling in `stt-module` (VAD) requires a Go webrtcvad binding or equivalent
+
+## Pros and Cons of the Options
+
+### Option 1 — Rewrite in Go
+
+* Good, because images shrink from ~600 MB → ~20 MB per service
+* Good, because memory usage drops from ~150 MB → ~15 MB per service
+* Good, because startup time drops from ~3 s → <100 ms
+* Good, because Go has mature libraries for every dependency (nats.go, client-go, otel-go, milvus-sdk-go)
+* Good, because two existing Go services in the cluster prove the pattern
+* Bad, because one-time engineering effort to rewrite five services
+* Bad, because two language ecosystems to maintain
+
+### Option 2 — Optimise Python images
+
+* Good, because no rewrite needed
+* Good, because multi-stage builds and dependency trimming can reduce images by 30–50%
+* Bad, because Python runtime + interpreter overhead remains (~200 MB floor)
+* Bad, because memory and startup improvements are marginal
+* Bad, because `handler-base` dependency tree is difficult to slim without breaking shared code
+
+### Option 3 — Keep current stack
+
+* Good, because zero effort
+* Bad, because images remain 500–700 MB for glue code
+* Bad, because resource waste reduces headroom for ML workloads
+* Bad, because slow cold starts limit KEDA autoscaling effectiveness
+
+## Implementation Plan
+
+### Phase 1: `handler-go` Shared Module
+
+Create `git.daviestechlabs.io/daviestechlabs/handler-go` as a Go module with:
+
+| Package | Purpose | Python Equivalent |
+|---------|---------|-------------------|
+| `nats/` | NATS/JetStream client with msgpack encoding | `handler_base.nats_client` |
+| `health/` | HTTP health + readiness server | `handler_base.health` |
+| `telemetry/` | OTel traces + metrics setup | `handler_base.telemetry` |
+| `config/` | Env-based configuration (struct tags) | `handler_base.config` (pydantic-settings) |
+| `clients/` | HTTP clients for LLM, embeddings, reranker, STT, TTS | `handler_base.clients` |
+| `milvus/` | Milvus vector search client | `pymilvus` wrapper in handler_base |
+
+Reference implementations: `companions-frontend/internal/` (NATS, msgpack, OTel), `ntfy-discord/internal/` (health, config, metrics).
+
+### Phase 2: Service Ports (in order of complexity)
+
+| Order | Service | Rationale |
+|-------|---------|-----------|
+| 1 | `pipeline-bridge` | Simplest — NATS + HTTP + k8s API calls. Validates `handler-go` module. |
+| 2 | `tts-module` | Tiny NATS ↔ HTTP bridge to external Coqui API |
+| 3 | `chat-handler` | Core text pipeline — NATS + Milvus + HTTP calls |
+| 4 | `voice-assistant` | Same pattern as chat-handler with audio base64 handling |
+| 5 | `stt-module` (streaming) | Requires Go VAD bindings for the HTTP-forwarding variant |
+
+### Phase 3: Cleanup
+
+* Archive Python versions of ported services
+* Update Flux manifests for new Go images
+* Update CI pipelines (Gitea Actions) for Go build/test/lint
+* Update CODING-CONVENTIONS.md with Go section
+
+### What Stays in Python
+
+| Repository | Reason |
+|------------|--------|
+| `ray-serve` | PyTorch, vLLM, sentence-transformers — core ML inference |
+| `kuberay-images` | GPU runtime Docker images (ROCm, CUDA, IPEX) |
+| `gradio-ui` | Gradio is Python-only; dev/testing tool, not production |
+| `kubeflow/` | Kubeflow Pipelines SDK is Python-only |
+| `mlflow/` | MLflow SDK integration (tracking + model registry) |
+| `stt-module` (local Whisper variant) | PyTorch + openai-whisper on GPU |
+| `spark-analytics-jobs` | PySpark (being replaced by Flink anyway) |
+
+## Links
+
+* Related: [ADR-0003](0003-use-nats-for-messaging.md) — NATS as messaging backbone
+* Related: [ADR-0004](0004-use-messagepack-for-nats.md) — MessagePack binary encoding
+* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend
+* Related: [ADR-0013](0013-gitea-actions-for-ci.md) — Gitea Actions CI
+* Related: [ADR-0014](0014-docker-build-best-practices.md) — Docker build best practices
+* Related: [ADR-0019](0019-handler-deployment-strategy.md) — Handler deployment strategy
+* Related: [ADR-0024](0024-ray-repository-structure.md) — Ray repository structure
+* Related: [ADR-0046](0046-companions-frontend-architecture.md) — Companions frontend (Go reference)
+* Related: [ADR-0051](0051-keda-event-driven-autoscaling.md) — KEDA autoscaling