Refactor NATS Handler Services from Python to Go
- Status: accepted
- Date: 2026-02-19
- Decided: 2026-02-21
- Deciders: Billy
- Technical Story: Reduce container image sizes and resource consumption for non-ML handler services by rewriting them in Go
Context and Problem Statement
The AI pipeline's non-inference services — chat-handler, voice-assistant, pipeline-bridge, tts-module, and the HTTP-forwarding variant of stt-module — are Python applications built on the handler-base shared library. None of these services perform local ML inference; they orchestrate calls to external Ray Serve endpoints over HTTP and route messages via NATS with MessagePack encoding.
Implementation note (2026-02-21): During the Go rewrite, the wire format was upgraded from MessagePack to Protocol Buffers (see ADR-0004, now superseded). The shared Go module is published as `handler-base` v1.0.0 (not `handler-go` as originally proposed).
Despite doing only lightweight I/O orchestration, each service inherits the full Python runtime and its dependency tree through handler-base (which pulls in numpy, pymilvus, redis, httpx, pydantic, opentelemetry-*, mlflow, and psycopg2-binary). This results in container images of 500–700 MB each — five services totalling ~3 GB of registry storage — for workloads that are fundamentally HTTP/NATS glue code.
The homelab already has two production Go services (companions-frontend and ntfy-discord) that prove the NATS + MessagePack + OpenTelemetry pattern works well in Go with images under 30 MB.
How do we reduce the image footprint and resource consumption of the non-ML handler services without disrupting the ML inference layer?
Decision Drivers
- Container images for glue services are 500–700 MB despite doing no ML work
- Go produces static binaries yielding images of ~15–30 MB (scratch/distroless base)
- Go services start in milliseconds vs. seconds for Python, improving pod scheduling
- Go's memory footprint is ~10× lower for equivalent I/O-bound workloads
- The NATS + msgpack + OTel pattern is already proven in `companions-frontend`
- Go has first-class Kubernetes client support (`client-go`) — relevant for `pipeline-bridge`
- ML inference services (Ray Serve, kuberay-images) must remain Python — only orchestration moves
- Five services share a common base (`handler-base`) — a single Go module replaces it for all
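The "HTTP/NATS glue" shape these drivers describe can be sketched with a few lines of Go. This is an illustrative sketch only: the `Conn` interface and `fakeConn` below are stand-ins (in production the services would use nats.go's `*nats.Conn`), so the example runs without a broker.

```go
package main

import (
	"fmt"
	"strings"
)

// Conn abstracts the one NATS capability the glue services rely on:
// request/reply on a subject. In production this would be nats.go's
// *nats.Conn; here an in-memory fake stands in so the sketch runs alone.
type Conn interface {
	Request(subject string, payload []byte) ([]byte, error)
}

type fakeConn struct {
	handlers map[string]func([]byte) []byte
}

func (c *fakeConn) Request(subject string, payload []byte) ([]byte, error) {
	h, ok := c.handlers[subject]
	if !ok {
		return nil, fmt.Errorf("no responder on %s", subject)
	}
	return h(payload), nil
}

// forward is the whole job of a glue service: receive a message, call a
// downstream endpoint (simulated here), and return the reply.
func forward(nc Conn, subject string, req string) (string, error) {
	resp, err := nc.Request(subject, []byte(req))
	return string(resp), err
}

func main() {
	nc := &fakeConn{handlers: map[string]func([]byte) []byte{
		"chat.generate": func(b []byte) []byte {
			// Stand-in for an HTTP call to a Ray Serve endpoint.
			return []byte(strings.ToUpper(string(b)))
		},
	}}
	out, _ := forward(nc, "chat.generate", "hello")
	fmt.Println(out)
}
```

Because the entire service is this thin request-routing layer, none of the Python ML dependency tree is needed at runtime.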
Considered Options
- Rewrite handler services in Go with a shared Go module
- Optimise Python images (multi-stage builds, slim deps, compiled wheels)
- Keep current Python stack unchanged
Decision Outcome
Chosen option: Option 1 — Rewrite handler services in Go, because the services are pure I/O orchestration with no ML dependencies, the Go pattern is already proven in-cluster, and the image + resource savings are an order of magnitude improvement that Python optimisation cannot match.
Positive Consequences
- Five container images shrink from ~3 GB total to ~100–150 MB total
- Sub-second cold start enables faster rollouts and autoscaling via KEDA
- Lower memory footprint frees cluster resources for ML workloads
- Eliminates Python runtime CVE surface area from non-ML services
- Single `handler-go` module provides shared NATS, health, OTel, and client code
- `pipeline-bridge` gains `client-go` — the canonical Kubernetes client library
- Go's type system catches message schema drift at compile time
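The compile-time schema-drift point can be illustrated with a typed handler. The message structs and `Handler`/`serve` names below are hypothetical stand-ins, not the actual module API; the real message types would be the protoc-generated structs.

```go
package main

import "fmt"

// Stand-in message types; in the real module these are protoc-generated.
type ChatRequest struct{ Query string }
type ChatResponse struct{ Text string }

// Handler is a typed message handler: request and response types are fixed
// at compile time, so wiring a handler to the wrong subject's payload type
// fails to build instead of failing at runtime as it would in Python.
type Handler[Req, Resp any] func(Req) (Resp, error)

func serve[Req, Resp any](h Handler[Req, Resp], req Req) (Resp, error) {
	return h(req)
}

func main() {
	chat := Handler[ChatRequest, ChatResponse](func(r ChatRequest) (ChatResponse, error) {
		return ChatResponse{Text: "echo: " + r.Query}, nil
	})
	resp, _ := serve(chat, ChatRequest{Query: "hi"})
	fmt.Println(resp.Text)
	// serve(chat, 42) // would not compile: int is not ChatRequest
}
```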
Negative Consequences
- One-time rewrite effort across five services
- Team must maintain Go and Python codebases (Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs)
- `handler-go` needs feature parity with `handler-base` for the orchestration subset (NATS client, health server, OTel, HTTP clients, Milvus client)
- Audio handling in `stt-module` (VAD) requires a Go webrtcvad binding or equivalent
Pros and Cons of the Options
Option 1 — Rewrite in Go
- Good, because images shrink from ~600 MB → ~20 MB per service
- Good, because memory usage drops from ~150 MB → ~15 MB per service
- Good, because startup time drops from ~3 s → <100 ms
- Good, because Go has mature libraries for every dependency (nats.go, client-go, otel-go, milvus-sdk-go)
- Good, because two existing Go services in the cluster prove the pattern
- Bad, because one-time engineering effort to rewrite five services
- Bad, because two language ecosystems to maintain
Option 2 — Optimise Python images
- Good, because no rewrite needed
- Good, because multi-stage builds and dependency trimming can reduce images by 30–50%
- Bad, because Python runtime + interpreter overhead remains (~200 MB floor)
- Bad, because memory and startup improvements are marginal
- Bad, because the `handler-base` dependency tree is difficult to slim without breaking shared code
Option 3 — Keep current stack
- Good, because zero effort
- Bad, because images remain 500–700 MB for glue code
- Bad, because resource waste reduces headroom for ML workloads
- Bad, because slow cold starts limit KEDA autoscaling effectiveness
Implementation Plan
Phase 1: handler-base Go Module (COMPLETE)
Published as `git.daviestechlabs.io/daviestechlabs/handler-base` v1.0.0 with:

| Package | Purpose | Python Equivalent |
|---|---|---|
| `natsutil/` | NATS publish/request/decode with protobuf encoding | `handler_base.nats_client` |
| `health/` | HTTP health + readiness server | `handler_base.health` |
| `telemetry/` | OTel traces + metrics setup | `handler_base.telemetry` |
| `config/` | Env-based configuration (struct tags) | `handler_base.config` (pydantic-settings) |
| `clients/` | HTTP clients for LLM, embeddings, reranker, STT, TTS | `handler_base.clients` |
| `handler/` | Typed NATS message handler with OTel + health wiring | `handler_base.handler` |
| `messages/` | Type aliases from generated protobuf stubs | `handler_base.messages` |
| `gen/messagespb/` | protoc-generated Go stubs (21 message types) | — |
| `proto/messages/v1/` | `.proto` schema source | — |
Phase 2: Service Ports (COMPLETE)
All five services rewritten in Go and migrated to handler-base v1.0.0 with protobuf wire format:
| Order | Service | Status | Notes |
|---|---|---|---|
| 1 | `pipeline-bridge` | ✅ Done | NATS + HTTP + k8s API calls. `Parameters` changed to `map[string]string`. |
| 2 | `tts-module` | ✅ Done | NATS ↔ HTTP bridge. `[]*TTSVoiceInfo` pointer slices, `int32` casts. |
| 3 | `chat-handler` | ✅ Done | Core text pipeline. `EffectiveQuery()` standalone func, `int32(TopK)`. |
| 4 | `voice-assistant` | ✅ Done | Same pattern with `[]*DocumentSource` pointer slices. |
| 5 | `stt-module` | ✅ Done | HTTP-forwarding variant. `SessionId`/`SpeakerId` field renames, `int32(Sequence)`. |
`companions-frontend` was also migrated: 129 lines of duplicated type definitions were replaced with type aliases from `handler-base/messages`.
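The alias pattern used for that migration can be sketched as follows. The `pbChatRequest` struct is a stand-in for a protoc-generated type (the real ones live in `gen/messagespb`), and the field names are illustrative; the `int32` cast mirrors the casts noted in the port table above.

```go
package main

import "fmt"

// pbChatRequest stands in for a protoc-generated struct; in the real
// module it would live in gen/messagespb.
type pbChatRequest struct {
	Query     string
	SessionId string
	TopK      int32
}

// The messages package re-exports generated types as aliases, so every
// service (and companions-frontend) shares one definition instead of
// duplicating it. This is a type alias, not a new named type, so values
// are interchangeable with no conversion.
type ChatRequest = pbChatRequest

func main() {
	topK := 5 // Python-side plain ints become explicit int32 casts in Go
	req := ChatRequest{Query: "hello", SessionId: "s1", TopK: int32(topK)}
	var raw pbChatRequest = req // identical types; assigns directly
	fmt.Println(raw.Query, raw.TopK)
}
```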
Phase 3: Cleanup (COMPLETE)
- Archive Python versions of ported services — the Python `handler-base` remains for Ray Serve/Kubeflow
- CI pipelines use `golangci-lint` v2 with errcheck, govet, staticcheck, misspell, bodyclose, nilerr
- All repos pass `golangci-lint run ./...` and `go test ./...`
- Wire format upgraded from MessagePack to Protocol Buffers (ADR-0004 superseded)
What Stays in Python
| Repository | Reason |
|---|---|
| `ray-serve` | PyTorch, vLLM, sentence-transformers — core ML inference |
| `kuberay-images` | GPU runtime Docker images (ROCm, CUDA, IPEX) |
| `gradio-ui` | Gradio is Python-only; dev/testing tool, not production |
| `kubeflow/` | Kubeflow Pipelines SDK is Python-only |
| `mlflow/` | MLflow SDK integration (tracking + model registry) |
| `stt-module` (local Whisper variant) | PyTorch + openai-whisper on GPU |
| `spark-analytics-jobs` | PySpark (being replaced by Flink anyway) |
Links
- Related: ADR-0003 — NATS as messaging backbone
- Related: ADR-0004 — MessagePack binary encoding
- Related: ADR-0011 — KubeRay unified GPU backend
- Related: ADR-0013 — Gitea Actions CI
- Related: ADR-0014 — Docker build best practices
- Related: ADR-0019 — Handler deployment strategy
- Related: ADR-0024 — Ray repository structure
- Related: ADR-0046 — Companions frontend (Go reference)
- Related: ADR-0051 — KEDA autoscaling