Refactor NATS Handler Services from Python to Go
- Status: accepted
- Date: 2026-02-19
- Decided: 2026-02-21
- Deciders: Billy
- Technical Story: Reduce container image sizes and resource consumption for non-ML handler services by rewriting them in Go
Context and Problem Statement
The AI pipeline's non-inference services — chat-handler, voice-assistant, pipeline-bridge, tts-module, and the HTTP-forwarding variant of stt-module — are Python applications built on the handler-base shared library. None of these services perform local ML inference; they orchestrate calls to external Ray Serve endpoints over HTTP and route messages via NATS with MessagePack encoding.
Implementation note (2026-02-21): During the Go rewrite, the wire format was upgraded from MessagePack to Protocol Buffers (see ADR-0004, now superseded). The shared Go module is published as `handler-base` v1.0.0 (not `handler-go` as originally proposed).
Despite doing only lightweight I/O orchestration, each service inherits the full Python runtime and its dependency tree through handler-base (which pulls in numpy, pymilvus, redis, httpx, pydantic, opentelemetry-*, mlflow, and psycopg2-binary). This results in container images of 500–700 MB each — five services totalling ~3 GB of registry storage — for workloads that are fundamentally HTTP/NATS glue code.
The homelab already has two production Go services (companions-frontend and ntfy-discord) that prove the NATS + MessagePack + OpenTelemetry pattern works well in Go with images under 30 MB.
How do we reduce the image footprint and resource consumption of the non-ML handler services without disrupting the ML inference layer?
Decision Drivers
- Container images for glue services are 500–700 MB despite doing no ML work
- Go produces static binaries yielding images of ~15–30 MB (scratch/distroless base)
- Go services start in milliseconds vs. seconds for Python, improving pod scheduling
- Go's memory footprint is ~10× lower for equivalent I/O-bound workloads
- The NATS + msgpack + OTel pattern is already proven in `companions-frontend`
- Go has first-class Kubernetes client support (`client-go`) — relevant for `pipeline-bridge`
- ML inference services (Ray Serve, kuberay-images) must remain Python — only orchestration moves
- Five services share a common base (`handler-base`) — a single Go module replaces it for all
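The "HTTP/NATS glue" shape these drivers describe can be sketched with a few lines of Go. This is an illustrative sketch only: the `Conn` interface and `fakeConn` below are stand-ins (in production the services would use nats.go's `*nats.Conn`), so the example runs without a broker.

```go
package main

import (
	"fmt"
	"strings"
)

// Conn abstracts the one NATS capability the glue services rely on:
// request/reply on a subject. In production this would be nats.go's
// *nats.Conn; here an in-memory fake stands in so the sketch runs alone.
type Conn interface {
	Request(subject string, payload []byte) ([]byte, error)
}

type fakeConn struct {
	handlers map[string]func([]byte) []byte
}

func (c *fakeConn) Request(subject string, payload []byte) ([]byte, error) {
	h, ok := c.handlers[subject]
	if !ok {
		return nil, fmt.Errorf("no responder on %s", subject)
	}
	return h(payload), nil
}

// forward is the whole job of a glue service: receive a message, call a
// downstream endpoint (simulated here), and return the reply.
func forward(nc Conn, subject string, req string) (string, error) {
	resp, err := nc.Request(subject, []byte(req))
	return string(resp), err
}

func main() {
	nc := &fakeConn{handlers: map[string]func([]byte) []byte{
		"chat.generate": func(b []byte) []byte {
			// Stand-in for an HTTP call to a Ray Serve endpoint.
			return []byte(strings.ToUpper(string(b)))
		},
	}}
	out, _ := forward(nc, "chat.generate", "hello")
	fmt.Println(out)
}
```

Because the entire service is this thin request-routing layer, none of the Python ML dependency tree is needed at runtime.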
Considered Options
- Rewrite handler services in Go with a shared Go module
- Optimise Python images (multi-stage builds, slim deps, compiled wheels)
- Keep current Python stack unchanged
Decision Outcome
Chosen option: Option 1 — Rewrite handler services in Go, because the services are pure I/O orchestration with no ML dependencies, the Go pattern is already proven in-cluster, and the image + resource savings are an order of magnitude improvement that Python optimisation cannot match.
Positive Consequences
- Five container images shrink from ~3 GB total to ~100–150 MB total
- Sub-second cold start enables faster rollouts and autoscaling via KEDA
- Lower memory footprint frees cluster resources for ML workloads
- Eliminates Python runtime CVE surface area from non-ML services
- Single `handler-go` module provides shared NATS, health, OTel, and client code
- `pipeline-bridge` gains `client-go` — the canonical Kubernetes client library
- Go's type system catches message schema drift at compile time
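The compile-time schema-drift point can be illustrated with a typed handler. The message structs and `Handler`/`serve` names below are hypothetical stand-ins, not the actual module API; the real message types would be the protoc-generated structs.

```go
package main

import "fmt"

// Stand-in message types; in the real module these are protoc-generated.
type ChatRequest struct{ Query string }
type ChatResponse struct{ Text string }

// Handler is a typed message handler: request and response types are fixed
// at compile time, so wiring a handler to the wrong subject's payload type
// fails to build instead of failing at runtime as it would in Python.
type Handler[Req, Resp any] func(Req) (Resp, error)

func serve[Req, Resp any](h Handler[Req, Resp], req Req) (Resp, error) {
	return h(req)
}

func main() {
	chat := Handler[ChatRequest, ChatResponse](func(r ChatRequest) (ChatResponse, error) {
		return ChatResponse{Text: "echo: " + r.Query}, nil
	})
	resp, _ := serve(chat, ChatRequest{Query: "hi"})
	fmt.Println(resp.Text)
	// serve(chat, 42) // would not compile: int is not ChatRequest
}
```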
Negative Consequences
- One-time rewrite effort across five services
- Team must maintain Go and Python codebases (Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs)
- `handler-go` needs feature parity with `handler-base` for the orchestration subset (NATS client, health server, OTel, HTTP clients, Milvus client)
- Audio handling in `stt-module` (VAD) requires a Go webrtcvad binding or equivalent
Pros and Cons of the Options
Option 1 — Rewrite in Go
- Good, because images shrink from ~600 MB → ~20 MB per service
- Good, because memory usage drops from ~150 MB → ~15 MB per service
- Good, because startup time drops from ~3 s → <100 ms
- Good, because Go has mature libraries for every dependency (nats.go, client-go, otel-go, milvus-sdk-go)
- Good, because two existing Go services in the cluster prove the pattern
- Bad, because one-time engineering effort to rewrite five services
- Bad, because two language ecosystems to maintain
Option 2 — Optimise Python images
- Good, because no rewrite needed
- Good, because multi-stage builds and dependency trimming can reduce images by 30–50%
- Bad, because Python runtime + interpreter overhead remains (~200 MB floor)
- Bad, because memory and startup improvements are marginal
- Bad, because the `handler-base` dependency tree is difficult to slim without breaking shared code
Option 3 — Keep current stack
- Good, because zero effort
- Bad, because images remain 500–700 MB for glue code
- Bad, because resource waste reduces headroom for ML workloads
- Bad, because slow cold starts limit KEDA autoscaling effectiveness
Implementation Plan
Phase 1: handler-base Go Module (COMPLETE)
Published as `git.daviestechlabs.io/daviestechlabs/handler-base` v1.0.0 with:

| Package | Purpose | Python Equivalent |
|---|---|---|
| `natsutil/` | NATS publish/request/decode with protobuf encoding | `handler_base.nats_client` |
| `health/` | HTTP health + readiness server | `handler_base.health` |
| `telemetry/` | OTel traces + metrics setup | `handler_base.telemetry` |
| `config/` | Env-based configuration (struct tags) | `handler_base.config` (pydantic-settings) |
| `clients/` | HTTP clients for LLM, embeddings, reranker, STT, TTS | `handler_base.clients` |
| `handler/` | Typed NATS message handler with OTel + health wiring | `handler_base.handler` |
| `messages/` | Type aliases from generated protobuf stubs | `handler_base.messages` |
| `gen/messagespb/` | protoc-generated Go stubs (21 message types) | — |
| `proto/messages/v1/` | `.proto` schema source | — |
Phase 2: Service Ports (COMPLETE)
All five services rewritten in Go and migrated to handler-base v1.0.0 with protobuf wire format:
| Order | Service | Status | Notes |
|---|---|---|---|
| 1 | `pipeline-bridge` | ✅ Done | NATS + HTTP + k8s API calls. `Parameters` changed to `map[string]string`. |
| 2 | `tts-module` | ✅ Done | NATS ↔ HTTP bridge. `[]*TTSVoiceInfo` pointer slices, `int32` casts. |
| 3 | `chat-handler` | ✅ Done | Core text pipeline. `EffectiveQuery()` standalone func, `int32(TopK)`. |
| 4 | `voice-assistant` | ✅ Done | Same pattern with `[]*DocumentSource` pointer slices. |
| 5 | `stt-module` | ✅ Done | HTTP-forwarding variant. `SessionId`/`SpeakerId` field renames, `int32(Sequence)`. |
`companions-frontend` was also migrated: 129 lines of duplicated type definitions were replaced with type aliases from `handler-base/messages`.
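The alias pattern used for that migration can be sketched as follows. The `pbChatRequest` struct is a stand-in for a protoc-generated type (the real ones live in `gen/messagespb`), and the field names are illustrative; the `int32` cast mirrors the casts noted in the port table above.

```go
package main

import "fmt"

// pbChatRequest stands in for a protoc-generated struct; in the real
// module it would live in gen/messagespb.
type pbChatRequest struct {
	Query     string
	SessionId string
	TopK      int32
}

// The messages package re-exports generated types as aliases, so every
// service (and companions-frontend) shares one definition instead of
// duplicating it. This is a type alias, not a new named type, so values
// are interchangeable with no conversion.
type ChatRequest = pbChatRequest

func main() {
	topK := 5 // Python-side plain ints become explicit int32 casts in Go
	req := ChatRequest{Query: "hello", SessionId: "s1", TopK: int32(topK)}
	var raw pbChatRequest = req // identical types; assigns directly
	fmt.Println(raw.Query, raw.TopK)
}
```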
Phase 3: Cleanup (COMPLETE)
- Archive Python versions of ported services — the Python `handler-base` remains for Ray Serve/Kubeflow
- CI pipelines use `golangci-lint` v2 with errcheck, govet, staticcheck, misspell, bodyclose, nilerr
- All repos pass `golangci-lint run ./...` and `go test ./...`
- Wire format upgraded from MessagePack to Protocol Buffers (ADR-0004 superseded)
What Stays in Python
| Repository | Reason |
|---|---|
| `ray-serve` | PyTorch, vLLM, sentence-transformers — core ML inference |
| `kuberay-images` | GPU runtime Docker images (ROCm, CUDA, IPEX) |
| `gradio-ui` | Gradio is Python-only; dev/testing tool, not production |
| `kubeflow/` | Kubeflow Pipelines SDK is Python-only |
| `mlflow/` | MLflow SDK integration (tracking + model registry) |
| `stt-module` (local Whisper variant) | PyTorch + openai-whisper on GPU |
| `spark-analytics-jobs` | PySpark (being replaced by Flink anyway) |
Links
- Related: ADR-0003 — NATS as messaging backbone
- Related: ADR-0004 — MessagePack binary encoding
- Related: ADR-0011 — KubeRay unified GPU backend
- Related: ADR-0013 — Gitea Actions CI
- Related: ADR-0014 — Docker build best practices
- Related: ADR-0019 — Handler deployment strategy
- Related: ADR-0024 — Ray repository structure
- Related: ADR-0046 — Companions frontend (Go reference)
- Related: ADR-0051 — KEDA autoscaling