homelab-design/decisions/0003-use-nats-for-messaging.md

Use NATS for AI/ML Messaging

  • Status: accepted
  • Date: 2025-12-01
  • Deciders: Billy Davies
  • Technical Story: Selecting message bus for AI service orchestration

Context and Problem Statement

The AI/ML platform requires a messaging system for:

  • Real-time chat message routing
  • Voice request/response streaming
  • Pipeline triggers and status updates
  • Event-driven workflow orchestration

We need a messaging system that handles both ephemeral real-time messages and persistent streams.

Decision Drivers

  • Low latency for real-time chat/voice
  • Persistence for audit and replay
  • Simple operations for homelab
  • Support for request-reply pattern
  • Wildcard subscriptions for routing
  • Binary message support (audio data)
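The wildcard driver is worth making concrete. NATS subjects are `.`-separated tokens, where `*` matches exactly one token and `>` matches one or more trailing tokens. A minimal sketch of that matching rule (the helper name and example subjects are illustrative, not taken from the platform code):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject match: '*' = exactly one token, '>' = 1+ trailing tokens."""
    p_toks = pattern.split(".")
    s_toks = subject.split(".")
    for i, tok in enumerate(p_toks):
        if tok == ">":
            # '>' is a tail wildcard: it must cover at least one remaining token.
            return len(s_toks) >= i + 1
        if i >= len(s_toks):
            return False
        if tok != "*" and tok != s_toks[i]:
            return False
    return len(p_toks) == len(s_toks)

# One subscription on "chat.msgs.*" can route every chat room:
print(subject_matches("chat.msgs.*", "chat.msgs.room42"))       # True
print(subject_matches("pipeline.>", "pipeline.run.completed"))  # True
```

This is why a single subscriber can fan in all pipeline status updates without enumerating subjects up front.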

Considered Options

  • Apache Kafka
  • RabbitMQ
  • Redis Pub/Sub + Streams
  • NATS with JetStream
  • Apache Pulsar

Decision Outcome

Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
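As an illustration of the "persistent when needed" half, a JetStream stream can be declared with the nats CLI. The stream name, subject filter, and limits below are placeholders for the sake of the sketch, not values from this platform's specs:

```
# Hypothetical stream capturing all chat subjects for audit/replay.
nats stream add CHAT_EVENTS \
  --subjects "chat.>" \
  --storage file \
  --retention limits \
  --max-age 72h \
  --replicas 1 \
  --defaults
```

Publishers keep using plain `nc.publish()`; the stream captures matching subjects transparently, so ephemeral and persistent messaging share one subject space.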

Positive Consequences

  • Sub-millisecond latency for real-time messages
  • JetStream provides persistence when needed
  • Simple deployment (single binary)
  • Excellent Kubernetes integration
  • Request-reply pattern built-in
  • Wildcard subscriptions for flexible routing
  • Low resource footprint
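The built-in request-reply is a convention layered on core pub/sub: the requester subscribes to a unique `_INBOX.*` subject and passes it as the reply-to address. A toy, single-process Python sketch of that mechanism (not the real nats-py client, which is networked, async, and handles timeouts):

```python
import itertools
from collections import defaultdict

class MiniBus:
    """Toy in-process bus mimicking NATS core pub/sub plus request-reply.

    Real NATS is networked and asynchronous; this only demonstrates the
    reply-to convention that makes request-reply work over pub/sub.
    """

    def __init__(self):
        self._subs = defaultdict(list)       # subject -> list of callbacks
        self._inbox_seq = itertools.count()  # for unique reply subjects

    def subscribe(self, subject, cb):
        self._subs[subject].append(cb)

    def publish(self, subject, data, reply=None):
        # Fire-and-forget: deliver to every subscriber on this subject.
        for cb in list(self._subs.get(subject, [])):
            cb(subject, reply, data)

    def request(self, subject, data):
        # NATS clients do the same dance with a unique _INBOX.* subject.
        inbox = f"_INBOX.{next(self._inbox_seq)}"
        replies = []
        self.subscribe(inbox, lambda s, r, d: replies.append(d))
        self.publish(subject, data, reply=inbox)
        return replies[0] if replies else None

bus = MiniBus()
# Responder: answers on whatever reply subject the requester chose.
bus.subscribe("svc.ping", lambda subj, reply, data: bus.publish(reply, b"pong"))
print(bus.request("svc.ping", b"ping"))  # b'pong'
```

Because the reply address travels with the message, any of several responders can answer, which is also how NATS gets queue-group load balancing of requests for free.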

Negative Consequences

  • Less ecosystem than Kafka
  • JetStream less mature than Kafka Streams
  • No built-in schema registry
  • Smaller community than RabbitMQ

Pros and Cons of the Options

Apache Kafka

  • Good, because industry standard for streaming
  • Good, because rich ecosystem (Kafka Streams, Connect)
  • Good, because schema registry
  • Good, because excellent for high throughput
  • Bad, because operationally complex (ZooKeeper/KRaft)
  • Bad, because high resource requirements
  • Bad, because overkill for homelab scale
  • Bad, because higher latency for real-time messages

RabbitMQ

  • Good, because mature and stable
  • Good, because flexible routing
  • Good, because polished management UI
  • Bad, because AMQP protocol overhead
  • Bad, because not designed for streaming
  • Bad, because more complex clustering

Redis Pub/Sub + Streams

  • Good, because simple
  • Good, because Redis may already be in the stack
  • Good, because low latency
  • Bad, because pub/sub not persistent
  • Bad, because streams API less intuitive
  • Bad, because not primary purpose of Redis

NATS with JetStream

  • Good, because extremely low latency
  • Good, because simple operations
  • Good, because both pub/sub and persistence
  • Good, because request-reply built-in
  • Good, because wildcard subscriptions
  • Good, because low resource usage
  • Good, because excellent Go/Python clients
  • Bad, because smaller ecosystem
  • Bad, because JetStream newer than Kafka's streaming stack

Apache Pulsar

  • Good, because unified messaging + streaming
  • Good, because multi-tenancy
  • Good, because geo-replication
  • Bad, because complex architecture
  • Bad, because high resource requirements
  • Bad, because smaller community