homelab-design/decisions/0003-use-nats-for-messaging.md
Billy D. 832cda34bd feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00

# Use NATS for AI/ML Messaging
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration
## Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
## Decision Drivers
* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)
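The wildcard-subscription driver refers to NATS subject tokens: `*` matches exactly one dot-separated token, while `>` matches one or more trailing tokens. A minimal sketch of that matching rule in plain Python (the function name and subjects are ours, not part of any client library):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches a subject.

    '*' matches exactly one dot-separated token; '>' matches
    one or more trailing tokens and must appear last.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final token and consume at least one token
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)

# Route every chat event for any room, but not voice traffic
assert subject_matches("chat.*.events", "chat.room42.events")
assert subject_matches("chat.>", "chat.room42.events.edited")
assert not subject_matches("chat.*.events", "voice.room42.events")
```

This is what lets one subscriber pick up, say, all pipeline status updates (`pipelines.*.status`) without enumerating pipelines in advance.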
## Considered Options
* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar
## Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
### Positive Consequences
* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint
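On the "persistence when needed" point: JetStream persistence is opt-in per subject space, declared as a stream. A hypothetical stream definition for chat history (name and limits are illustrative, not from the actual specs directory), in the JSON shape the `nats` CLI accepts via `stream add --config`:

```json
{
  "name": "CHAT_EVENTS",
  "subjects": ["chat.>"],
  "retention": "limits",
  "storage": "file",
  "max_age": 604800000000000,
  "max_bytes": 1073741824,
  "discard": "old",
  "num_replicas": 1
}
```

Subjects outside any stream remain fire-and-forget, so real-time voice frames and durable audit events can share one broker. (`max_age` is in nanoseconds; the value above is seven days.)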
### Negative Consequences
* Smaller ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ
## Pros and Cons of the Options
### Apache Kafka
* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages
### RabbitMQ
* Good, because mature and stable
* Good, because flexible routing
* Good, because it ships a solid management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering
### Redis Pub/Sub + Streams
* Good, because simple
* Good, because Redis may already be in the stack
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not primary purpose of Redis
### NATS with JetStream
* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream is newer and less battle-tested than Kafka
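The built-in request-reply works by having the requester subscribe to a unique `_INBOX.*` subject and send that subject alongside the payload; the responder publishes its answer there. A server-free asyncio sketch of that inbox pattern (queues stand in for NATS subjects; all class and subject names are ours, and real code would use the `nats-py` client instead):

```python
import asyncio
import uuid


class MiniBus:
    """Toy in-memory bus mimicking NATS request-reply via inbox subjects."""

    def __init__(self) -> None:
        self.queues: dict[str, asyncio.Queue] = {}

    def _queue(self, subject: str) -> asyncio.Queue:
        return self.queues.setdefault(subject, asyncio.Queue())

    async def publish(self, subject: str, data: bytes, reply: str = "") -> None:
        await self._queue(subject).put((data, reply))

    async def request(self, subject: str, data: bytes, timeout: float = 1.0) -> bytes:
        inbox = f"_INBOX.{uuid.uuid4().hex}"  # unique reply subject
        await self.publish(subject, data, reply=inbox)
        payload, _ = await asyncio.wait_for(self._queue(inbox).get(), timeout)
        return payload

    async def serve(self, subject: str, handler) -> None:
        # Handle one request: read payload, publish result to its inbox
        data, reply = await self._queue(subject).get()
        await self.publish(reply, handler(data))


async def main() -> bytes:
    bus = MiniBus()
    # Responder uppercases the payload, standing in for a tiny STT service
    asyncio.create_task(bus.serve("voice.stt", lambda b: b.upper()))
    return await bus.request("voice.stt", b"hello")


print(asyncio.run(main()))  # b'HELLO'
```

With NATS the inbox bookkeeping is handled by the client library in a single `nc.request(...)` call, which is why the pattern counts as "built-in" here rather than something each service reimplements.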
### Apache Pulsar
* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
## Links
* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format