Use NATS for AI/ML Messaging
- Status: accepted
- Date: 2025-12-01
- Deciders: Billy Davies
- Technical Story: Selecting message bus for AI service orchestration
Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
Decision Drivers
- Low latency for real-time chat/voice
- Persistence for audit and replay
- Simple operations for homelab
- Support for request-reply pattern
- Wildcard subscriptions for routing
- Binary message support (audio data)
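The wildcard-subscription driver above relies on NATS subject semantics: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. A minimal pure-Python sketch of that matching rule (just the semantics, not the client library; the subject names are illustrative):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a subscription pattern.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must cover at least one remaining token
            return len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without a trailing '>', token counts must match exactly
    return len(p_tokens) == len(s_tokens)

# subject_matches("chat.*.messages", "chat.room1.messages")  → True
# subject_matches("chat.>", "chat")                          → False
```

This is what lets one subscriber route, say, all of `chat.>` while another listens only to `chat.*.messages`.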
Considered Options
- Apache Kafka
- RabbitMQ
- Redis Pub/Sub + Streams
- NATS with JetStream
- Apache Pulsar
Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
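One reason the combination works well here: NATS request-reply is essentially pub/sub plus a unique reply inbox generated per request. The toy in-process bus below models that pattern in plain asyncio; it is a sketch of the mechanism, not the nats-py client API, and all names (`MiniBus`, `svc.echo`) are illustrative.

```python
import asyncio
import uuid


class MiniBus:
    """Toy in-process bus modelling NATS-style request-reply:
    a request is published on a subject carrying a unique reply
    inbox, and the responder publishes its answer to that inbox."""

    def __init__(self):
        self._subs = {}  # subject -> list of async callbacks

    def subscribe(self, subject, cb):
        self._subs.setdefault(subject, []).append(cb)

    async def publish(self, subject, data, reply=None):
        for cb in list(self._subs.get(subject, [])):
            await cb(subject, reply, data)

    async def request(self, subject, data, timeout=1.0):
        inbox = f"_INBOX.{uuid.uuid4().hex}"  # unique per request
        fut = asyncio.get_running_loop().create_future()

        async def deliver(_subject, _reply, payload):
            if not fut.done():
                fut.set_result(payload)

        self.subscribe(inbox, deliver)
        await self.publish(subject, data, reply=inbox)
        return await asyncio.wait_for(fut, timeout)


async def _demo():
    bus = MiniBus()

    async def echo(_subject, reply, data):
        if reply:  # respond only when the sender expects a reply
            await bus.publish(reply, b"echo:" + data)

    bus.subscribe("svc.echo", echo)
    return await bus.request("svc.echo", b"hi")


result = asyncio.run(_demo())  # → b"echo:hi"
```

Because the reply address travels with the message, any one of several competing responders can answer, which is the property the voice request/response flow depends on.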
Positive Consequences
- Sub-millisecond latency for real-time messages
- JetStream provides persistence when needed
- Simple deployment (single binary)
- Excellent Kubernetes integration
- Request-reply pattern built-in
- Wildcard subscriptions for flexible routing
- Low resource footprint
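To make the persistence point concrete: a file-backed JetStream stream is declared with a small config object, for example via `nats stream add --config`. The stream name, subjects, and limits below are illustrative, not taken from the cluster's actual specs (durations are in nanoseconds; `max_age` here is 7 days):

```json
{
  "name": "CHAT_EVENTS",
  "subjects": ["chat.>", "voice.>"],
  "retention": "limits",
  "storage": "file",
  "max_msgs": -1,
  "max_bytes": 1073741824,
  "max_age": 604800000000000,
  "num_replicas": 1
}
```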
Negative Consequences
- Less ecosystem than Kafka
- JetStream less mature than Kafka Streams
- No built-in schema registry
- Smaller community than RabbitMQ
Pros and Cons of the Options
Apache Kafka
- Good, because industry standard for streaming
- Good, because rich ecosystem (Kafka Streams, Connect)
- Good, because schema registry
- Good, because excellent for high throughput
- Bad, because operationally complex (ZooKeeper/KRaft)
- Bad, because high resource requirements
- Bad, because overkill for homelab scale
- Bad, because higher latency for real-time messages
RabbitMQ
- Good, because mature and stable
- Good, because flexible routing
- Good, because it ships a capable management UI
- Bad, because AMQP protocol overhead
- Bad, because not designed for streaming
- Bad, because more complex clustering
Redis Pub/Sub + Streams
- Good, because simple
- Good, because Redis may already be deployed for other uses
- Good, because low latency
- Bad, because pub/sub not persistent
- Bad, because the Streams API is less intuitive
- Bad, because messaging is not Redis's primary purpose
NATS with JetStream
- Good, because extremely low latency
- Good, because simple operations
- Good, because both pub/sub and persistence
- Good, because request-reply built-in
- Good, because wildcard subscriptions
- Good, because low resource usage
- Good, because excellent Go/Python clients
- Bad, because smaller ecosystem
- Bad, because JetStream newer than Kafka
Apache Pulsar
- Good, because unified messaging + streaming
- Good, because multi-tenancy
- Good, because geo-replication
- Bad, because complex architecture
- Bad, because high resource requirements
- Bad, because smaller community
Links
- NATS.io
- JetStream Documentation
- Related: ADR-0004 - Message format