feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.

112
decisions/0003-use-nats-for-messaging.md
Normal file
@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging

* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration

## Context and Problem Statement

The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration

We need a messaging system that handles both ephemeral real-time messages and persistent streams.

## Decision Drivers

* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)

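To make the wildcard-routing driver concrete, here is a minimal sketch of NATS-style subject matching in plain Python. It does not use the NATS client; it only imitates the documented semantics, where `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens. The subject names (`ai.chat.request`, `ai.voice.stream.chunk`) are hypothetical and are not taken from the platform's actual subject hierarchy.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject.

    Illustrative only: "*" matches exactly one token, ">" matches one or
    more remaining tokens and is only valid as the last pattern token.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")

    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" must be last and needs at least one subject token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            # Pattern is longer than the subject.
            return False
        if p != "*" and p != s_tokens[i]:
            # Literal token mismatch.
            return False

    # Without ">", pattern and subject must have the same token count.
    return len(p_tokens) == len(s_tokens)


if __name__ == "__main__":
    # Hypothetical subjects, for illustration only.
    for pattern, subject in [
        ("ai.chat.*", "ai.chat.request"),
        ("ai.chat.*", "ai.chat.user.42"),
        ("ai.>", "ai.voice.stream.chunk"),
    ]:
        print(pattern, subject, subject_matches(pattern, subject))
```

With a scheme like this, one subscriber on `ai.>` could observe all traffic for audit purposes while individual services subscribe to narrower patterns.
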
## Considered Options

* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar

## Decision Outcome

Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.

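The difference between fire-and-forget delivery and a persistent stream can be sketched with a toy in-memory bus. This is not the NATS or JetStream API: `ToyBus`, `add_stream`, and `replay` are hypothetical names invented for this illustration, and the subjects are made up. It only shows the behavior the decision relies on: a message published on a plain subject is lost if nobody is subscribed yet, while a subject backed by a stream can be replayed later.

```python
from typing import Callable, Dict, List, Set

Handler = Callable[[bytes], None]


class ToyBus:
    """In-memory stand-in for a message bus (illustration only)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = {}
        self._persisted: Set[str] = set()
        self._log: Dict[str, List[bytes]] = {}

    def add_stream(self, subject: str) -> None:
        # Mark a subject as persisted, loosely analogous to a stream.
        self._persisted.add(subject)

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._subs.setdefault(subject, []).append(handler)

    def publish(self, subject: str, payload: bytes) -> None:
        if subject in self._persisted:
            # Persisted subjects keep every message for later replay.
            self._log.setdefault(subject, []).append(payload)
        # Live subscribers get the message; if there are none, a
        # non-persisted message is simply dropped.
        for handler in self._subs.get(subject, []):
            handler(payload)

    def replay(self, subject: str) -> List[bytes]:
        # Return a copy of everything stored for this subject.
        return list(self._log.get(subject, []))


if __name__ == "__main__":
    bus = ToyBus()
    bus.add_stream("ai.pipeline.status")        # hypothetical subject

    bus.publish("ai.chat.request", b"hello")    # nobody listening yet
    bus.publish("ai.pipeline.status", b"started")

    late: List[bytes] = []
    bus.subscribe("ai.chat.request", late.append)
    print(late)                                 # late subscriber missed it
    print(bus.replay("ai.pipeline.status"))     # stored message is still there
```

Payloads are `bytes` throughout, which matches the binary-message driver (audio data) listed earlier.
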
### Positive Consequences

* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint

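The built-in request-reply pattern listed above works by having the requester listen on a unique, temporary reply subject and attach that subject to the outgoing message. A minimal in-memory simulation of that mechanism follows. `ToyBroker` and `make_echo_responder` are hypothetical names for this sketch and not part of any NATS client; the real clients handle the inbox bookkeeping internally and work asynchronously over the network.

```python
import itertools
from typing import Callable, Dict, List, Optional

# A handler receives the payload and an optional reply subject.
Handler = Callable[[bytes, Optional[str]], None]


class ToyBroker:
    """Synchronous in-memory broker used to illustrate request-reply."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = {}
        self._inbox_ids = itertools.count(1)

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._subs.setdefault(subject, []).append(handler)

    def unsubscribe(self, subject: str) -> None:
        self._subs.pop(subject, None)

    def publish(self, subject: str, payload: bytes,
                reply_to: Optional[str] = None) -> None:
        for handler in list(self._subs.get(subject, [])):
            handler(payload, reply_to)

    def request(self, subject: str, payload: bytes) -> Optional[bytes]:
        # 1. Create a unique reply subject (an "inbox") and listen on it.
        inbox = f"_INBOX.{next(self._inbox_ids)}"
        replies: List[bytes] = []
        self.subscribe(inbox, lambda data, _reply: replies.append(data))
        # 2. Publish the request, telling responders where to answer.
        self.publish(subject, payload, reply_to=inbox)
        # 3. Clean up and return the first reply, or None if none arrived.
        self.unsubscribe(inbox)
        return replies[0] if replies else None


def make_echo_responder(broker: ToyBroker) -> Handler:
    """Build a responder that echoes the request back to the reply subject."""
    def respond(data: bytes, reply_to: Optional[str]) -> None:
        if reply_to is not None:
            broker.publish(reply_to, b"echo:" + data)
    return respond


if __name__ == "__main__":
    broker = ToyBroker()
    # Hypothetical subject name, for illustration only.
    broker.subscribe("ai.chat.request", make_echo_responder(broker))
    print(broker.request("ai.chat.request", b"ping"))
```

Because replies travel over ordinary subjects, responders need no knowledge of who asked, which keeps services loosely coupled.
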
### Negative Consequences

* Smaller ecosystem than Kafka
* JetStream is less mature than Kafka
* No built-in schema registry
* Smaller community than RabbitMQ

## Pros and Cons of the Options

### Apache Kafka

* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages

### RabbitMQ

* Good, because mature and stable
* Good, because flexible routing
* Good, because good management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering

### Redis Pub/Sub + Streams

* Good, because simple
* Good, because Redis might already be in use
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because messaging is not the primary purpose of Redis

### NATS with JetStream

* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream is newer than Kafka

### Apache Pulsar

* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community

## Links

* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format