Use NATS for AI/ML Messaging
- Status: accepted
- Date: 2025-12-01
- Deciders: Billy Davies
- Technical Story: Selecting message bus for AI service orchestration
Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
Decision Drivers
- Low latency for real-time chat/voice
- Persistence for audit and replay
- Simple operations for homelab
- Support for request-reply pattern
- Wildcard subscriptions for routing
- Binary message support (audio data)
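The wildcard-subscription driver above relies on NATS subject semantics: subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. A minimal pure-Python sketch of that matching rule (just the semantics, not the client library; the subject names are illustrative):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a subscription pattern.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must cover at least one remaining token
            return len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without a trailing '>', token counts must match exactly
    return len(p_tokens) == len(s_tokens)

# subject_matches("chat.*.messages", "chat.room1.messages")  → True
# subject_matches("chat.>", "chat")                          → False
```

This is what lets one subscriber route, say, all of `chat.>` while another listens only to `chat.*.messages`.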
Considered Options
- Apache Kafka
- RabbitMQ
- Redis Pub/Sub + Streams
- NATS with JetStream
- Apache Pulsar
Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
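One reason the combination works well here: NATS request-reply is essentially pub/sub plus a unique reply inbox generated per request. The toy in-process bus below models that pattern in plain asyncio; it is a sketch of the mechanism, not the nats-py client API, and all names (`MiniBus`, `svc.echo`) are illustrative.

```python
import asyncio
import uuid


class MiniBus:
    """Toy in-process bus modelling NATS-style request-reply:
    a request is published on a subject carrying a unique reply
    inbox, and the responder publishes its answer to that inbox."""

    def __init__(self):
        self._subs = {}  # subject -> list of async callbacks

    def subscribe(self, subject, cb):
        self._subs.setdefault(subject, []).append(cb)

    async def publish(self, subject, data, reply=None):
        for cb in list(self._subs.get(subject, [])):
            await cb(subject, reply, data)

    async def request(self, subject, data, timeout=1.0):
        inbox = f"_INBOX.{uuid.uuid4().hex}"  # unique per request
        fut = asyncio.get_running_loop().create_future()

        async def deliver(_subject, _reply, payload):
            if not fut.done():
                fut.set_result(payload)

        self.subscribe(inbox, deliver)
        await self.publish(subject, data, reply=inbox)
        return await asyncio.wait_for(fut, timeout)


async def _demo():
    bus = MiniBus()

    async def echo(_subject, reply, data):
        if reply:  # respond only when the sender expects a reply
            await bus.publish(reply, b"echo:" + data)

    bus.subscribe("svc.echo", echo)
    return await bus.request("svc.echo", b"hi")


result = asyncio.run(_demo())  # → b"echo:hi"
```

Because the reply address travels with the message, any one of several competing responders can answer, which is the property the voice request/response flow depends on.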
Positive Consequences
- Sub-millisecond latency for real-time messages
- JetStream provides persistence when needed
- Simple deployment (single binary)
- Excellent Kubernetes integration
- Request-reply pattern built-in
- Wildcard subscriptions for flexible routing
- Low resource footprint
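To make the persistence point concrete: a file-backed JetStream stream is declared with a small config object, for example via `nats stream add --config`. The stream name, subjects, and limits below are illustrative, not taken from the cluster's actual specs (durations are in nanoseconds; `max_age` here is 7 days):

```json
{
  "name": "CHAT_EVENTS",
  "subjects": ["chat.>", "voice.>"],
  "retention": "limits",
  "storage": "file",
  "max_msgs": -1,
  "max_bytes": 1073741824,
  "max_age": 604800000000000,
  "num_replicas": 1
}
```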
Negative Consequences
- Less ecosystem than Kafka
- JetStream less mature than Kafka Streams
- No built-in schema registry
- Smaller community than RabbitMQ
Pros and Cons of the Options
Apache Kafka
- Good, because industry standard for streaming
- Good, because rich ecosystem (Kafka Streams, Connect)
- Good, because schema registry
- Good, because excellent for high throughput
- Bad, because operationally complex (ZooKeeper/KRaft)
- Bad, because high resource requirements
- Bad, because overkill for homelab scale
- Bad, because higher latency for real-time messages
RabbitMQ
- Good, because mature and stable
- Good, because flexible routing
- Good, because it ships a capable management UI
- Bad, because AMQP protocol overhead
- Bad, because not designed for streaming
- Bad, because more complex clustering
Redis Pub/Sub + Streams
- Good, because simple
- Good, because Redis may already be deployed for other uses
- Good, because low latency
- Bad, because pub/sub not persistent
- Bad, because the Streams API is less intuitive
- Bad, because messaging is not Redis's primary purpose
NATS with JetStream
- Good, because extremely low latency
- Good, because simple operations
- Good, because both pub/sub and persistence
- Good, because request-reply built-in
- Good, because wildcard subscriptions
- Good, because low resource usage
- Good, because excellent Go/Python clients
- Bad, because smaller ecosystem
- Bad, because JetStream newer than Kafka
Apache Pulsar
- Good, because unified messaging + streaming
- Good, because multi-tenancy
- Good, because geo-replication
- Bad, because complex architecture
- Bad, because high resource requirements
- Bad, because smaller community
Links
- NATS.io
- JetStream Documentation
- Related: ADR-0004 - Message format