feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
Commit 832cda34bd (parent 4d4f6f464c), 2026-02-01 14:30:05 -05:00
26 changed files with 3805 additions and 2 deletions

---

**AGENT-ONBOARDING.md** (new file, 191 lines)
# 🤖 Agent Onboarding
> **This is the most important file for AI agents working on this codebase.**
## TL;DR
You are working on a **homelab Kubernetes cluster** running:
- **Talos Linux v1.12.1** on bare-metal nodes
- **Kubernetes v1.35.0** with Flux CD GitOps
- **AI/ML platform** with KServe, Kubeflow, Milvus, NATS
- **Multi-GPU** (AMD ROCm, NVIDIA CUDA, Intel Arc)
## 🗺️ Repository Map
| Repo | What It Contains | When to Edit |
|------|------------------|--------------|
| `homelab-k8s2` | Kubernetes manifests, Talos config, Flux | Infrastructure changes |
| `llm-workflows` | NATS handlers, Argo/KFP workflows | Workflow/handler changes |
| `companions-frontend` | Go server, HTMX UI, VRM avatars | Frontend changes |
| `homelab-design` (this) | Architecture docs, ADRs | Design decisions |
## 🏗️ System Architecture (30-Second Version)
```
┌─────────────────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ Companions WebApp │ Voice WebApp │ Kubeflow UI │ CLI │
└───────────────────────────┬─────────────────────────────────────┘
│ WebSocket/HTTP
┌─────────────────────────────────────────────────────────────────┐
│ NATS MESSAGE BUS │
│ Subjects: ai.chat.*, ai.voice.*, ai.pipeline.* │
│ Format: MessagePack (binary) │
└───────────────────────────┬─────────────────────────────────────┘
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Chat Handler │ │Voice Assistant│ │Pipeline Bridge│
│ (RAG+LLM) │ │ (STT→LLM→TTS) │ │ (KFP/Argo) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AI SERVICES │
│ Whisper │ XTTS │ vLLM │ Milvus │ BGE Embed │ Reranker │
│ STT │ TTS │ LLM │ RAG │ Embed │ Rank │
└─────────────────────────────────────────────────────────────────┘
```
## 📁 Key File Locations
### Infrastructure (`homelab-k8s2`)
```
kubernetes/apps/
├── ai-ml/ # 🧠 AI/ML services
│ ├── kserve/ # InferenceServices
│ ├── kubeflow/ # Pipelines, Training Operator
│ ├── milvus/ # Vector database
│ ├── nats/ # Message bus
│ ├── vllm/ # LLM inference
│ └── llm-workflows/ # GitRepo sync to llm-workflows
├── analytics/ # 📊 Spark, Flink, ClickHouse
├── observability/ # 📈 Grafana, Alloy, OpenTelemetry
└── security/ # 🔒 Vault, Authentik, Falco
talos/
├── talconfig.yaml # Node definitions
├── patches/ # GPU-specific patches
│ ├── amd/amdgpu.yaml
│ └── nvidia/nvidia-runtime.yaml
```
### Workflows (`llm-workflows`)
```
workflows/ # NATS handler deployments
├── chat-handler.yaml
├── voice-assistant.yaml
└── pipeline-bridge.yaml
argo/ # Argo WorkflowTemplates
├── document-ingestion.yaml
├── batch-inference.yaml
└── qlora-training.yaml
pipelines/ # Kubeflow Pipeline Python
├── voice_pipeline.py
└── document_ingestion_pipeline.py
```
## 🔌 Service Endpoints (Internal)
```python
# Copy-paste ready for Python code
NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222"
VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local"
TTS_URL = "http://tts-predictor.ai-ml.svc.cluster.local"
EMBEDDINGS_URL = "http://embeddings-predictor.ai-ml.svc.cluster.local"
RERANKER_URL = "http://reranker-predictor.ai-ml.svc.cluster.local"
MILVUS_HOST = "milvus.ai-ml.svc.cluster.local"
MILVUS_PORT = 19530
VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379"
```
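vLLM serves an OpenAI-compatible API at the `/v1` prefix above, so requests follow that spec. A hedged sketch of building a chat-completion request (the model name is a placeholder, not the actual served model):

```python
import json

VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"

def build_chat_request(prompt: str, stream: bool = True) -> tuple[str, bytes]:
    """Return (url, body) for a vLLM chat-completion call.

    The model name below is a placeholder; query GET {VLLM_URL}/models
    to see what the server actually serves.
    """
    payload = {
        "model": "served-model-name",  # placeholder, not the real model
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return f"{VLLM_URL}/chat/completions", json.dumps(payload).encode()
```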
## 📨 NATS Subject Patterns
```python
# Chat
f"ai.chat.user.{user_id}.message" # User sends message
f"ai.chat.response.{request_id}" # Response back
f"ai.chat.response.stream.{request_id}" # Streaming tokens
# Voice
f"ai.voice.user.{user_id}.request" # Voice input
f"ai.voice.response.{request_id}" # Voice output
# Pipelines
"ai.pipeline.trigger" # Trigger any pipeline
f"ai.pipeline.status.{request_id}" # Status updates
```
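Publishers and handlers should build these subjects through shared helpers rather than ad-hoc f-strings, so both sides always agree; a minimal sketch (function names are illustrative, not from the repos):

```python
def chat_message_subject(user_id: str) -> str:
    """Subject a user publishes chat messages to."""
    return f"ai.chat.user.{user_id}.message"

def chat_stream_subject(request_id: str) -> str:
    """Subject streaming response tokens come back on."""
    return f"ai.chat.response.stream.{request_id}"

def pipeline_status_subject(request_id: str) -> str:
    """Subject pipeline status updates are published to."""
    return f"ai.pipeline.status.{request_id}"
```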
## 🎮 GPU Allocation
| Node | GPU | Workload | Memory |
|------|-----|----------|--------|
| khelben | AMD Strix Halo | vLLM (dedicated) | 64GB unified |
| elminster | NVIDIA RTX 2070 | Whisper + XTTS | 8GB VRAM |
| drizzt | AMD Radeon 680M | BGE Embeddings | 12GB VRAM |
| danilo | Intel Arc | Reranker | 16GB shared |
## ⚡ Common Tasks
### Deploy a New AI Service
1. Create InferenceService in `homelab-k8s2/kubernetes/apps/ai-ml/kserve/`
2. Add endpoint to `llm-workflows/config/ai-services-config.yaml`
3. Push to main → Flux deploys automatically
### Add a New Workflow
1. Create handler in `llm-workflows/chat-handler/` or `llm-workflows/voice-assistant/`
2. Add Kubernetes Deployment in `llm-workflows/workflows/`
3. Push to main → Flux deploys automatically
### Create Architecture Decision
1. Copy `decisions/0000-template.md` to `decisions/NNNN-title.md`
2. Fill in context, decision, consequences
3. Submit PR
## ❌ Antipatterns to Avoid
1. **Don't hardcode secrets** - Use External Secrets Operator
2. **Don't use `latest` tags** - Pin versions for reproducibility
3. **Don't skip ADRs** - Document significant decisions
4. **Don't bypass Flux** - All changes via Git, never `kubectl apply` directly
## 📚 Where to Learn More
- [ARCHITECTURE.md](ARCHITECTURE.md) - Full system design
- [TECH-STACK.md](TECH-STACK.md) - All technologies used
- [decisions/](decisions/) - Why we made certain choices
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities
## 🆘 Quick Debugging
```bash
# Check Flux sync status
flux get all -A
# View NATS JetStream streams
kubectl exec -n ai-ml deploy/nats-box -- nats stream ls
# Check GPU allocation
kubectl describe node khelben | grep -A10 "Allocated"
# View KServe inference services
kubectl get inferenceservices -n ai-ml
# Tail AI service logs
kubectl logs -n ai-ml -l app=chat-handler -f
```
---
*This document is the canonical starting point for AI agents. When in doubt, check the ADRs.*

---

**ARCHITECTURE.md** (new file, 287 lines)
# 🏗️ System Architecture
> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure**
## Overview
The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets.
## System Layers
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Companions WebApp│ │ Voice WebApp │ │ Kubeflow UI │ │
│ │ HTMX + Alpine │ │ Gradio UI │ │ Pipeline Mgmt │ │
│ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │
│ │ WebSocket │ HTTP/WS │ HTTP │
└───────────┴─────────────────────┴─────────────────────┴─────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INGRESS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Cloudflared Tunnel ──► Envoy Gateway ──► HTTPRoute CRDs │
│ │
│ External: *.daviestechlabs.io Internal: *.lab.daviestechlabs.io │
│ • git.daviestechlabs.io • kubeflow.lab.daviestechlabs.io │
│ • auth.daviestechlabs.io • companions-chat.lab... │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ MESSAGE BUS LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ NATS + JetStream │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Streams: │ │
│ │ • COMPANIONS_LOGINS (7d retention) - User analytics │ │
│ │ • COMPANIONS_CHAT (30d retention) - Chat history │ │
│ │ • AI_CHAT_STREAM (5min, memory) - Ephemeral streaming │ │
│ │ • AI_VOICE_STREAM (1h, file) - Voice processing │ │
│ │ • AI_PIPELINE (24h, file) - Workflow triggers │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Message Format: MessagePack (binary, not JSON) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Chat Handler │ │ Voice Assistant │ │ Pipeline Bridge │
├───────────────────┤ ├───────────────────┤ ├───────────────────┤
│ • RAG retrieval │ │ • STT (Whisper) │ │ • KFP triggers │
│ • LLM inference │ │ • RAG retrieval │ │ • Argo triggers │
│ • Streaming resp │ │ • LLM inference │ │ • Status updates │
│ • Session state │ │ • TTS (XTTS) │ │ • Error handling │
└───────────────────┘ └───────────────────┘ └───────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ AI SERVICES LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Whisper │ │ XTTS │ │ vLLM │ │ Milvus │ │ BGE │ │Reranker │ │
│ │ (STT) │ │ (TTS) │ │ (LLM) │ │ (RAG) │ │(Embed) │ │ (BGE) │ │
│ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ ├─────────┤ │
│ │ KServe │ │ KServe │ │ vLLM │ │ Helm │ │ KServe │ │ KServe │ │
│ │ nvidia │ │ nvidia │ │ ROCm │ │ Minio │ │ rdna2 │ │ intel │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKFLOW ENGINE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────────────┐ ┌────────────────────────────┐ │
│ │ Argo Workflows │◄──►│ Kubeflow Pipelines │ │
│ ├────────────────────────────┤ ├────────────────────────────┤ │
│ │ • Complex DAG orchestration│ │ • ML pipeline caching │ │
│ │ • Training workflows │ │ • Experiment tracking │ │
│ │ • Document ingestion │ │ • Model versioning │ │
│ │ • Batch inference │ │ • Artifact lineage │ │
│ └────────────────────────────┘ └────────────────────────────┘ │
│ │
│ Trigger: Argo Events (EventSource → Sensor → Workflow/Pipeline) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Storage: Compute: Security: │
│ ├─ Longhorn (block) ├─ Volcano Scheduler ├─ Vault (secrets) │
│ ├─ NFS CSI (shared) ├─ GPU Device Plugins ├─ Authentik (SSO) │
│ └─ MinIO (S3) │ ├─ AMD ROCm ├─ Falco (runtime) │
│ │ ├─ NVIDIA CUDA └─ SOPS (GitOps) │
│ Databases: │ └─ Intel i915/Arc │
│ ├─ CloudNative-PG └─ Node Feature Discovery │
│ ├─ Valkey (cache) │
│ └─ ClickHouse (analytics) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLATFORM LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI │
│ │
│ Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt, │
│ │ danilo (workers) │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Node Topology
### Control Plane (HA)
| Node | IP | CPU | Memory | Storage | Role |
|------|-------|-----|--------|---------|------|
| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server |
**VIP**: 192.168.100.20 (shared across control plane)
### Worker Nodes
| Node | IP | CPU | GPU | GPU Memory | Workload |
|------|-------|-----|-----|------------|----------|
| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |
## Networking
### External Access
```
Internet → Cloudflare → cloudflared tunnel → Envoy Gateway → Services
```
### DNS Zones
- **External**: `*.daviestechlabs.io` (Cloudflare DNS)
- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon)
### Network CIDRs
| Network | CIDR | Purpose |
|---------|------|---------|
| Node Network | 192.168.100.0/24 | Physical nodes |
| Pod Network | 10.42.0.0/16 | Kubernetes pods |
| Service Network | 10.43.0.0/16 | Kubernetes services |
## Data Flow: Chat Request
```mermaid
sequenceDiagram
participant U as User
participant W as WebApp
participant N as NATS
participant C as Chat Handler
participant M as Milvus
participant L as vLLM
participant V as Valkey
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver to chat-handler
C->>V: Get session history
C->>M: RAG query (if enabled)
M-->>C: Relevant documents
C->>L: LLM inference (with context)
L-->>C: Streaming tokens
C->>N: Publish ai.chat.response.stream.{id}
N-->>W: Deliver streaming chunks
W-->>U: Display tokens
C->>V: Save to session
```
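The sequence above can be sketched end to end with the session store, vector search, and LLM stubbed as in-memory fakes (all names here are illustrative; the real handler lives in `llm-workflows/chat-handler/`):

```python
import asyncio

class FakeSession:  # stands in for Valkey
    def __init__(self):
        self.store = {}
    async def get(self, uid):
        return self.store.get(uid, [])
    async def append(self, uid, user_msg, reply):
        self.store.setdefault(uid, []).extend([user_msg, reply])

class FakeRAG:  # stands in for Milvus + embeddings
    async def search(self, query):
        return ["doc about " + query]

class FakeLLM:  # stands in for vLLM streaming
    async def stream(self, history, docs, prompt):
        for tok in ["hello", " ", "world"]:
            yield tok

async def handle_chat(msg, session, rag, llm):
    history = await session.get(msg["user_id"])                 # session history
    docs = await rag.search(msg["message"]) if msg.get("enable_rag") else []
    tokens = [t async for t in llm.stream(history, docs, msg["message"])]
    await session.append(msg["user_id"], msg["message"], "".join(tokens))
    return tokens

session = FakeSession()
tokens = asyncio.run(handle_chat(
    {"user_id": "u1", "message": "hi", "enable_rag": True},
    session, FakeRAG(), FakeLLM(),
))
```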
## GitOps Flow
```
Developer → Git Push → GitHub/Gitea
┌─────────────┐
│ Flux CD │
│ (reconcile) │
└──────┬──────┘
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│homelab- │ │ llm- │ │ helm │
│ k8s2 │ │workflows │ │ charts │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└──────────────┴──────────────┘
┌─────────────┐
│ Kubernetes │
│ Cluster │
└─────────────┘
```
## Security Architecture
### Secrets Management
```
External Secrets Operator ──► Vault / SOPS ──► Kubernetes Secrets
```
### Authentication
```
User ──► Cloudflare Access ──► Authentik ──► Application
└──► OIDC/SAML providers
```
### Network Security
- **Cilium**: Network policies, eBPF-based security
- **Falco**: Runtime security monitoring
- **RBAC**: Fine-grained Kubernetes permissions
## High Availability
### Control Plane
- 3-node etcd cluster with automatic leader election
- Virtual IP (192.168.100.20) for API server access
- Automatic failover via Talos
### Workloads
- Pod anti-affinity for critical services
- HPA for auto-scaling
- PodDisruptionBudgets for controlled updates
### Storage
- Longhorn 3-replica default
- MinIO erasure coding for S3
- Regular Velero backups
## Observability
### Metrics Pipeline
```
Applications ──► OpenTelemetry Collector ──► Prometheus ──► Grafana
```
### Logging Pipeline
```
Applications ──► Grafana Alloy ──► Loki ──► Grafana
```
### Tracing Pipeline
```
Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana
```
## Key Design Decisions
| Decision | Rationale | ADR |
|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
## Related Documents
- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships
- [decisions/](decisions/) - All architecture decisions

---

**CODING-CONVENTIONS.md** (new file, 424 lines)
# 📐 Coding Conventions
> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories**
## Repository Conventions
### homelab-k8s2 (Infrastructure)
```
kubernetes/
├── apps/ # Application deployments
│ └── {namespace}/ # One folder per namespace
│ └── {app}/ # One folder per application
│ ├── app/ # Kubernetes manifests
│ │ ├── kustomization.yaml
│ │ ├── helmrelease.yaml # OR individual manifests
│ │ └── ...
│ └── ks.yaml # Flux Kustomization
├── components/ # Reusable Kustomize components
└── flux/ # Flux system configuration
```
**Naming Conventions:**
- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`)
- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`)
- Secrets: `{app}-{type}` (e.g., `milvus-credentials`)
### llm-workflows (Orchestration)
```
workflows/ # Kubernetes Deployments for NATS handlers
├── {handler}.yaml # One file per handler
argo/ # Argo WorkflowTemplates
├── {workflow-name}.yaml # One file per workflow
pipelines/ # Kubeflow Pipeline Python files
├── {pipeline}_pipeline.py # Pipeline definition
└── kfp-sync-job.yaml # Upload job
{handler}/ # Python source code
├── __init__.py
├── {handler}.py # Main entry point
├── requirements.txt
└── Dockerfile
```
---
## Python Conventions
### Project Structure
```python
from dataclasses import dataclass
from nats.aio.msg import Msg

# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

# Use dataclasses for structured data
@dataclass
class ChatRequest:
    user_id: str
    message: str
    enable_rag: bool = True

# Use msgpack for NATS messages
import msgpack

data = msgpack.packb({"key": "value"})
```
### Naming
| Element | Convention | Example |
|---------|------------|---------|
| Files | snake_case | `chat_handler.py` |
| Classes | PascalCase | `ChatHandler` |
| Functions | snake_case | `process_message` |
| Constants | UPPER_SNAKE | `NATS_URL` |
| Private | Leading underscore | `_internal_method` |
### Type Hints
```python
# Always use type hints
from typing import Any, Dict, List

async def query_rag(
    query: str,
    collection: str = "knowledge_base",
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    ...
```
### Error Handling
```python
# Use specific exceptions
class RAGQueryError(Exception):
    """Raised when RAG query fails."""

# Log errors with context
import logging

logger = logging.getLogger(__name__)

try:
    result = await milvus.search(...)
except Exception as e:
    logger.error(f"RAG query failed: {e}", extra={"query": query})
    raise RAGQueryError(f"Failed to query collection {collection}") from e
```
### NATS Message Handling
```python
import logging

import msgpack
from nats.aio.msg import Msg

logger = logging.getLogger(__name__)

async def message_handler(msg: Msg) -> None:
    try:
        # Decode MessagePack
        data = msgpack.unpackb(msg.data, raw=False)
        # Process
        result = await process(data)
        # Reply if request-reply pattern
        if msg.reply:
            await msg.respond(msgpack.packb(result))
        # Acknowledge for JetStream
        await msg.ack()
    except Exception as e:
        logger.error(f"Handler error: {e}")
        # NAK for retry (JetStream)
        await msg.nak()
```
---
## Kubernetes Manifest Conventions
### Labels
```yaml
metadata:
  labels:
    # Required
    app.kubernetes.io/name: chat-handler
    app.kubernetes.io/instance: chat-handler
    app.kubernetes.io/component: handler
    app.kubernetes.io/part-of: ai-platform
    # Optional
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/managed-by: flux
```
### Annotations
```yaml
metadata:
  annotations:
    # Reloader for config changes
    reloader.stakater.com/auto: "true"
    # Documentation
    description: "Handles chat messages via NATS"
```
### Resource Requests
```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# GPU workloads
resources:
  limits:
    amd.com/gpu: 1     # AMD
    nvidia.com/gpu: 1  # NVIDIA
```
### Health Checks
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```
---
## Flux/GitOps Conventions
### Kustomization Structure
```yaml
# ks.yaml - Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: &app chat-handler
  namespace: flux-system
spec:
  targetNamespace: ai-ml
  commonMetadata:
    labels:
      app.kubernetes.io/name: *app
  path: ./kubernetes/apps/ai-ml/chat-handler/app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true
  interval: 30m
  retryInterval: 1m
  timeout: 5m
```
### HelmRelease Structure
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: milvus
spec:
  interval: 30m
  chart:
    spec:
      chart: milvus
      version: 4.x.x
      sourceRef:
        kind: HelmRepository
        name: milvus
        namespace: flux-system
  values:
    # Values here
```
### Secret References
```yaml
# Never hardcode secrets
env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
```
---
## NATS Subject Conventions
### Hierarchy
```
ai.{domain}.{scope}.{action}
Examples:
ai.chat.user.{userId}.message # User chat message
ai.chat.response.{requestId} # Chat response
ai.voice.user.{userId}.request # Voice request
ai.pipeline.trigger # Pipeline trigger
```
### Wildcards
```
ai.chat.> # All chat events
ai.chat.user.*.message # All user messages
ai.*.response.{id} # Any response type
```
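NATS matches `*` against exactly one token and `>` against one or more trailing tokens. A small matcher makes the semantics concrete (for illustration only; the NATS server does this for you):

```python
def nats_match(pattern: str, subject: str) -> bool:
    """Token-wise NATS matching: '*' = exactly one token, '>' = rest."""
    pt, st = pattern.split("."), subject.split(".")
    for i, tok in enumerate(pt):
        if tok == ">":
            return len(st) > i  # '>' must cover at least one token
        if i >= len(st) or (tok != "*" and tok != st[i]):
            return False
    return len(pt) == len(st)
```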
---
## Git Conventions
### Commit Messages
```
type(scope): subject
body (optional)
footer (optional)
```
**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `style`: Formatting
- `refactor`: Code restructuring
- `test`: Tests
- `chore`: Maintenance
**Examples:**
```
feat(chat-handler): add streaming response support
fix(voice): handle empty audio gracefully
docs(adr): add decision for MessagePack format
```
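A commit subject following this convention can be checked mechanically, e.g. in a pre-commit hook; a sketch (type list taken from above, the regex itself is illustrative):

```python
import re

# type, optional (scope), then ": subject"
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)"
    r"(\([a-z0-9-]+\))?"
    r": .+"
)

def valid_subject(line: str) -> bool:
    return COMMIT_RE.match(line) is not None
```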
### Branch Naming
```
feature/short-description
fix/issue-number-description
docs/what-changed
```
---
## Configuration Conventions
### Environment Variables
```python
# Use pydantic-settings or similar
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="")  # no prefix

    nats_url: str = "nats://localhost:4222"
    vllm_url: str = "http://localhost:8000"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    log_level: str = "INFO"
```
### ConfigMaps
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-services-config
data:
  NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222"
  VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
  # ... other non-sensitive config
```
---
## Documentation Conventions
### ADR Format
See [decisions/0000-template.md](decisions/0000-template.md)
### Code Comments
```python
# Use docstrings for public functions
async def query_rag(query: str) -> List[Dict]:
    """Query the RAG system for relevant documents.

    Args:
        query: The search query string

    Returns:
        List of document chunks with scores

    Raises:
        RAGQueryError: If the query fails
    """
    ...
```
### README Files
Each application should have a README with:
1. Purpose
2. Configuration
3. Deployment
4. Local development
5. API documentation (if applicable)
---
## Anti-Patterns to Avoid
| Don't | Do Instead |
|-------|------------|
| `kubectl apply` directly | Commit to Git, let Flux deploy |
| Hardcode secrets | Use External Secrets Operator |
| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
| Use JSON for NATS messages | Use MessagePack (binary) |
| Synchronous I/O in handlers | Use async/await |
---
## Related Documents
- [TECH-STACK.md](TECH-STACK.md) - Technologies used
- [ARCHITECTURE.md](ARCHITECTURE.md) - System design
- [decisions/](decisions/) - Why we made certain choices

---

**CONTAINER-DIAGRAM.mmd** (new file, 123 lines)
%% C4 Container Diagram - Level 2
%% DaviesTechLabs Homelab AI/ML Platform
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
graph TB
subgraph users["Users"]
user["👤 User"]
end
subgraph ingress["Ingress Layer"]
cloudflared["cloudflared<br/>(Tunnel)"]
envoy["Envoy Gateway<br/>(HTTPRoute)"]
end
subgraph frontends["Frontend Applications"]
companions["Companions WebApp<br/>[Go + HTMX]<br/>AI Chat Interface"]
voice["Voice WebApp<br/>[Gradio]<br/>Voice Assistant UI"]
kubeflow_ui["Kubeflow UI<br/>[React]<br/>Pipeline Management"]
end
subgraph messaging["Message Bus"]
nats["NATS<br/>[JetStream]<br/>Event Streaming"]
end
subgraph handlers["NATS Handlers"]
chat_handler["Chat Handler<br/>[Python]<br/>RAG + LLM Orchestration"]
voice_handler["Voice Assistant<br/>[Python]<br/>STT → LLM → TTS"]
pipeline_bridge["Pipeline Bridge<br/>[Python]<br/>Workflow Triggers"]
end
subgraph ai_services["AI Services (KServe)"]
whisper["Whisper<br/>[faster-whisper]<br/>Speech-to-Text"]
xtts["XTTS<br/>[Coqui]<br/>Text-to-Speech"]
vllm["vLLM<br/>[ROCm]<br/>LLM Inference"]
embeddings["BGE Embeddings<br/>[sentence-transformers]<br/>Vector Encoding"]
reranker["BGE Reranker<br/>[sentence-transformers]<br/>Document Ranking"]
end
subgraph storage["Data Stores"]
milvus["Milvus<br/>[Vector DB]<br/>RAG Storage"]
valkey["Valkey<br/>[Redis API]<br/>Session Cache"]
postgres["CloudNative-PG<br/>[PostgreSQL]<br/>Metadata"]
minio["MinIO<br/>[S3 API]<br/>Object Storage"]
end
subgraph workflows["Workflow Engines"]
argo["Argo Workflows<br/>[DAG Engine]<br/>Complex Pipelines"]
kfp["Kubeflow Pipelines<br/>[ML Platform]<br/>Training + Inference"]
argo_events["Argo Events<br/>[Event Source]<br/>NATS → Workflow"]
end
subgraph mlops["MLOps"]
mlflow["MLflow<br/>[Tracking Server]<br/>Experiment Tracking"]
volcano["Volcano<br/>[Scheduler]<br/>GPU Scheduling"]
end
%% User flow
user --> cloudflared
cloudflared --> envoy
envoy --> companions
envoy --> voice
envoy --> kubeflow_ui
%% Frontend to NATS
companions --> |WebSocket| nats
voice --> |HTTP/WS| nats
%% NATS to handlers
nats --> chat_handler
nats --> voice_handler
nats --> pipeline_bridge
%% Handlers to AI services
chat_handler --> embeddings
chat_handler --> reranker
chat_handler --> vllm
chat_handler --> milvus
chat_handler --> valkey
voice_handler --> whisper
voice_handler --> embeddings
voice_handler --> reranker
voice_handler --> vllm
voice_handler --> xtts
%% Pipeline flow
pipeline_bridge --> argo_events
argo_events --> argo
argo_events --> kfp
kubeflow_ui --> kfp
%% Workflow to AI
argo --> ai_services
kfp --> ai_services
kfp --> mlflow
%% Storage connections
ai_services --> minio
milvus --> minio
kfp --> postgres
mlflow --> postgres
mlflow --> minio
%% GPU scheduling
volcano -.-> vllm
volcano -.-> whisper
volcano -.-> xtts
%% Styling
classDef frontend fill:#90EE90,stroke:#333
classDef handler fill:#87CEEB,stroke:#333
classDef ai fill:#FFB6C1,stroke:#333
classDef storage fill:#DDA0DD,stroke:#333
classDef workflow fill:#F0E68C,stroke:#333
classDef messaging fill:#FFA500,stroke:#333
class companions,voice,kubeflow_ui frontend
class chat_handler,voice_handler,pipeline_bridge handler
class whisper,xtts,vllm,embeddings,reranker ai
class milvus,valkey,postgres,minio storage
class argo,kfp,argo_events,mlflow,volcano workflow
class nats messaging

---

**CONTEXT-DIAGRAM.mmd** (new file, 69 lines)
%% C4 Context Diagram - Level 1
%% DaviesTechLabs Homelab System Context
%%
%% To render: Use Mermaid Live Editor or VS Code Mermaid extension
graph TB
subgraph users["External Users"]
dev["👤 Developer<br/>(Billy)"]
family["👥 Family Members"]
agents["🤖 AI Agents"]
end
subgraph external["External Systems"]
cf["☁️ Cloudflare<br/>DNS + Tunnel"]
gh["🐙 GitHub<br/>Source Code"]
ghcr["📦 GHCR<br/>Container Registry"]
hf["🤗 Hugging Face<br/>Model Registry"]
end
subgraph homelab["🏠 DaviesTechLabs Homelab"]
direction TB
subgraph apps["Application Layer"]
companions["💬 Companions<br/>AI Chat"]
voice["🎤 Voice Assistant"]
media["🎬 Media Services<br/>(Jellyfin, *arr)"]
productivity["📝 Productivity<br/>(Nextcloud, Gitea)"]
end
subgraph platform["Platform Layer"]
k8s["☸️ Kubernetes Cluster<br/>Talos Linux"]
end
subgraph ai["AI/ML Layer"]
inference["🧠 Inference Services<br/>(vLLM, Whisper, XTTS)"]
workflows["⚙️ Workflow Engines<br/>(Kubeflow, Argo)"]
vectordb["📚 Vector Store<br/>(Milvus)"]
end
end
%% User interactions
dev --> |manages| productivity
dev --> |develops| k8s
family --> |uses| media
family --> |chats| companions
agents --> |queries| inference
%% External integrations
cf --> |routes traffic| apps
gh --> |GitOps sync| k8s
ghcr --> |pulls images| k8s
hf --> |downloads models| inference
%% Internal relationships
apps --> platform
ai --> platform
companions --> inference
voice --> inference
workflows --> inference
inference --> vectordb
%% Styling
classDef external fill:#f9f,stroke:#333,stroke-width:2px
classDef homelab fill:#bbf,stroke:#333,stroke-width:2px
classDef user fill:#bfb,stroke:#333,stroke-width:2px
class cf,gh,ghcr,hf external
class companions,voice,media,productivity,k8s,inference,workflows,vectordb homelab
class dev,family,agents user

---

**DOMAIN-MODEL.md** (new file, 345 lines)
# 📊 Domain Model
> **Core entities, bounded contexts, and relationships in the DaviesTechLabs homelab**
## Bounded Contexts
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ BOUNDED CONTEXTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ CHAT CONTEXT │ │ VOICE CONTEXT │ │ WORKFLOW CONTEXT │ │
│ ├───────────────────┤ ├───────────────────┤ ├───────────────────┤ │
│ │ • ChatSession │ │ • VoiceSession │ │ • Pipeline │ │
│ │ • ChatMessage │ │ • AudioChunk │ │ • PipelineRun │ │
│ │ • Conversation │ │ • Transcription │ │ • Artifact │ │
│ │ • User │ │ • SynthesizedAudio│ │ • Experiment │ │
│ └─────────┬─────────┘ └─────────┬─────────┘ └─────────┬─────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ INFERENCE CONTEXT │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ • InferenceRequest • Model • Embedding • Document • Chunk │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Core Entities
### User Context
```yaml
User:
  id: string (UUID)
  username: string
  premium: boolean
  preferences:
    voice_id: string
    model_preference: string
    enable_rag: boolean
  created_at: timestamp

Session:
  id: string (UUID)
  user_id: string
  type: "chat" | "voice"
  started_at: timestamp
  last_activity: timestamp
  metadata: object
```
### Chat Context
```yaml
ChatMessage:
  id: string (UUID)
  session_id: string
  user_id: string
  role: "user" | "assistant" | "system"
  content: string
  created_at: timestamp
  metadata:
    tokens_used: integer
    latency_ms: float
    rag_sources: string[]
    model_used: string

Conversation:
  id: string (UUID)
  user_id: string
  messages: ChatMessage[]
  title: string (auto-generated)
  created_at: timestamp
  updated_at: timestamp
```
### Voice Context
```yaml
VoiceRequest:
  id: string (UUID)
  user_id: string
  audio_b64: string (base64)
  format: "wav" | "webm" | "mp3"
  language: string
  premium: boolean
  enable_rag: boolean

VoiceResponse:
  id: string (UUID)
  request_id: string
  transcription: string
  response_text: string
  audio_b64: string (base64)
  audio_format: string
  latency_ms: float
  rag_docs_used: integer
```
### Inference Context
```yaml
InferenceRequest:
  id: string (UUID)
  service: "llm" | "stt" | "tts" | "embeddings" | "reranker"
  input: string | bytes
  parameters: object
  priority: "standard" | "premium"

InferenceResponse:
  id: string (UUID)
  request_id: string
  output: string | bytes | float[]
  metadata:
    model: string
    latency_ms: float
    tokens: integer (if applicable)
```
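A handler consuming these requests typically branches on the `service` discriminator. A minimal dispatch sketch follows; the registry and handler bodies are illustrative stand-ins, not the actual llm-workflows code:

```python
# Minimal routing sketch for InferenceRequest: dispatch on the `service`
# discriminator. The HANDLERS registry and its lambdas are hypothetical
# stand-ins for the real inference backends.
from dataclasses import dataclass, field

@dataclass
class InferenceRequest:
    id: str
    service: str                   # "llm" | "stt" | "tts" | "embeddings" | "reranker"
    input: object                  # str or bytes
    parameters: dict = field(default_factory=dict)
    priority: str = "standard"     # "standard" | "premium"

HANDLERS = {
    "llm": lambda req: f"completion for: {req.input}",
    "embeddings": lambda req: [0.0] * 1024,   # BGE-large returns 1024-dim vectors
}

def dispatch(req: InferenceRequest):
    try:
        return HANDLERS[req.service](req)
    except KeyError:
        raise ValueError(f"unknown service: {req.service}") from None

print(dispatch(InferenceRequest(id="r1", service="llm", input="hello")))
```

The same shape works whether the handlers call vLLM, Whisper, or the embedding service; only the registry entries change.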
### RAG Context
```yaml
Document:
id: string (UUID)
collection: string
title: string
content: string
source_url: string
ingested_at: timestamp
Chunk:
id: string (UUID)
document_id: string
content: string
embedding: float[1024] # BGE-large dimensions
metadata:
position: integer
page: integer
RAGQuery:
query: string
collection: string
top_k: integer (default: 5)
rerank: boolean (default: true)
rerank_top_k: integer (default: 3)
RAGResult:
chunks: Chunk[]
scores: float[]
reranked: boolean
```
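The query defaults above imply a two-stage flow: retrieve `top_k` candidates from the vector store, then rerank them and keep `rerank_top_k`. A stand-in sketch, where simple keyword overlap substitutes for both Milvus vector search and the BGE reranker:

```python
# Two-stage RAG query sketch: retrieve top_k, then rerank down to
# rerank_top_k. The scoring functions are stand-ins, not real
# Milvus/BGE calls.
def rag_query(query, chunks, top_k=5, rerank=True, rerank_top_k=3,
              retrieve_score=None, rerank_score=None):
    overlap = lambda q, c: len(set(q.split()) & set(c.split()))
    retrieve_score = retrieve_score or overlap
    rerank_score = rerank_score or overlap
    # Stage 1: vector retrieval (stand-in: keyword overlap)
    candidates = sorted(chunks, key=lambda c: retrieve_score(query, c),
                        reverse=True)[:top_k]
    if not rerank:
        return candidates
    # Stage 2: reranking narrows top_k down to rerank_top_k
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:rerank_top_k]

docs = ["nats jetstream streams", "gpu scheduling with volcano",
        "rag with milvus vector search", "talos linux nodes",
        "vector database milvus", "flux gitops"]
print(rag_query("milvus vector search", docs))
```

Swapping the stand-in scorers for a Milvus search and a cross-encoder rerank keeps the control flow identical.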
### Workflow Context
```yaml
Pipeline:
id: string
name: string
version: string
engine: "kubeflow" | "argo"
definition: object (YAML)
PipelineRun:
id: string (UUID)
pipeline_id: string
status: "pending" | "running" | "succeeded" | "failed"
started_at: timestamp
completed_at: timestamp
parameters: object
artifacts: Artifact[]
Artifact:
id: string (UUID)
run_id: string
name: string
type: "model" | "dataset" | "metrics" | "logs"
uri: string (s3://)
metadata: object
Experiment:
id: string (UUID)
name: string
runs: PipelineRun[]
metrics: object
created_at: timestamp
```
---
## Entity Relationships
```mermaid
erDiagram
USER ||--o{ SESSION : has
USER ||--o{ CONVERSATION : owns
SESSION ||--o{ CHAT_MESSAGE : contains
CONVERSATION ||--o{ CHAT_MESSAGE : contains
USER ||--o{ VOICE_REQUEST : makes
VOICE_REQUEST ||--|| VOICE_RESPONSE : produces
DOCUMENT ||--o{ CHUNK : contains
CHUNK ||--|| EMBEDDING : has
PIPELINE ||--o{ PIPELINE_RUN : executed_as
PIPELINE_RUN ||--o{ ARTIFACT : produces
EXPERIMENT ||--o{ PIPELINE_RUN : tracks
INFERENCE_REQUEST ||--|| INFERENCE_RESPONSE : produces
```
---
## Aggregate Roots
| Aggregate | Root Entity | Child Entities |
|-----------|-------------|----------------|
| Chat | Conversation | ChatMessage |
| Voice | VoiceRequest | VoiceResponse |
| RAG | Document | Chunk, Embedding |
| Workflow | PipelineRun | Artifact |
| User | User | Session, Preferences |
---
## Event Flow
### Chat Event Stream
```
UserLogin
└─► SessionCreated
└─► MessageReceived
├─► RAGQueryExecuted (optional)
├─► InferenceRequested
└─► ResponseGenerated
└─► MessageStored
```
### Voice Event Stream
```
VoiceRequestReceived
└─► TranscriptionStarted
└─► TranscriptionCompleted
└─► RAGQueryExecuted (optional)
└─► LLMInferenceStarted
└─► LLMResponseGenerated
└─► TTSSynthesisStarted
└─► AudioResponseReady
```
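The chain above, collapsed into one synchronous function for illustration. Every stage is a hypothetical stand-in for the real service (Whisper STT, Milvus retrieval, vLLM, XTTS); only the event names are taken from the stream:

```python
# Voice pipeline sketch emitting the events from the stream above.
# All stage bodies are stand-ins for the real inference services.
def handle_voice_request(audio_b64: str, enable_rag: bool = False) -> dict:
    events = []
    emit = events.append

    emit("VoiceRequestReceived")
    emit("TranscriptionStarted")
    transcription = f"transcript ({len(audio_b64)} b64 chars)"   # Whisper stand-in
    emit("TranscriptionCompleted")
    context = []
    if enable_rag:
        context = ["retrieved chunk"]                            # Milvus + reranker stand-in
        emit("RAGQueryExecuted")
    emit("LLMInferenceStarted")
    response_text = f"reply to: {transcription}"                 # vLLM stand-in
    emit("LLMResponseGenerated")
    emit("TTSSynthesisStarted")
    audio_out = "c3R1Yg=="                                       # XTTS stand-in (base64)
    emit("AudioResponseReady")
    return {"transcription": transcription, "response_text": response_text,
            "audio_b64": audio_out, "rag_docs_used": len(context),
            "events": events}

result = handle_voice_request("ZmFrZQ==", enable_rag=True)
print(result["events"])
```

In the real system each stage is a separate NATS handler, so the "events" here correspond to subjects rather than list appends.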
### Workflow Event Stream
```
PipelineTriggerReceived
└─► PipelineRunCreated
└─► StepStarted (repeated)
└─► StepCompleted (repeated)
└─► ArtifactProduced (repeated)
└─► PipelineRunCompleted
```
---
## Data Retention
| Entity | Retention | Storage |
|--------|-----------|---------|
| ChatMessage | 30 days | JetStream → PostgreSQL |
| VoiceRequest/Response | 1 hour (audio), 30 days (text) | JetStream → PostgreSQL |
| Chunk/Embedding | Permanent | Milvus |
| PipelineRun | Permanent | PostgreSQL |
| Artifact | Permanent | MinIO |
| Session | 7 days | Valkey |
---
## Invariants
### Chat Context
- A ChatMessage must belong to exactly one Conversation
- A Conversation must have at least one ChatMessage
- Messages are immutable once created
### Voice Context
- VoiceResponse must have corresponding VoiceRequest
- Audio format must be one of: wav, webm, mp3
- Transcription cannot be empty for valid audio
### RAG Context
- Chunk must belong to exactly one Document
- Embedding dimensions must match model (1024 for BGE-large)
- Document must have at least one Chunk
### Workflow Context
- PipelineRun must reference valid Pipeline
- Artifacts must have valid S3 URIs
- Run status transitions: pending → running → (succeeded|failed)
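The status-transition invariant can be enforced with a small table-driven check; a minimal sketch (not taken from the repositories):

```python
# Table-driven enforcement of the PipelineRun status invariant:
# pending -> running -> (succeeded | failed), with terminal states.
VALID_TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),   # terminal
    "failed": set(),      # terminal
}

def transition(current: str, new: str) -> str:
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new

status = transition("pending", "running")
status = transition(status, "succeeded")
print(status)  # succeeded
```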
---
## Value Objects
```python
# Immutable value objects
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageContent:
text: str
tokens: int
@dataclass(frozen=True)
class AudioData:
data: bytes
format: str
duration_ms: int
sample_rate: int
@dataclass(frozen=True)
class EmbeddingVector:
values: tuple[float, ...]
model: str
dimensions: int
@dataclass(frozen=True)
class RAGContext:
chunks: tuple[str, ...]
scores: tuple[float, ...]
query: str
```
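A quick usage note on why `frozen=True` matters: mutation attempts raise `dataclasses.FrozenInstanceError`, so a value object cannot be altered after it crosses a handler boundary. A self-contained sketch (`MessageContent` is redefined here so the snippet runs on its own):

```python
# Frozen dataclasses reject mutation, which is what makes these value
# objects safe to pass between handlers.
import dataclasses

@dataclasses.dataclass(frozen=True)
class MessageContent:
    text: str
    tokens: int

msg = MessageContent(text="hello", tokens=2)
try:
    msg.text = "tampered"
except dataclasses.FrozenInstanceError:
    print("immutable")  # mutation is rejected, msg is unchanged
```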
---
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
- [GLOSSARY.md](GLOSSARY.md) - Term definitions
- [decisions/0004-use-messagepack-for-nats.md](decisions/0004-use-messagepack-for-nats.md) - Message format decision

GLOSSARY.md Normal file
@@ -0,0 +1,242 @@
# 📖 Glossary
> **Terminology and abbreviations used in the DaviesTechLabs homelab**
## A
**ADR (Architecture Decision Record)**
: A document that captures an important architectural decision, including context, decision, and consequences.
**Argo Events**
: Event-driven automation for Kubernetes that triggers workflows based on events from various sources.
**Argo Workflows**
: A container-native workflow engine for orchestrating parallel jobs on Kubernetes.
**Authentik**
: Self-hosted identity provider supporting SAML, OIDC, and other protocols.
## B
**BGE (BAAI General Embedding)**
: A family of embedding models from BAAI used for semantic search and RAG.
**Bounded Context**
: A DDD concept defining a boundary within which a particular domain model applies.
## C
**C4 Model**
: A hierarchical approach to software architecture diagrams: Context, Container, Component, Code.
**Cilium**
: eBPF-based networking, security, and observability for Kubernetes.
**CloudNative-PG**
: Kubernetes operator for PostgreSQL databases.
**CNI (Container Network Interface)**
: Standard for configuring network interfaces in Linux containers.
## D
**DDD (Domain-Driven Design)**
: Software design approach focusing on the core domain and domain logic.
## E
**Embedding**
: A vector representation of text, used for semantic similarity and search.
**Envoy Gateway**
: Kubernetes Gateway API implementation using Envoy proxy.
**External Secrets Operator (ESO)**
: Kubernetes operator that syncs secrets from external stores (Vault, etc.).
## F
**Falco**
: Runtime security tool that detects anomalous activity in containers.
**Flux CD**
: GitOps toolkit for Kubernetes, continuously reconciling cluster state with Git.
## G
**GitOps**
: Operational practice using Git as the single source of truth for declarative infrastructure.
**GPU Device Plugin**
: Kubernetes plugin that exposes GPU resources to containers.
## H
**HelmRelease**
: Flux CRD for managing Helm chart releases declaratively.
**HTTPRoute**
: Kubernetes Gateway API resource for HTTP routing rules.
## I
**InferenceService**
: KServe CRD for deploying ML models with autoscaling and traffic management.
## J
**JetStream**
: NATS persistence layer providing streaming, key-value, and object stores.
## K
**KServe**
: Kubernetes-native platform for deploying and serving ML models.
**Kubeflow**
: ML toolkit for Kubernetes, including pipelines, training operators, and more.
**Kustomization**
: Flux CRD for applying Kustomize overlays from Git sources.
## L
**LLM (Large Language Model)**
: AI model trained on vast text data, capable of generating human-like text.
**Longhorn**
: Cloud-native distributed storage for Kubernetes.
## M
**MessagePack (msgpack)**
: Binary serialization format, more compact than JSON.
**Milvus**
: Open-source vector database for similarity search and AI applications.
**MLflow**
: Platform for managing the ML lifecycle: experiments, models, deployment.
**MinIO**
: S3-compatible object storage.
## N
**NATS**
: Cloud-native messaging system for microservices, IoT, and serverless.
**Node Feature Discovery (NFD)**
: Kubernetes add-on for detecting hardware features on nodes.
## P
**Pipeline**
: In ML context, a DAG of components that process data and train/serve models.
**Premium User**
: User tier with enhanced features (more RAG docs, priority routing).
## R
**RAG (Retrieval-Augmented Generation)**
: AI technique combining document retrieval with LLM generation for grounded responses.
**Reranker**
: Model that rescores retrieved documents based on relevance to a query.
**ROCm**
: AMD's open-source GPU computing platform (alternative to CUDA).
## S
**Schematic**
: Talos Linux concept for defining system extensions and configurations.
**SOPS (Secrets OPerationS)**
: Tool for encrypting secrets in Git repositories.
**STT (Speech-to-Text)**
: Converting spoken audio to text (e.g., Whisper).
**Strix Halo**
: AMD APU platform (Ryzen AI Max series) whose unified memory lets the GPU address a large share of system RAM.
## T
**Talos Linux**
: Minimal, immutable Linux distribution designed specifically for Kubernetes.
**TTS (Text-to-Speech)**
: Converting text to spoken audio (e.g., XTTS/Coqui).
## V
**Valkey**
: Redis-compatible in-memory data store (Redis fork).
**vLLM**
: High-throughput LLM serving engine with PagedAttention.
**VIP (Virtual IP)**
: IP address shared among multiple hosts for high availability.
**Volcano**
: Kubernetes batch scheduler for high-performance workloads (ML, HPC).
**VRM**
: File format for 3D humanoid avatars.
## W
**Whisper**
: OpenAI's speech recognition model.
## X
**XTTS**
: Coqui's multi-language text-to-speech model with voice cloning.
---
## Acronyms Quick Reference
| Acronym | Full Form |
|---------|-----------|
| ADR | Architecture Decision Record |
| API | Application Programming Interface |
| BGE | BAAI General Embedding |
| CI/CD | Continuous Integration/Continuous Deployment |
| CRD | Custom Resource Definition |
| DAG | Directed Acyclic Graph |
| DDD | Domain-Driven Design |
| ESO | External Secrets Operator |
| GPU | Graphics Processing Unit |
| HA | High Availability |
| HPA | Horizontal Pod Autoscaler |
| LLM | Large Language Model |
| ML | Machine Learning |
| NATS | Neural Autonomic Transport System (historical; now used as a name) |
| NFD | Node Feature Discovery |
| OIDC | OpenID Connect |
| RAG | Retrieval-Augmented Generation |
| RBAC | Role-Based Access Control |
| ROCm | Radeon Open Compute |
| S3 | Simple Storage Service |
| SAML | Security Assertion Markup Language |
| SOPS | Secrets OPerationS |
| SSO | Single Sign-On |
| STT | Speech-to-Text |
| TLS | Transport Layer Security |
| TTS | Text-to-Speech |
| UUID | Universally Unique Identifier |
| VIP | Virtual IP |
| VRAM | Video Random Access Memory |
---
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - System overview
- [TECH-STACK.md](TECH-STACK.md) - Technology details
- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Entity definitions

README.md
@@ -1,3 +1,105 @@
# 🏠 DaviesTechLabs Homelab Architecture
> **Production-grade AI/ML platform running on bare-metal Kubernetes**
[![Talos](https://img.shields.io/badge/Talos-v1.12.1-blue?logo=linux)](https://talos.dev)
[![Kubernetes](https://img.shields.io/badge/Kubernetes-v1.35.0-326CE5?logo=kubernetes)](https://kubernetes.io)
[![Flux](https://img.shields.io/badge/GitOps-Flux-blue?logo=flux)](https://fluxcd.io)
[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
## 📖 Quick Navigation
| Document | Purpose |
|----------|---------|
| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** |
| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview |
| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack |
| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts |
| [GLOSSARY.md](GLOSSARY.md) | Terminology reference |
| [decisions/](decisions/) | Architecture Decision Records (ADRs) |
## 🎯 What This Is
A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring:
- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants
- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc
- **GitOps**: Flux CD with SOPS encryption
- **Event-Driven**: NATS JetStream for real-time messaging
- **ML Workflows**: Kubeflow Pipelines + Argo Workflows
## 🖥️ Cluster Overview
| Node | Role | Hardware | GPU |
|------|------|----------|-----|
| storm | Control Plane | Intel 13th Gen | Integrated |
| bruenor | Control Plane | Intel 13th Gen | Integrated |
| catti | Control Plane | Intel 13th Gen | Integrated |
| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA |
| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified |
| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 |
| danilo | Worker | Intel Core Ultra 9 | Intel Arc |
## 🚀 Quick Start
### View Current Cluster State
```bash
# Get node status
kubectl get nodes -o wide
# View AI/ML workloads
kubectl get pods -n ai-ml
# Check KServe inference services
kubectl get inferenceservices -n ai-ml
```
### Key Endpoints
| Service | URL | Purpose |
|---------|-----|---------|
| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI |
| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface |
| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant |
| Gitea | `git.daviestechlabs.io` | Self-hosted Git |
## 📂 Repository Structure
```
homelab-design/
├── README.md # This file
├── AGENT-ONBOARDING.md # AI agent quick-start
├── ARCHITECTURE.md # High-level system overview
├── CONTEXT-DIAGRAM.mmd # C4 Level 1 (Mermaid)
├── CONTAINER-DIAGRAM.mmd # C4 Level 2
├── TECH-STACK.md # Complete tech stack
├── DOMAIN-MODEL.md # Core entities
├── CODING-CONVENTIONS.md # Patterns & practices
├── GLOSSARY.md # Terminology
├── decisions/ # ADRs
│ ├── 0000-template.md
│ ├── 0001-record-architecture-decisions.md
│ ├── 0002-use-talos-linux.md
│ └── ...
├── specs/ # Feature specifications
└── diagrams/ # Additional diagrams
```
## 🔗 Related Repositories
| Repository | Purpose |
|------------|---------|
| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps |
| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows |
| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend |
## 📝 Contributing
1. For architecture changes, create an ADR in `decisions/`
2. Update relevant documentation
3. Submit a PR with context
---
*Last updated: 2026-02-01*

TECH-STACK.md Normal file
@@ -0,0 +1,271 @@
# 🛠️ Technology Stack
> **Complete inventory of technologies used in the DaviesTechLabs homelab**
## Platform Layer
### Operating System
| Component | Version | Purpose |
|-----------|---------|---------|
| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS |
| Kernel | 6.18.2-talos | Linux kernel with GPU drivers |
### Container Orchestration
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration |
| [containerd](https://containerd.io) | 2.1.6 | Container runtime |
| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF |
### GitOps
| Component | Version | Purpose |
|-----------|---------|---------|
| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery |
| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption |
| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management |
---
## AI/ML Layer
### Inference Engines
| Service | Framework | GPU | Model Type |
|---------|-----------|-----|------------|
| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models |
| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text |
| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech |
| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings |
| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking |
### ML Serving
| Component | Version | Purpose |
|-----------|---------|---------|
| [KServe](https://kserve.github.io) | v0.12+ | Model serving framework |
| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints |
### ML Workflows
| Component | Version | Purpose |
|-----------|---------|---------|
| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration |
| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows |
| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers |
| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry |
### GPU Scheduling
| Component | Version | Purpose |
|-----------|---------|---------|
| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling |
| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation |
| NVIDIA Device Plugin | Latest | CUDA GPU allocation |
| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection |
---
## Data Layer
### Databases
| Component | Version | Purpose |
|-----------|---------|---------|
| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata |
| [Milvus](https://milvus.io) | Latest | Vector database for RAG |
| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs |
| [Valkey](https://valkey.io) | Latest | Redis-compatible cache |
### Object Storage
| Component | Version | Purpose |
|-----------|---------|---------|
| [MinIO](https://min.io) | Latest | S3-compatible storage |
| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage |
| NFS CSI Driver | Latest | Shared filesystem |
### Messaging
| Component | Version | Purpose |
|-----------|---------|---------|
| [NATS](https://nats.io) | Latest | Message bus |
| NATS JetStream | Built-in | Persistent streaming |
### Data Processing
| Component | Version | Purpose |
|-----------|---------|---------|
| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics |
| [Apache Flink](https://flink.apache.org) | Latest | Stream processing |
| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format |
| [Nessie](https://projectnessie.org) | Latest | Data catalog |
| [Trino](https://trino.io) | 479 | SQL query engine |
---
## Application Layer
### Web Frameworks
| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
| Companions | Go | net/http + HTMX | AI chat interface |
| Voice WebApp | Python | Gradio | Voice assistant UI |
| Various handlers | Python | asyncio + nats.py | NATS event handlers |
### Frontend
| Technology | Purpose |
|------------|---------|
| [HTMX](https://htmx.org) | Dynamic HTML updates |
| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity |
| [VRM](https://vrm.dev) | 3D avatar rendering |
---
## Networking Layer
### Ingress
| Component | Version | Purpose |
|-----------|---------|---------|
| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation |
| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel |
### DNS & Certificates
| Component | Version | Purpose |
|-----------|---------|---------|
| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management |
| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation |
### Image Distribution
| Component | Purpose |
|-----------|---------|
| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution |
---
## Security Layer
### Identity & Access
| Component | Version | Purpose |
|-----------|---------|---------|
| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO |
| [Vault](https://vaultproject.io) | 1.21.2 | Secret management |
| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync |
### Runtime Security
| Component | Version | Purpose |
|-----------|---------|---------|
| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection |
| Cilium Network Policies | Built-in | Network segmentation |
### Backup
| Component | Version | Purpose |
|-----------|---------|---------|
| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore |
---
## Observability Layer
### Metrics
| Component | Purpose |
|-----------|---------|
| [Prometheus](https://prometheus.io) | Metrics collection |
| [Grafana](https://grafana.com) | Dashboards & visualization |
### Logging
| Component | Version | Purpose |
|-----------|---------|---------|
| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection |
| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation |
### Tracing
| Component | Purpose |
|-----------|---------|
| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection |
| Tempo/Jaeger | Trace storage & query |
---
## Development Tools
### Local Development
| Tool | Purpose |
|------|---------|
| [mise](https://mise.jdx.dev) | Tool version management |
| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) |
| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing |
### CI/CD
| Tool | Purpose |
|------|---------|
| GitHub Actions | CI/CD pipelines |
| [Renovate](https://renovatebot.com) | Dependency updates |
### Image Building
| Tool | Purpose |
|------|---------|
| Docker | Container builds |
| GHCR | Container registry |
---
## Media & Entertainment
| Component | Version | Purpose |
|-----------|---------|---------|
| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server |
| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share |
| Prowlarr, Bazarr, etc. | Various | *arr stack |
| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation |
---
## Python Dependencies (llm-workflows)
```toml
[project]
dependencies = [
    # Core
    "nats-py>=2.7.0",          # NATS client
    "msgpack>=1.0.0",          # Binary serialization
    "aiohttp>=3.9.0",          # HTTP client
    # ML/AI
    "pymilvus>=2.4.0",         # Milvus client
    "sentence-transformers",   # Embeddings
    "openai>=1.0.0",           # vLLM OpenAI-compatible API
    # Kubeflow
    "kfp>=2.12.1",             # Pipeline SDK
]
```
---
## Version Pinning Strategy
| Component Type | Strategy |
|----------------|----------|
| Base images | Pin major.minor |
| Helm charts | Pin exact version |
| Python packages | Pin minimum version |
| System extensions | Pin via Talos schematic |
## Related Documents
- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect
- [decisions/](decisions/) - Why we chose specific technologies


@@ -0,0 +1,71 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
## Decision Drivers
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].
### Positive Consequences
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
*
### Negative Consequences
* [e.g., compromising quality attribute, follow-up decisions required, …]
*
## Pros and Cons of the Options
### [option 1]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->


@@ -0,0 +1,79 @@
# Record Architecture Decisions
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation
## Context and Problem Statement
As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.
## Decision Drivers
* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements
## Considered Options
* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)
## Decision Outcome
Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.
### Positive Consequences
* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions
### Negative Consequences
* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions
## Pros and Cons of the Options
### Informal README documentation
* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose
### Wiki pages
* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure
### ADRs
* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions
### No documentation
* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes
## Links
* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)


@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster
## Context and Problem Statement
We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel).
## Decision Drivers
* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades
## Considered Options
* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2
## Decision Outcome
Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
### Positive Consequences
* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption
### Negative Consequences
* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
## Pros and Cons of the Options
### Ubuntu Server with kubeadm
* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management
### Flatcar Container Linux
* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex
### Talos Linux
* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging
### k3OS
* Good, because simple
* Bad, because discontinued
### Rocky Linux with RKE2
* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface
## Links
* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics


@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration
## Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
## Decision Drivers
* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)
## Considered Options
* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar
## Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
### Positive Consequences
* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint
### Negative Consequences
* Less ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ
## Pros and Cons of the Options
### Apache Kafka
* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages
### RabbitMQ
* Good, because mature and stable
* Good, because flexible routing
* Good, because good management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering
### Redis Pub/Sub + Streams
* Good, because simple
* Good, because a Redis-compatible store (Valkey) is already deployed
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not primary purpose of Redis
### NATS with JetStream
* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream newer than Kafka
### Apache Pulsar
* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
## Links
* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format

# Use MessagePack for NATS Messages
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages
## Context and Problem Statement
NATS messages in the AI platform carry various payloads:
- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters
We need a serialization format that handles both text and binary efficiently.
## Decision Drivers
* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Debugging ability
* Schema flexibility
## Considered Options
* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro
## Decision Outcome
Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.
### Positive Consequences
* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries
### Negative Consequences
* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON
## Pros and Cons of the Options
### JSON
* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
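The 33% base64 overhead claimed above is easy to confirm with the standard library: base64 maps every 3 raw bytes to 4 ASCII characters.

```python
import base64
import os

audio = os.urandom(48_000)          # stand-in for a short audio clip
encoded = base64.b64encode(audio)   # what a JSON payload would have to carry

overhead = len(encoded) / len(audio) - 1
print(f"{overhead:.0%}")  # → 33%
```

MessagePack avoids this entirely by carrying the bytes as a raw `bin` field.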
### Protocol Buffers
* Good, because very compact
* Good, because fast
* Good, because schema validation
* Good, because cross-language
* Bad, because requires schema definition
* Bad, because compilation step
* Bad, because less flexible for evolving schemas
* Bad, because overkill for simple messages
### MessagePack
* Good, because binary-efficient
* Good, because JSON-like simplicity
* Good, because no schema required
* Good, because excellent library support
* Good, because can include raw bytes
* Bad, because not human-readable raw
* Bad, because no schema validation
### CBOR
* Good, because binary-efficient
* Good, because IETF standard
* Good, because schema-less
* Bad, because less common libraries
* Bad, because smaller community
* Bad, because similar to msgpack with less adoption
### Avro
* Good, because schema evolution
* Good, because compact
* Good, because schema registry integration
* Bad, because requires schema
* Bad, because more complex setup
* Bad, because Java-centric ecosystem
## Implementation Notes
```python
# Python usage
import msgpack

# Serialize
data = {
    "user_id": "user-123",
    "audio": audio_bytes,  # Raw bytes, no base64
    "premium": True,
}
payload = msgpack.packb(data)

# Deserialize
data = msgpack.unpackb(payload, raw=False)
```
```go
// Go usage
import "github.com/vmihailenco/msgpack/v5"

type Message struct {
    UserID string `msgpack:"user_id"`
    Audio  []byte `msgpack:"audio"`
}
```
## Links
* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)

# Multi-GPU Heterogeneous Strategy
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity
## Considered Options
* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
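For deployment scripting, the allocation table above can be captured as a plain lookup. A sketch (node names are from this ADR; the resource identifiers are the usual device-plugin names, and the Intel one in particular is an assumption, not taken from the cluster):

```python
# Workload -> node/GPU mapping, mirroring the allocation table above.
GPU_ALLOCATION = {
    "vllm":       {"node": "khelben",   "resource": "amd.com/gpu"},
    "whisper":    {"node": "elminster", "resource": "nvidia.com/gpu"},
    "xtts":       {"node": "elminster", "resource": "nvidia.com/gpu"},
    "embeddings": {"node": "drizzt",    "resource": "amd.com/gpu"},
    "reranker":   {"node": "danilo",    "resource": "gpu.intel.com/i915"},  # assumed name
}

def node_for(workload: str) -> str:
    """Return the node a workload is pinned to."""
    return GPU_ALLOCATION[workload]["node"]
```

A helper like this keeps manifests and automation consistent with the ADR when nodes change.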
### Positive Consequences
* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging
### Negative Consequences
* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
## Implementation
### Node Taints
```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity
```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits
```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
## Pros and Cons of the Options
### Single GPU vendor only
* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware
### All workloads on largest GPU
* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)
* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks
### Dynamic GPU scheduling
* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature
## Links
* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics

# GitOps with Flux CD
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management
## Context and Problem Statement
Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.
## Decision Drivers
* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance
## Considered Options
* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes
## Decision Outcome
Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.
### Positive Consequences
* Git is single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Webhook-free sync (pull-based)
### Negative Consequences
* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers
## Configuration
### Repository Structure
```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                    # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml     # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/            # HelmRepositories
│   │       └── git/             # GitRepositories
│   └── apps/                    # Application Kustomizations
```
### Multi-Repository Sync
```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```
### SOPS Integration
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    # Age public key
    age: >-
      age1...
```
## Pros and Cons of the Options
### Manual kubectl apply
* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible
### ArgoCD
* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins
### Flux CD
* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve
### Rancher Fleet
* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community
### Pulumi/Terraform
* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because not continuous reconciliation
## Links
* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing

# Use KServe for ML Model Serving
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
## Considered Options
* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
### Positive Consequences
* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
### Negative Consequences
* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration
### KServe InferenceService
* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because Knative optional dependency
### Seldon Core
* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage
### BentoML
* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community
### Ray Serve
* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead
## Current Configuration
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
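Against a service like the one above, a V2-protocol call is a POST to `/v2/models/{name}/infer` with a standard request body. A payload-construction sketch (the host, tensor name, and shape are illustrative assumptions, not taken from the actual Whisper deployment):

```python
import json

def v2_infer_payload(input_name: str, data: list) -> str:
    """Build a KServe V2 (Open Inference Protocol) request body."""
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(data)],
                "datatype": "FP32",
                "data": data,
            }
        ]
    }
    return json.dumps(body)

# POST to http://whisper.ai-ml/v2/models/whisper/infer (hypothetical URL)
payload = v2_infer_payload("audio", [0.0, 0.1, 0.2])
```

Because every InferenceService speaks the same protocol, one client helper covers all of the deployed models.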
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation

# Use Milvus for Vector Storage
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting vector database for RAG system
## Context and Problem Statement
The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
## Decision Drivers
* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities
## Considered Options
* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma
## Decision Outcome
Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.
### Positive Consequences
* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Active CNCF sandbox project
* GPU acceleration available
### Negative Consequences
* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization
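Conceptually, what Milvus provides is top-k nearest-neighbour search over embeddings. A brute-force sketch in plain Python shows the operation; Milvus replaces the linear scan with ANN indexes (HNSW, IVF) to hit the < 100ms target at millions of vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=5):
    """Brute-force top-k similarity search (O(n) per query)."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return order[:k]
```

The returned indices map back to chunk IDs, which is exactly the shape of the RAG retrieval step in the chat and voice flows.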
## Pros and Cons of the Options
### Milvus
* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because CNCF project
* Bad, because complex architecture
* Bad, because higher resource usage
### Pinecone
* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns
### Qdrant
* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because good filtering
* Bad, because smaller community
* Bad, because fewer enterprise features
### Weaviate
* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements
### pgvector
* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scale
* Bad, because slower for large datasets
* Bad, because no specialized optimizations
### Chroma
* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features
## Links
* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities

# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we use one engine or leverage strengths of multiple?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
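In automation scripts the matrix above reduces to a trivial dispatch helper. A sketch (the use-case keys are mine, derived from this table; they are not identifiers used by either engine):

```python
# Engine choice per use case, mirroring the decision matrix above.
ENGINE_BY_USE_CASE = {
    "ml_training": "kubeflow",
    "model_evaluation": "kubeflow",
    "document_ingestion": "argo",
    "batch_inference": "argo",
    "complex_dag": "argo",
}

def pick_engine(use_case: str) -> str:
    # Hybrid ML training uses both: Argo orchestrates, KFP runs the ML steps.
    return ENGINE_BY_USE_CASE.get(use_case, "argo")
```

Defaulting to Argo keeps new workflow types on the general-purpose engine until they prove they need ML-specific features.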
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

        OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)

# Use Envoy Gateway for Ingress
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting ingress controller for cluster
## Context and Problem Statement
We need an ingress solution that supports:
- Gateway API (modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination
## Decision Drivers
* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at edge
* Active development
* Envoy ecosystem familiarity
## Considered Options
* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour
## Decision Outcome
Chosen option: "Envoy Gateway", because it's the reference implementation of Gateway API with full Envoy feature set.
### Positive Consequences
* Native Gateway API support
* Full Envoy feature set
* WebSocket and gRPC native
* No Istio complexity
* CNCF graduated project (Envoy)
* Easy integration with observability
### Negative Consequences
* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve
## Pros and Cons of the Options
### NGINX Ingress
* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API support
* Bad, because commercial features gated
### Traefik
* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt
* Bad, because Gateway API support is experimental
* Bad, because less gRPC focus
### Envoy Gateway
* Good, because Gateway API native
* Good, because full Envoy features
* Good, because extensible
* Good, because gRPC/WebSocket native
* Bad, because newer project
* Bad, because less community content
### Istio Gateway
* Good, because full mesh features
* Good, because Gateway API
* Bad, because overkill without mesh
* Bad, because resource heavy
### Contour
* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API support still evolving
* Bad, because smaller community
## Configuration Example
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```
## Links
* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)

diagrams/README.md
# Diagrams
This directory contains additional architecture diagrams beyond the main C4 diagrams.
## Available Diagrams
| File | Description |
|------|-------------|
| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution |
| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow |
| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow |
## Rendering Diagrams
### VS Code
Install the "Markdown Preview Mermaid Support" extension.
### CLI
```bash
# Using mmdc (Mermaid CLI)
npx -p @mermaid-js/mermaid-cli mmdc -i diagram.mmd -o diagram.png
```
### Online
Use [Mermaid Live Editor](https://mermaid.live)
## Diagram Conventions
1. Use `.mmd` extension for Mermaid diagrams
2. Include title as comment at top of file
3. Use consistent styling classes
4. Keep diagrams focused (one concept per diagram)

%% Chat Request Data Flow
%% Sequence diagram showing chat message processing
sequenceDiagram
autonumber
participant U as User
participant W as WebApp<br/>(companions)
participant N as NATS
participant C as Chat Handler
participant V as Valkey<br/>(Cache)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
U->>W: Send message
W->>N: Publish ai.chat.user.{id}.message
N->>C: Deliver message
C->>V: Get session history
V-->>C: Previous messages
alt RAG Enabled
C->>E: Generate query embedding
E-->>C: Query vector
C->>M: Search similar chunks
M-->>C: Top-K chunks
opt Reranker Enabled
C->>R: Rerank chunks
R-->>C: Reordered chunks
end
end
C->>L: LLM inference (context + query)
alt Streaming Enabled
loop For each token
L-->>C: Token
C->>N: Publish ai.chat.response.stream.{id}
N-->>W: Deliver chunk
W-->>U: Display token
end
else Non-streaming
L-->>C: Full response
C->>N: Publish ai.chat.response.{id}
N-->>W: Deliver response
W-->>U: Display response
end
C->>V: Save to session history

%% Voice Request Data Flow
%% Sequence diagram showing voice assistant processing
sequenceDiagram
autonumber
participant U as User
participant W as Voice WebApp
participant N as NATS
participant VA as Voice Assistant
participant STT as Whisper<br/>(STT)
participant E as BGE Embeddings
participant M as Milvus
participant R as Reranker
participant L as vLLM
participant TTS as XTTS<br/>(TTS)
U->>W: Record audio
W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes)
N->>VA: Deliver voice request
VA->>STT: Transcribe audio
STT-->>VA: Transcription text
alt RAG Enabled
VA->>E: Generate query embedding
E-->>VA: Query vector
VA->>M: Search similar chunks
M-->>VA: Top-K chunks
opt Reranker Enabled
VA->>R: Rerank chunks
R-->>VA: Reordered chunks
end
end
VA->>L: LLM inference
L-->>VA: Response text
VA->>TTS: Synthesize speech
TTS-->>VA: Audio bytes
VA->>N: Publish ai.voice.response.{id}<br/>(text + audio)
N-->>W: Deliver response
W-->>U: Play audio + show text
Note over VA,TTS: Total latency target: < 3s

%% GPU Allocation Diagram
%% Shows how AI workloads are distributed across GPU nodes
flowchart TB
subgraph khelben["🖥️ khelben (AMD Strix Halo 64GB)"]
direction TB
vllm["🧠 vLLM<br/>LLM Inference<br/>100% GPU"]
end
subgraph elminster["🖥️ elminster (NVIDIA RTX 2070 8GB)"]
direction TB
whisper["🎤 Whisper<br/>STT<br/>~50% GPU"]
xtts["🔊 XTTS<br/>TTS<br/>~50% GPU"]
end
subgraph drizzt["🖥️ drizzt (AMD Radeon 680M 12GB)"]
direction TB
embeddings["📊 BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"]
end
subgraph danilo["🖥️ danilo (Intel Arc)"]
direction TB
reranker["📋 BGE Reranker<br/>Document Ranking<br/>~80% GPU"]
end
subgraph workloads["Workload Routing"]
chat["💬 Chat Request"]
voice["🎤 Voice Request"]
end
chat --> embeddings
chat --> reranker
chat --> vllm
voice --> whisper
voice --> embeddings
voice --> reranker
voice --> vllm
voice --> xtts
classDef nvidia fill:#76B900,color:white
classDef amd fill:#ED1C24,color:white
classDef intel fill:#0071C5,color:white
class whisper,xtts nvidia
class vllm,embeddings amd
class reranker intel

# Binary Messages and JetStream Configuration
> Technical specification for NATS message handling in the AI platform
## Overview
The AI platform uses NATS with JetStream for message persistence. All messages use MessagePack (msgpack) binary format for efficiency, especially when handling audio data.
## Message Format
### Why MessagePack?
1. **Binary efficiency**: Audio data embedded directly without base64 overhead
2. **Compact**: 20-50% smaller than equivalent JSON
3. **Fast**: Lower serialization/deserialization overhead
4. **Compatible**: JSON-like structure, easy debugging
### Schema
All messages follow this general structure:
```python
{
    "request_id": str,   # UUID for correlation
    "user_id": str,      # User identifier
    "timestamp": float,  # Unix timestamp
    "payload": Any,      # Type-specific data
    "metadata": dict,    # Optional metadata
}
```
### Chat Message
```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "username": "john_doe",
    "message": "Hello, how are you?",
    "premium": False,
    "enable_streaming": True,
    "enable_rag": True,
    "enable_reranker": True,
    "top_k": 5,
    "session_id": "session-abc",
}
```
### Voice Message
```python
{
    "request_id": "uuid-here",
    "user_id": "user-123",
    "audio": b"...",  # Raw bytes, not base64!
    "format": "wav",
    "sample_rate": 16000,
    "premium": False,
    "enable_rag": True,
    "language": "en",
}
```
### Streaming Response Chunk
```python
{
    "request_id": "uuid-here",
    "type": "chunk",  # "chunk", "done", "error"
    "content": "token",
    "done": False,
    "timestamp": 1706000000.0,
}
```
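On the consumer side, a chunk stream in this shape is typically folded back into a single response. A minimal reassembly sketch (transport and msgpack decoding omitted):

```python
def assemble(chunks):
    """Collect streaming chunks until a terminal marker arrives."""
    parts = []
    for chunk in chunks:
        if chunk["type"] == "error":
            raise RuntimeError(chunk.get("content", "stream error"))
        if chunk["type"] == "done" or chunk["done"]:
            break
        parts.append(chunk["content"])
    return "".join(parts)
```

Checking both `type` and `done` keeps the consumer robust if a producer sets only one of the two terminal signals.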
## JetStream Configuration
### Streams
| Stream | Subjects | Retention | Max Age | Storage | Replicas |
|--------|----------|-----------|---------|---------|----------|
| `COMPANIONS_LOGINS` | `ai.chat.user.*.login` | Limits | 7 days | File | 1 |
| `COMPANIONS_CHAT` | `ai.chat.user.*.message`, `ai.chat.user.*.greeting.*` | Limits | 30 days | File | 1 |
| `AI_CHAT_STREAM` | `ai.chat.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_VOICE_STREAM` | `ai.voice.>` | Limits | 1 hour | File | 1 |
| `AI_VOICE_RESPONSE_STREAM` | `ai.voice.response.stream.>` | Limits | 5 min | Memory | 1 |
| `AI_PIPELINE` | `ai.pipeline.>` | Limits | 24 hours | File | 1 |
### Consumer Configuration
```yaml
# Durable consumer for chat handler
consumer:
  name: chat-handler
  durable_name: chat-handler
  filter_subjects:
    - "ai.chat.user.*.message"
  ack_policy: explicit
  ack_wait: 30s
  max_deliver: 3
  deliver_policy: new
```
### Stream Creation (CLI)
```bash
# Create chat stream
nats stream add COMPANIONS_CHAT \
  --subjects "ai.chat.user.*.message,ai.chat.user.*.greeting.*" \
  --retention limits \
  --max-age 30d \
  --storage file \
  --replicas 1

# Create ephemeral stream
nats stream add AI_CHAT_STREAM \
  --subjects "ai.chat.response.stream.>" \
  --retention limits \
  --max-age 5m \
  --storage memory \
  --replicas 1
```
## Python Implementation
### Publisher
```python
import uuid
from datetime import datetime, timezone

import msgpack
import nats

async def publish_chat_message(nc: nats.NATS, user_id: str, message: str):
    data = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "message": message,
        "timestamp": datetime.now(timezone.utc).timestamp(),
        "enable_streaming": True,
        "enable_rag": True,
    }
    subject = f"ai.chat.user.{user_id}.message"
    await nc.publish(subject, msgpack.packb(data))
```
### Subscriber (JetStream)
```python
# Assumes an established connection `nc` plus `process_chat` and
# `logger` defined elsewhere in the handler module.
async def message_handler(msg):
    try:
        data = msgpack.unpackb(msg.data, raw=False)
        # Process message
        result = await process_chat(data)
        # Publish response
        response_subject = f"ai.chat.response.{data['request_id']}"
        await nc.publish(response_subject, msgpack.packb(result))
        # Acknowledge
        await msg.ack()
    except Exception as e:
        logger.error(f"Handler error: {e}")
        await msg.nak(delay=5)  # Retry after 5s

# Subscribe with JetStream
js = nc.jetstream()
sub = await js.subscribe(
    "ai.chat.user.*.message",
    cb=message_handler,
    durable="chat-handler",
    manual_ack=True,
)
```
### Streaming Response
```python
async def stream_response(nc, request_id: str, response_generator):
    subject = f"ai.chat.response.stream.{request_id}"
    async for token in response_generator:
        chunk = {
            "request_id": request_id,
            "type": "chunk",
            "content": token,
            "done": False,
        }
        await nc.publish(subject, msgpack.packb(chunk))
    # Send done marker
    done = {
        "request_id": request_id,
        "type": "done",
        "content": "",
        "done": True,
    }
    await nc.publish(subject, msgpack.packb(done))
```
## Go Implementation
### Publisher
```go
import (
    "fmt"

    "github.com/google/uuid"
    "github.com/nats-io/nats.go"
    "github.com/vmihailenco/msgpack/v5"
)

type ChatMessage struct {
    RequestID string `msgpack:"request_id"`
    UserID    string `msgpack:"user_id"`
    Message   string `msgpack:"message"`
}

func PublishChat(nc *nats.Conn, userID, message string) error {
    msg := ChatMessage{
        RequestID: uuid.New().String(),
        UserID:    userID,
        Message:   message,
    }
    data, err := msgpack.Marshal(msg)
    if err != nil {
        return err
    }
    subject := fmt.Sprintf("ai.chat.user.%s.message", userID)
    return nc.Publish(subject, data)
}
```
## Error Handling
### NAK with Delay
```python
# Temporary failure - retry later
await msg.nak(delay=5)  # 5 second delay

# Permanent failure - move to dead letter
if attempt >= max_retries:
    await nc.publish("ai.dlq.chat", msg.data)
    await msg.term()  # Terminate delivery
```
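A fixed 5-second NAK delay can cause redeliveries to arrive in lockstep under load; a common refinement is exponential backoff with a cap. A sketch (not taken from the handlers; base and cap values are illustrative):

```python
def nak_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff for NAK redelivery: 5s, 10s, 20s, ... capped at 60s."""
    return min(cap, base * 2 ** (attempt - 1))
```

With `max_deliver: 3` on the consumer, only the first two retries ever fire, so the cap mainly matters if that limit is raised.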
### Dead Letter Queue
```yaml
stream:
  name: AI_DLQ
  subjects:
    - "ai.dlq.>"
  retention: limits
  max_age: 7d
  storage: file
```
## Monitoring
### Key Metrics
```bash
# Stream info
nats stream info COMPANIONS_CHAT
# Consumer info
nats consumer info COMPANIONS_CHAT chat-handler
# Message rate
nats stream report
```
### Prometheus Metrics
- `nats_stream_messages_total`
- `nats_consumer_pending_messages`
- `nats_consumer_ack_pending`
## Related
- [ADR-0003: Use NATS for Messaging](../decisions/0003-use-nats-for-messaging.md)
- [ADR-0004: Use MessagePack](../decisions/0004-use-messagepack-for-nats.md)
- [DOMAIN-MODEL.md](../DOMAIN-MODEL.md)

specs/README.md
# Specifications
This directory contains feature-level specifications and technical designs.
## Contents
- [BINARY_MESSAGES_AND_JETSTREAM.md](BINARY_MESSAGES_AND_JETSTREAM.md) - MessagePack format and JetStream configuration
- Future specs will be added here
## Spec Template
```markdown
# Feature Name
## Overview
Brief description of the feature
## Requirements
- Requirement 1
- Requirement 2
## Design
Technical design details
## API
Interface definitions
## Implementation Notes
Key implementation considerations
## Testing
Test strategy
## Open Questions
Unresolved items
```