diff --git a/AGENT-ONBOARDING.md b/AGENT-ONBOARDING.md new file mode 100644 index 0000000..ee43710 --- /dev/null +++ b/AGENT-ONBOARDING.md @@ -0,0 +1,191 @@ +# ๐Ÿค– Agent Onboarding + +> **This is the most important file for AI agents working on this codebase.** + +## TL;DR + +You are working on a **homelab Kubernetes cluster** running: +- **Talos Linux v1.12.1** on bare-metal nodes +- **Kubernetes v1.35.0** with Flux CD GitOps +- **AI/ML platform** with KServe, Kubeflow, Milvus, NATS +- **Multi-GPU** (AMD ROCm, NVIDIA CUDA, Intel Arc) + +## ๐Ÿ—บ๏ธ Repository Map + +| Repo | What It Contains | When to Edit | +|------|------------------|--------------| +| `homelab-k8s2` | Kubernetes manifests, Talos config, Flux | Infrastructure changes | +| `llm-workflows` | NATS handlers, Argo/KFP workflows | Workflow/handler changes | +| `companions-frontend` | Go server, HTMX UI, VRM avatars | Frontend changes | +| `homelab-design` (this) | Architecture docs, ADRs | Design decisions | + +## ๐Ÿ—๏ธ System Architecture (30-Second Version) + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ USER INTERFACES โ”‚ +โ”‚ Companions WebApp โ”‚ Voice WebApp โ”‚ Kubeflow UI โ”‚ CLI โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ WebSocket/HTTP + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ NATS MESSAGE BUS โ”‚ +โ”‚ Subjects: ai.chat.*, ai.voice.*, ai.pipeline.* โ”‚ +โ”‚ Format: MessagePack (binary) โ”‚ 
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ–ผ โ–ผ โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Chat Handler โ”‚ โ”‚Voice Assistantโ”‚ โ”‚Pipeline Bridgeโ”‚ +โ”‚ (RAG+LLM) โ”‚ โ”‚ (STTโ†’LLMโ†’TTS) โ”‚ โ”‚ (KFP/Argo) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ โ”‚ โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ AI SERVICES โ”‚ +โ”‚ Whisper โ”‚ XTTS โ”‚ vLLM โ”‚ Milvus โ”‚ BGE Embed โ”‚ Reranker โ”‚ +โ”‚ STT โ”‚ TTS โ”‚ LLM โ”‚ RAG โ”‚ Embed โ”‚ Rank โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## ๐Ÿ“ Key File Locations + +### Infrastructure (`homelab-k8s2`) + +``` +kubernetes/apps/ +โ”œโ”€โ”€ ai-ml/ # ๐Ÿง  AI/ML services +โ”‚ โ”œโ”€โ”€ kserve/ # InferenceServices +โ”‚ โ”œโ”€โ”€ kubeflow/ # Pipelines, Training Operator +โ”‚ โ”œโ”€โ”€ milvus/ # Vector database +โ”‚ โ”œโ”€โ”€ nats/ # Message bus +โ”‚ โ”œโ”€โ”€ vllm/ # LLM inference +โ”‚ โ””โ”€โ”€ llm-workflows/ # GitRepo sync to llm-workflows +โ”œโ”€โ”€ analytics/ # ๐Ÿ“Š Spark, Flink, ClickHouse +โ”œโ”€โ”€ observability/ # ๐Ÿ“ˆ 
Grafana, Alloy, OpenTelemetry +โ””โ”€โ”€ security/ # ๐Ÿ”’ Vault, Authentik, Falco + +talos/ +โ”œโ”€โ”€ talconfig.yaml # Node definitions +โ”œโ”€โ”€ patches/ # GPU-specific patches +โ”‚ โ”œโ”€โ”€ amd/amdgpu.yaml +โ”‚ โ””โ”€โ”€ nvidia/nvidia-runtime.yaml +``` + +### Workflows (`llm-workflows`) + +``` +workflows/ # NATS handler deployments +โ”œโ”€โ”€ chat-handler.yaml +โ”œโ”€โ”€ voice-assistant.yaml +โ””โ”€โ”€ pipeline-bridge.yaml + +argo/ # Argo WorkflowTemplates +โ”œโ”€โ”€ document-ingestion.yaml +โ”œโ”€โ”€ batch-inference.yaml +โ””โ”€โ”€ qlora-training.yaml + +pipelines/ # Kubeflow Pipeline Python +โ”œโ”€โ”€ voice_pipeline.py +โ””โ”€โ”€ document_ingestion_pipeline.py +``` + +## ๐Ÿ”Œ Service Endpoints (Internal) + +```python +# Copy-paste ready for Python code +NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222" +VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1" +WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local" +TTS_URL = "http://tts-predictor.ai-ml.svc.cluster.local" +EMBEDDINGS_URL = "http://embeddings-predictor.ai-ml.svc.cluster.local" +RERANKER_URL = "http://reranker-predictor.ai-ml.svc.cluster.local" +MILVUS_HOST = "milvus.ai-ml.svc.cluster.local" +MILVUS_PORT = 19530 +VALKEY_URL = "redis://valkey.ai-ml.svc.cluster.local:6379" +``` + +## ๐Ÿ“จ NATS Subject Patterns + +```python +# Chat +f"ai.chat.user.{user_id}.message" # User sends message +f"ai.chat.response.{request_id}" # Response back +f"ai.chat.response.stream.{request_id}" # Streaming tokens + +# Voice +f"ai.voice.user.{user_id}.request" # Voice input +f"ai.voice.response.{request_id}" # Voice output + +# Pipelines +"ai.pipeline.trigger" # Trigger any pipeline +f"ai.pipeline.status.{request_id}" # Status updates +``` + +## ๐ŸŽฎ GPU Allocation + +| Node | GPU | Workload | Memory | +|------|-----|----------|--------| +| khelben | AMD Strix Halo | vLLM (dedicated) | 64GB unified | +| elminster | NVIDIA RTX 2070 | Whisper + XTTS | 8GB VRAM | +| drizzt | AMD Radeon 680M | BGE 
Embeddings | 12GB VRAM | +| danilo | Intel Arc | Reranker | 16GB shared | + +## โšก Common Tasks + +### Deploy a New AI Service + +1. Create InferenceService in `homelab-k8s2/kubernetes/apps/ai-ml/kserve/` +2. Add endpoint to `llm-workflows/config/ai-services-config.yaml` +3. Push to main โ†’ Flux deploys automatically + +### Add a New Workflow + +1. Create handler in `llm-workflows/chat-handler/` or `llm-workflows/voice-assistant/` +2. Add Kubernetes Deployment in `llm-workflows/workflows/` +3. Push to main โ†’ Flux deploys automatically + +### Create Architecture Decision + +1. Copy `decisions/0000-template.md` to `decisions/NNNN-title.md` +2. Fill in context, decision, consequences +3. Submit PR + +## โŒ Antipatterns to Avoid + +1. **Don't hardcode secrets** - Use External Secrets Operator +2. **Don't use `latest` tags** - Pin versions for reproducibility +3. **Don't skip ADRs** - Document significant decisions +4. **Don't bypass Flux** - All changes via Git, never `kubectl apply` directly + +## ๐Ÿ“š Where to Learn More + +- [ARCHITECTURE.md](ARCHITECTURE.md) - Full system design +- [TECH-STACK.md](TECH-STACK.md) - All technologies used +- [decisions/](decisions/) - Why we made certain choices +- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities + +## ๐Ÿ†˜ Quick Debugging + +```bash +# Check Flux sync status +flux get all -A + +# View NATS JetStream streams +kubectl exec -n ai-ml deploy/nats-box -- nats stream ls + +# Check GPU allocation +kubectl describe node khelben | grep -A10 "Allocated" + +# View KServe inference services +kubectl get inferenceservices -n ai-ml + +# Tail AI service logs +kubectl logs -n ai-ml -l app=chat-handler -f +``` + +--- + +*This document is the canonical starting point for AI agents. 
When in doubt, check the ADRs.* diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 0000000..d554afa --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,287 @@ +# ๐Ÿ—๏ธ System Architecture + +> **Comprehensive technical overview of the DaviesTechLabs homelab infrastructure** + +## Overview + +The homelab is a production-grade Kubernetes cluster running on bare-metal hardware, designed for AI/ML workloads with multi-GPU support. It follows GitOps principles using Flux CD with SOPS-encrypted secrets. + +## System Layers + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ USER LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Companions WebAppโ”‚ โ”‚ Voice WebApp โ”‚ โ”‚ Kubeflow UI โ”‚ โ”‚ +โ”‚ โ”‚ HTMX + Alpine โ”‚ โ”‚ Gradio UI โ”‚ โ”‚ Pipeline Mgmt โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ WebSocket โ”‚ HTTP/WS โ”‚ HTTP โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ INGRESS LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Cloudflared Tunnel โ”€โ”€โ–บ Envoy Gateway โ”€โ”€โ–บ HTTPRoute CRDs โ”‚ +โ”‚ โ”‚ +โ”‚ External: *.daviestechlabs.io Internal: *.lab.daviestechlabs.io โ”‚ +โ”‚ โ€ข git.daviestechlabs.io โ€ข kubeflow.lab.daviestechlabs.io โ”‚ +โ”‚ โ€ข auth.daviestechlabs.io โ€ข companions-chat.lab... โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ MESSAGE BUS LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ NATS + JetStream โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Streams: โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข COMPANIONS_LOGINS (7d retention) - User analytics โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข COMPANIONS_CHAT (30d retention) - Chat history โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข AI_CHAT_STREAM (5min, 
memory) - Ephemeral streaming โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข AI_VOICE_STREAM (1h, file) - Voice processing โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข AI_PIPELINE (24h, file) - Workflow triggers โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ Message Format: MessagePack (binary, not JSON) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ–ผ โ–ผ โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Chat Handler โ”‚ โ”‚ Voice Assistant โ”‚ โ”‚ Pipeline Bridge โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ€ข RAG retrieval โ”‚ โ”‚ โ€ข STT (Whisper) โ”‚ โ”‚ โ€ข KFP triggers โ”‚ +โ”‚ โ€ข LLM inference โ”‚ โ”‚ โ€ข RAG retrieval โ”‚ โ”‚ โ€ข Argo triggers โ”‚ +โ”‚ โ€ข Streaming resp โ”‚ โ”‚ โ€ข LLM inference โ”‚ โ”‚ โ€ข Status updates โ”‚ +โ”‚ โ€ข Session state โ”‚ โ”‚ โ€ข TTS (XTTS) โ”‚ โ”‚ โ€ข Error handling โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ AI SERVICES LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Whisper โ”‚ โ”‚ XTTS โ”‚ โ”‚ vLLM โ”‚ โ”‚ Milvus โ”‚ โ”‚ BGE โ”‚ โ”‚Reranker โ”‚ โ”‚ +โ”‚ โ”‚ (STT) โ”‚ โ”‚ (TTS) โ”‚ โ”‚ (LLM) โ”‚ โ”‚ (RAG) โ”‚ โ”‚(Embed) โ”‚ โ”‚ (BGE) โ”‚ โ”‚ +โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ +โ”‚ โ”‚ KServe โ”‚ โ”‚ KServe โ”‚ โ”‚ vLLM โ”‚ โ”‚ Helm โ”‚ โ”‚ KServe โ”‚ โ”‚ KServe โ”‚ โ”‚ +โ”‚ โ”‚ nvidia โ”‚ โ”‚ nvidia โ”‚ โ”‚ ROCm โ”‚ โ”‚ Minio โ”‚ โ”‚ rdna2 โ”‚ โ”‚ intel โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ WORKFLOW ENGINE LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Argo Workflows โ”‚โ—„โ”€โ”€โ–บโ”‚ Kubeflow Pipelines โ”‚ โ”‚ +โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ +โ”‚ โ”‚ โ€ข Complex DAG orchestrationโ”‚ โ”‚ โ€ข ML pipeline caching โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Training workflows โ”‚ โ”‚ โ€ข Experiment tracking โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Document ingestion โ”‚ โ”‚ โ€ข Model versioning โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Batch inference โ”‚ โ”‚ โ€ข Artifact lineage โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ Trigger: Argo Events (EventSource โ†’ Sensor โ†’ Workflow/Pipeline) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ INFRASTRUCTURE LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Storage: Compute: Security: โ”‚ +โ”‚ โ”œโ”€ Longhorn (block) โ”œโ”€ Volcano Scheduler โ”œโ”€ Vault (secrets) โ”‚ +โ”‚ โ”œโ”€ NFS CSI (shared) โ”œโ”€ GPU Device Plugins โ”œโ”€ Authentik (SSO) โ”‚ +โ”‚ โ””โ”€ MinIO (S3) โ”‚ โ”œโ”€ AMD ROCm โ”œโ”€ Falco (runtime) โ”‚ +โ”‚ โ”‚ โ”œโ”€ NVIDIA CUDA โ””โ”€ SOPS (GitOps) โ”‚ +โ”‚ Databases: โ”‚ โ””โ”€ Intel i915/Arc โ”‚ +โ”‚ โ”œโ”€ CloudNative-PG โ””โ”€ Node Feature Discovery โ”‚ +โ”‚ โ”œโ”€ Valkey (cache) โ”‚ +โ”‚ โ””โ”€ ClickHouse (analytics) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ PLATFORM LAYER โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ Talos Linux v1.12.1 โ”‚ Kubernetes v1.35.0 โ”‚ Cilium CNI โ”‚ +โ”‚ โ”‚ +โ”‚ Nodes: storm, bruenor, catti (control) โ”‚ elminster, khelben, drizzt, โ”‚ +โ”‚ โ”‚ danilo (workers) โ”‚ 
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Node Topology + +### Control Plane (HA) + +| Node | IP | CPU | Memory | Storage | Role | +|------|-------|-----|--------|---------|------| +| storm | 192.168.100.25 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server | +| bruenor | 192.168.100.26 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server | +| catti | 192.168.100.27 | Intel 13th Gen (4c) | 16GB | 500GB NVMe | etcd, API server | + +**VIP**: 192.168.100.20 (shared across control plane) + +### Worker Nodes + +| Node | IP | CPU | GPU | GPU Memory | Workload | +|------|-------|-----|-----|------------|----------| +| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS | +| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) | +| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings | +| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker | + +## Networking + +### External Access + +``` +Internet โ†’ Cloudflare โ†’ cloudflared tunnel โ†’ Envoy Gateway โ†’ Services +``` + +### DNS Zones + +- **External**: `*.daviestechlabs.io` (Cloudflare DNS) +- **Internal**: `*.lab.daviestechlabs.io` (internal split-horizon) + +### Network CIDRs + +| Network | CIDR | Purpose | +|---------|------|---------| +| Node Network | 192.168.100.0/24 | Physical nodes | +| Pod Network | 10.42.0.0/16 | Kubernetes pods | +| Service Network | 10.43.0.0/16 | Kubernetes services | + +## Data Flow: Chat Request + +```mermaid +sequenceDiagram + participant U as User + participant W as WebApp + participant N as NATS + participant C as Chat Handler + participant M as Milvus + participant L as vLLM + participant V as Valkey + + U->>W: Send message + W->>N: 
Publish ai.chat.user.{id}.message + N->>C: Deliver to chat-handler + C->>V: Get session history + C->>M: RAG query (if enabled) + M-->>C: Relevant documents + C->>L: LLM inference (with context) + L-->>C: Streaming tokens + C->>N: Publish ai.chat.response.stream.{id} + N-->>W: Deliver streaming chunks + W-->>U: Display tokens + C->>V: Save to session +``` + +## GitOps Flow + +``` +Developer โ†’ Git Push โ†’ GitHub/Gitea + โ”‚ + โ–ผ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ Flux CD โ”‚ + โ”‚ (reconcile) โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ–ผ โ–ผ โ–ผ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚homelab- โ”‚ โ”‚ llm- โ”‚ โ”‚ helm โ”‚ + โ”‚ k8s2 โ”‚ โ”‚workflows โ”‚ โ”‚ charts โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ โ”‚ โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ Kubernetes โ”‚ + โ”‚ Cluster โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Security Architecture + +### Secrets Management + +``` +External Secrets Operator โ”€โ”€โ–บ Vault / SOPS โ”€โ”€โ–บ Kubernetes Secrets +``` + +### Authentication + +``` +User โ”€โ”€โ–บ Cloudflare Access โ”€โ”€โ–บ Authentik โ”€โ”€โ–บ Application + โ”‚ + โ””โ”€โ”€โ–บ OIDC/SAML providers +``` + +### Network Security + +- **Cilium**: Network policies, eBPF-based security +- **Falco**: Runtime security monitoring +- **RBAC**: Fine-grained Kubernetes permissions + +## High Availability + +### Control Plane + +- 3-node etcd cluster with automatic leader election +- Virtual IP (192.168.100.20) for API server access +- Automatic failover via Talos + +### Workloads + +- Pod anti-affinity for critical services +- HPA for 
auto-scaling +- PodDisruptionBudgets for controlled updates + +### Storage + +- Longhorn 3-replica default +- MinIO erasure coding for S3 +- Regular Velero backups + +## Observability + +### Metrics Pipeline + +``` +Applications โ”€โ”€โ–บ OpenTelemetry Collector โ”€โ”€โ–บ Prometheus โ”€โ”€โ–บ Grafana +``` + +### Logging Pipeline + +``` +Applications โ”€โ”€โ–บ Grafana Alloy โ”€โ”€โ–บ Loki โ”€โ”€โ–บ Grafana +``` + +### Tracing Pipeline + +``` +Applications โ”€โ”€โ–บ OpenTelemetry SDK โ”€โ”€โ–บ Jaeger/Tempo โ”€โ”€โ–บ Grafana +``` + +## Key Design Decisions + +| Decision | Rationale | ADR | +|----------|-----------|-----| +| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) | +| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) | +| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) | +| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) | +| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) | + +## Related Documents + +- [TECH-STACK.md](TECH-STACK.md) - Complete technology inventory +- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Core entities and relationships +- [decisions/](decisions/) - All architecture decisions diff --git a/CODING-CONVENTIONS.md b/CODING-CONVENTIONS.md new file mode 100644 index 0000000..9929c93 --- /dev/null +++ b/CODING-CONVENTIONS.md @@ -0,0 +1,424 @@ +# ๐Ÿ“ Coding Conventions + +> **Patterns, practices, and folder structure conventions for DaviesTechLabs repositories** + +## Repository Conventions + +### homelab-k8s2 (Infrastructure) + +``` +kubernetes/ +โ”œโ”€โ”€ apps/ # Application deployments +โ”‚ โ””โ”€โ”€ {namespace}/ # One folder per namespace +โ”‚ โ””โ”€โ”€ {app}/ # One folder per application +โ”‚ โ”œโ”€โ”€ app/ # Kubernetes manifests +โ”‚ โ”‚ โ”œโ”€โ”€ kustomization.yaml +โ”‚ 
โ”‚ โ”œโ”€โ”€ helmrelease.yaml # OR individual manifests +โ”‚ โ”‚ โ””โ”€โ”€ ... +โ”‚ โ””โ”€โ”€ ks.yaml # Flux Kustomization +โ”œโ”€โ”€ components/ # Reusable Kustomize components +โ””โ”€โ”€ flux/ # Flux system configuration +``` + +**Naming Conventions:** +- Namespaces: lowercase with hyphens (`ai-ml`, `cert-manager`) +- Apps: lowercase with hyphens (`chat-handler`, `voice-assistant`) +- Secrets: `{app}-{type}` (e.g., `milvus-credentials`) + +### llm-workflows (Orchestration) + +``` +workflows/ # Kubernetes Deployments for NATS handlers +โ”œโ”€โ”€ {handler}.yaml # One file per handler + +argo/ # Argo WorkflowTemplates +โ”œโ”€โ”€ {workflow-name}.yaml # One file per workflow + +pipelines/ # Kubeflow Pipeline Python files +โ”œโ”€โ”€ {pipeline}_pipeline.py # Pipeline definition +โ””โ”€โ”€ kfp-sync-job.yaml # Upload job + +{handler}/ # Python source code +โ”œโ”€โ”€ __init__.py +โ”œโ”€โ”€ {handler}.py # Main entry point +โ”œโ”€โ”€ requirements.txt +โ””โ”€โ”€ Dockerfile +``` + +--- + +## Python Conventions + +### Project Structure + +```python +# Use async/await for I/O +async def handle_message(msg: Msg) -> None: + ... + +# Use dataclasses for structured data +@dataclass +class ChatRequest: + user_id: str + message: str + enable_rag: bool = True + +# Use msgpack for NATS messages +import msgpack +data = msgpack.packb({"key": "value"}) +``` + +### Naming + +| Element | Convention | Example | +|---------|------------|---------| +| Files | snake_case | `chat_handler.py` | +| Classes | PascalCase | `ChatHandler` | +| Functions | snake_case | `process_message` | +| Constants | UPPER_SNAKE | `NATS_URL` | +| Private | Leading underscore | `_internal_method` | + +### Type Hints + +```python +# Always use type hints +from typing import Optional, List, Dict, Any + +async def query_rag( + query: str, + collection: str = "knowledge_base", + top_k: int = 5, +) -> List[Dict[str, Any]]: + ... 
+``` + +### Error Handling + +```python +# Use specific exceptions +class RAGQueryError(Exception): + """Raised when RAG query fails.""" + +# Log errors with context +import logging +logger = logging.getLogger(__name__) + +try: + result = await milvus.search(...) +except Exception as e: + logger.error(f"RAG query failed: {e}", extra={"query": query}) + raise RAGQueryError(f"Failed to query collection {collection}") from e +``` + +### NATS Message Handling + +```python +import logging + +import msgpack +from nats.aio.msg import Msg + +logger = logging.getLogger(__name__) + +async def message_handler(msg: Msg) -> None: + try: + # Decode MessagePack + data = msgpack.unpackb(msg.data, raw=False) + + # Process + result = await process(data) + + # Reply if request-reply pattern + if msg.reply: + await msg.respond(msgpack.packb(result)) + + # Acknowledge for JetStream + await msg.ack() + + except Exception as e: + logger.error(f"Handler error: {e}") + # NAK for retry (JetStream) + await msg.nak() +``` + +--- + +## Kubernetes Manifest Conventions + +### Labels + +```yaml +metadata: + labels: + # Required + app.kubernetes.io/name: chat-handler + app.kubernetes.io/instance: chat-handler + app.kubernetes.io/component: handler + app.kubernetes.io/part-of: ai-platform + + # Optional + app.kubernetes.io/version: "1.0.0" + app.kubernetes.io/managed-by: flux +``` + +### Annotations + +```yaml +metadata: + annotations: + # Reloader for config changes + reloader.stakater.com/auto: "true" + + # Documentation + description: "Handles chat messages via NATS" +``` + +### Resource Requests + +```yaml +resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 500m + memory: 512Mi + +# GPU workloads +resources: + limits: + amd.com/gpu: 1 # AMD + nvidia.com/gpu: 1 # NVIDIA +``` + +### Health Checks + +```yaml +livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 30 + +readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 10 +``` + +--- + +## Flux/GitOps
Conventions + +### Kustomization Structure + +```yaml +# ks.yaml - Flux Kustomization +apiVersion: kustomize.toolkit.fluxcd.io/v1 +kind: Kustomization +metadata: + name: &app chat-handler + namespace: flux-system +spec: + targetNamespace: ai-ml + commonMetadata: + labels: + app.kubernetes.io/name: *app + path: ./kubernetes/apps/ai-ml/chat-handler/app + prune: true + sourceRef: + kind: GitRepository + name: flux-system + wait: true + interval: 30m + retryInterval: 1m + timeout: 5m +``` + +### HelmRelease Structure + +```yaml +apiVersion: helm.toolkit.fluxcd.io/v2 +kind: HelmRelease +metadata: + name: milvus +spec: + interval: 30m + chart: + spec: + chart: milvus + version: 4.x.x + sourceRef: + kind: HelmRepository + name: milvus + namespace: flux-system + values: + # Values here +``` + +### Secret References + +```yaml +# Never hardcode secrets +env: + - name: DATABASE_PASSWORD + valueFrom: + secretKeyRef: + name: postgres-credentials + key: password +``` + +--- + +## NATS Subject Conventions + +### Hierarchy + +``` +ai.{domain}.{scope}.{action} + +Examples: +ai.chat.user.{userId}.message # User chat message +ai.chat.response.{requestId} # Chat response +ai.voice.user.{userId}.request # Voice request +ai.pipeline.trigger # Pipeline trigger +``` + +### Wildcards + +``` +ai.chat.> # All chat events +ai.chat.user.*.message # All user messages +ai.*.response.{id} # Any response type +``` + +--- + +## Git Conventions + +### Commit Messages + +``` +type(scope): subject + +body (optional) + +footer (optional) +``` + +**Types:** +- `feat`: New feature +- `fix`: Bug fix +- `docs`: Documentation +- `style`: Formatting +- `refactor`: Code restructuring +- `test`: Tests +- `chore`: Maintenance + +**Examples:** +``` +feat(chat-handler): add streaming response support +fix(voice): handle empty audio gracefully +docs(adr): add decision for MessagePack format +``` + +### Branch Naming + +``` +feature/short-description +fix/issue-number-description +docs/what-changed +``` + +--- + 
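## NATS Subject Matching (Sketch)

The `ai.{domain}.{scope}.{action}` hierarchy and the `*`/`>` wildcards described above are easy to get subtly wrong — for example, `>` only matches *trailing* tokens, and at least one of them. The sketch below models those token semantics in plain Python so the conventions can be sanity-checked; the helper names are illustrative only and not part of `llm-workflows`, and real matching is of course performed by the NATS server.

```python
# Illustrative sketch only: these helpers are not part of any repo here,
# and real wildcard matching is done by the NATS server. This models the
# token semantics of '*' (exactly one token) and '>' (one or more trailing
# tokens) against the ai.{domain}.{scope}.{action} convention.

def chat_message_subject(user_id: str) -> str:
    """Build the per-user chat subject, e.g. ai.chat.user.42.message."""
    return f"ai.chat.user.{user_id}.message"


def matches(pattern: str, subject: str) -> bool:
    """Check a concrete subject against a pattern that may use '*' or '>'."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last pattern token and match >= 1 remaining tokens
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens) or (token != "*" and token != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)


subject = chat_message_subject("alice")            # "ai.chat.user.alice.message"
assert matches("ai.chat.>", subject)               # all chat events
assert matches("ai.chat.user.*.message", subject)  # all user messages
assert not matches("ai.voice.>", subject)          # voice traffic is separate
```

A handler subscribed to `ai.chat.user.*.message` therefore sees every user's chat messages, while `ai.chat.>` also sees responses and streaming chunks; neither overlaps with `ai.voice.>` traffic.

---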
+## Configuration Conventions + +### Environment Variables + +```python +# Use pydantic-settings or similar +from pydantic_settings import BaseSettings + +class Settings(BaseSettings): + nats_url: str = "nats://localhost:4222" + vllm_url: str = "http://localhost:8000" + milvus_host: str = "localhost" + milvus_port: int = 19530 + log_level: str = "INFO" + + class Config: + env_prefix = "" # No prefix +``` + +### ConfigMaps + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: ai-services-config +data: + NATS_URL: "nats://nats.ai-ml.svc.cluster.local:4222" + VLLM_URL: "http://llm-draft.ai-ml.svc.cluster.local:8000/v1" + # ... other non-sensitive config +``` + +--- + +## Documentation Conventions + +### ADR Format + +See [decisions/0000-template.md](decisions/0000-template.md) + +### Code Comments + +```python +# Use docstrings for public functions +async def query_rag(query: str) -> List[Dict]: + """ + Query the RAG system for relevant documents. + + Args: + query: The search query string + + Returns: + List of document chunks with scores + + Raises: + RAGQueryError: If the query fails + """ + ... +``` + +### README Files + +Each application should have a README with: +1. Purpose +2. Configuration +3. Deployment +4. Local development +5. 
API documentation (if applicable) + +--- + +## Anti-Patterns to Avoid + +| Don't | Do Instead | +|-------|------------| +| `kubectl apply` directly | Commit to Git, let Flux deploy | +| Hardcode secrets | Use External Secrets Operator | +| Use `latest` image tags | Pin to specific versions | +| Skip health checks | Always define liveness/readiness | +| Ignore resource limits | Set appropriate requests/limits | +| Use JSON for NATS messages | Use MessagePack (binary) | +| Synchronous I/O in handlers | Use async/await | + +--- + +## Related Documents + +- [TECH-STACK.md](TECH-STACK.md) - Technologies used +- [ARCHITECTURE.md](ARCHITECTURE.md) - System design +- [decisions/](decisions/) - Why we made certain choices diff --git a/CONTAINER-DIAGRAM.mmd b/CONTAINER-DIAGRAM.mmd new file mode 100644 index 0000000..7a5bfea --- /dev/null +++ b/CONTAINER-DIAGRAM.mmd @@ -0,0 +1,123 @@ +%% C4 Container Diagram - Level 2 +%% DaviesTechLabs Homelab AI/ML Platform +%% +%% To render: Use Mermaid Live Editor or VS Code Mermaid extension + +graph TB + subgraph users["Users"] + user["๐Ÿ‘ค User"] + end + + subgraph ingress["Ingress Layer"] + cloudflared["cloudflared
(Tunnel)"] + envoy["Envoy Gateway
(HTTPRoute)"] + end + + subgraph frontends["Frontend Applications"] + companions["Companions WebApp
[Go + HTMX]
AI Chat Interface"] + voice["Voice WebApp
[Gradio]
Voice Assistant UI"] + kubeflow_ui["Kubeflow UI
[React]
Pipeline Management"] + end + + subgraph messaging["Message Bus"] + nats["NATS
[JetStream]
Event Streaming"] + end + + subgraph handlers["NATS Handlers"] + chat_handler["Chat Handler
[Python]
RAG + LLM Orchestration"] + voice_handler["Voice Assistant
[Python]
STT โ†’ LLM โ†’ TTS"] + pipeline_bridge["Pipeline Bridge
[Python]
Workflow Triggers"] + end + + subgraph ai_services["AI Services (KServe)"] + whisper["Whisper
[faster-whisper]
Speech-to-Text"] + xtts["XTTS
[Coqui]
Text-to-Speech"] + vllm["vLLM
[ROCm]
LLM Inference"] + embeddings["BGE Embeddings
[sentence-transformers]
Vector Encoding"] + reranker["BGE Reranker
[sentence-transformers]
Document Ranking"] + end + + subgraph storage["Data Stores"] + milvus["Milvus
[Vector DB]
RAG Storage"] + valkey["Valkey
[Redis API]
Session Cache"] + postgres["CloudNative-PG
[PostgreSQL]
Metadata"] + minio["MinIO
[S3 API]
Object Storage"] + end + + subgraph workflows["Workflow Engines"] + argo["Argo Workflows
[DAG Engine]
Complex Pipelines"] + kfp["Kubeflow Pipelines
[ML Platform]
Training + Inference"] + argo_events["Argo Events
[Event Source]
NATS โ†’ Workflow"] + end + + subgraph mlops["MLOps"] + mlflow["MLflow
[Tracking Server]
Experiment Tracking"] + volcano["Volcano
[Scheduler]
GPU Scheduling"] + end + + %% User flow + user --> cloudflared + cloudflared --> envoy + envoy --> companions + envoy --> voice + envoy --> kubeflow_ui + + %% Frontend to NATS + companions --> |WebSocket| nats + voice --> |HTTP/WS| nats + + %% NATS to handlers + nats --> chat_handler + nats --> voice_handler + nats --> pipeline_bridge + + %% Handlers to AI services + chat_handler --> embeddings + chat_handler --> reranker + chat_handler --> vllm + chat_handler --> milvus + chat_handler --> valkey + + voice_handler --> whisper + voice_handler --> embeddings + voice_handler --> reranker + voice_handler --> vllm + voice_handler --> xtts + + %% Pipeline flow + pipeline_bridge --> argo_events + argo_events --> argo + argo_events --> kfp + kubeflow_ui --> kfp + + %% Workflow to AI + argo --> ai_services + kfp --> ai_services + kfp --> mlflow + + %% Storage connections + ai_services --> minio + milvus --> minio + kfp --> postgres + mlflow --> postgres + mlflow --> minio + + %% GPU scheduling + volcano -.-> vllm + volcano -.-> whisper + volcano -.-> xtts + + %% Styling + classDef frontend fill:#90EE90,stroke:#333 + classDef handler fill:#87CEEB,stroke:#333 + classDef ai fill:#FFB6C1,stroke:#333 + classDef storage fill:#DDA0DD,stroke:#333 + classDef workflow fill:#F0E68C,stroke:#333 + classDef messaging fill:#FFA500,stroke:#333 + + class companions,voice,kubeflow_ui frontend + class chat_handler,voice_handler,pipeline_bridge handler + class whisper,xtts,vllm,embeddings,reranker ai + class milvus,valkey,postgres,minio storage + class argo,kfp,argo_events,mlflow,volcano workflow + class nats messaging diff --git a/CONTEXT-DIAGRAM.mmd b/CONTEXT-DIAGRAM.mmd new file mode 100644 index 0000000..78c075e --- /dev/null +++ b/CONTEXT-DIAGRAM.mmd @@ -0,0 +1,69 @@ +%% C4 Context Diagram - Level 1 +%% DaviesTechLabs Homelab System Context +%% +%% To render: Use Mermaid Live Editor or VS Code Mermaid extension + +graph TB + subgraph users["External Users"] + dev["๐Ÿ‘ค Developer
(Billy)"] + family["๐Ÿ‘ฅ Family Members"] + agents["๐Ÿค– AI Agents"] + end + + subgraph external["External Systems"] + cf["โ˜๏ธ Cloudflare
DNS + Tunnel"] + gh["๐Ÿ™ GitHub
Source Code"] + ghcr["๐Ÿ“ฆ GHCR
Container Registry"] + hf["๐Ÿค— Hugging Face
Model Registry"] + end + + subgraph homelab["๐Ÿ  DaviesTechLabs Homelab"] + direction TB + + subgraph apps["Application Layer"] + companions["๐Ÿ’ฌ Companions
AI Chat"] + voice["๐ŸŽค Voice Assistant"] + media["๐ŸŽฌ Media Services
(Jellyfin, *arr)"] + productivity["๐Ÿ“ Productivity
(Nextcloud, Gitea)"] + end + + subgraph platform["Platform Layer"] + k8s["โ˜ธ๏ธ Kubernetes Cluster
Talos Linux"] + end + + subgraph ai["AI/ML Layer"] + inference["๐Ÿง  Inference Services
(vLLM, Whisper, XTTS)"] + workflows["โš™๏ธ Workflow Engines
(Kubeflow, Argo)"] + vectordb["๐Ÿ“š Vector Store
(Milvus)"] + end + end + + %% User interactions + dev --> |manages| productivity + dev --> |develops| k8s + family --> |uses| media + family --> |chats| companions + agents --> |queries| inference + + %% External integrations + cf --> |routes traffic| apps + gh --> |GitOps sync| k8s + ghcr --> |pulls images| k8s + hf --> |downloads models| inference + + %% Internal relationships + apps --> platform + ai --> platform + companions --> inference + voice --> inference + workflows --> inference + inference --> vectordb + + %% Styling + classDef external fill:#f9f,stroke:#333,stroke-width:2px + classDef homelab fill:#bbf,stroke:#333,stroke-width:2px + classDef user fill:#bfb,stroke:#333,stroke-width:2px + + class cf,gh,ghcr,hf external + class companions,voice,media,productivity,k8s,inference,workflows,vectordb homelab + class dev,family,agents user diff --git a/DOMAIN-MODEL.md b/DOMAIN-MODEL.md new file mode 100644 index 0000000..94b076b --- /dev/null +++ b/DOMAIN-MODEL.md @@ -0,0 +1,345 @@ +# ๐Ÿ“Š Domain Model + +> **Core entities, bounded contexts, and relationships in the DaviesTechLabs homelab** + +## Bounded Contexts + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ BOUNDED CONTEXTS โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ CHAT CONTEXT โ”‚ โ”‚ VOICE CONTEXT โ”‚ โ”‚ WORKFLOW CONTEXT โ”‚ โ”‚ +โ”‚ 
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ +โ”‚ โ”‚ โ€ข ChatSession โ”‚ โ”‚ โ€ข VoiceSession โ”‚ โ”‚ โ€ข Pipeline โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข ChatMessage โ”‚ โ”‚ โ€ข AudioChunk โ”‚ โ”‚ โ€ข PipelineRun โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Conversation โ”‚ โ”‚ โ€ข Transcription โ”‚ โ”‚ โ€ข Artifact โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข User โ”‚ โ”‚ โ€ข SynthesizedAudioโ”‚ โ”‚ โ€ข Experiment โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ +โ”‚ โ–ผ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ INFERENCE CONTEXT โ”‚ โ”‚ +โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ +โ”‚ โ”‚ โ€ข InferenceRequest โ€ข Model โ€ข Embedding โ€ข Document โ€ข Chunk โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +## Core Entities + +### User Context + +```yaml 
+User: + id: string (UUID) + username: string + premium: boolean + preferences: + voice_id: string + model_preference: string + enable_rag: boolean + created_at: timestamp + +Session: + id: string (UUID) + user_id: string + type: "chat" | "voice" + started_at: timestamp + last_activity: timestamp + metadata: object +``` + +### Chat Context + +```yaml +ChatMessage: + id: string (UUID) + session_id: string + user_id: string + role: "user" | "assistant" | "system" + content: string + created_at: timestamp + metadata: + tokens_used: integer + latency_ms: float + rag_sources: string[] + model_used: string + +Conversation: + id: string (UUID) + user_id: string + messages: ChatMessage[] + title: string (auto-generated) + created_at: timestamp + updated_at: timestamp +``` + +### Voice Context + +```yaml +VoiceRequest: + id: string (UUID) + user_id: string + audio_b64: string (base64) + format: "wav" | "webm" | "mp3" + language: string + premium: boolean + enable_rag: boolean + +VoiceResponse: + id: string (UUID) + request_id: string + transcription: string + response_text: string + audio_b64: string (base64) + audio_format: string + latency_ms: float + rag_docs_used: integer +``` + +### Inference Context + +```yaml +InferenceRequest: + id: string (UUID) + service: "llm" | "stt" | "tts" | "embeddings" | "reranker" + input: string | bytes + parameters: object + priority: "standard" | "premium" + +InferenceResponse: + id: string (UUID) + request_id: string + output: string | bytes | float[] + metadata: + model: string + latency_ms: float + tokens: integer (if applicable) +``` + +### RAG Context + +```yaml +Document: + id: string (UUID) + collection: string + title: string + content: string + source_url: string + ingested_at: timestamp + +Chunk: + id: string (UUID) + document_id: string + content: string + embedding: float[1024] # BGE-large dimensions + metadata: + position: integer + page: integer + +RAGQuery: + query: string + collection: string + top_k: integer (default: 5) 
+ rerank: boolean (default: true) + rerank_top_k: integer (default: 3) + +RAGResult: + chunks: Chunk[] + scores: float[] + reranked: boolean +``` + +### Workflow Context + +```yaml +Pipeline: + id: string + name: string + version: string + engine: "kubeflow" | "argo" + definition: object (YAML) + +PipelineRun: + id: string (UUID) + pipeline_id: string + status: "pending" | "running" | "succeeded" | "failed" + started_at: timestamp + completed_at: timestamp + parameters: object + artifacts: Artifact[] + +Artifact: + id: string (UUID) + run_id: string + name: string + type: "model" | "dataset" | "metrics" | "logs" + uri: string (s3://) + metadata: object + +Experiment: + id: string (UUID) + name: string + runs: PipelineRun[] + metrics: object + created_at: timestamp +``` + +--- + +## Entity Relationships + +```mermaid +erDiagram + USER ||--o{ SESSION : has + USER ||--o{ CONVERSATION : owns + SESSION ||--o{ CHAT_MESSAGE : contains + CONVERSATION ||--o{ CHAT_MESSAGE : contains + + USER ||--o{ VOICE_REQUEST : makes + VOICE_REQUEST ||--|| VOICE_RESPONSE : produces + + DOCUMENT ||--o{ CHUNK : contains + CHUNK }|--|| EMBEDDING : has + + PIPELINE ||--o{ PIPELINE_RUN : executed_as + PIPELINE_RUN ||--o{ ARTIFACT : produces + EXPERIMENT ||--o{ PIPELINE_RUN : tracks + + INFERENCE_REQUEST }|--|| INFERENCE_RESPONSE : produces +``` + +--- + +## Aggregate Roots + +| Aggregate | Root Entity | Child Entities | +|-----------|-------------|----------------| +| Chat | Conversation | ChatMessage | +| Voice | VoiceRequest | VoiceResponse | +| RAG | Document | Chunk, Embedding | +| Workflow | PipelineRun | Artifact | +| User | User | Session, Preferences | + +--- + +## Event Flow + +### Chat Event Stream + +``` +UserLogin + โ””โ”€โ–บ SessionCreated + โ””โ”€โ–บ MessageReceived + โ”œโ”€โ–บ RAGQueryExecuted (optional) + โ”œโ”€โ–บ InferenceRequested + โ””โ”€โ–บ ResponseGenerated + โ””โ”€โ–บ MessageStored +``` + +### Voice Event Stream + +``` +VoiceRequestReceived + โ””โ”€โ–บ 
TranscriptionStarted
+      └─► TranscriptionCompleted
+          └─► RAGQueryExecuted (optional)
+              └─► LLMInferenceStarted
+                  └─► LLMResponseGenerated
+                      └─► TTSSynthesisStarted
+                          └─► AudioResponseReady
+```
+
+### Workflow Event Stream
+
+```
+PipelineTriggerReceived
+  └─► PipelineRunCreated
+      └─► StepStarted (repeated)
+          └─► StepCompleted (repeated)
+              └─► ArtifactProduced (repeated)
+      └─► PipelineRunCompleted
+```
+
+---
+
+## Data Retention
+
+| Entity | Retention | Storage |
+|--------|-----------|---------|
+| ChatMessage | 30 days | JetStream → PostgreSQL |
+| VoiceRequest/Response | 1 hour (audio), 30 days (text) | JetStream → PostgreSQL |
+| Chunk/Embedding | Permanent | Milvus |
+| PipelineRun | Permanent | PostgreSQL |
+| Artifact | Permanent | MinIO |
+| Session | 7 days | Valkey |
+
+---
+
+## Invariants
+
+### Chat Context
+- A ChatMessage must belong to exactly one Conversation
+- A Conversation must have at least one ChatMessage
+- Messages are immutable once created
+
+### Voice Context
+- A VoiceResponse must have a corresponding VoiceRequest
+- Audio format must be one of: wav, webm, mp3
+- Transcription cannot be empty for valid audio
+
+### RAG Context
+- A Chunk must belong to exactly one Document
+- Embedding dimensions must match the model (1024 for BGE-large)
+- A Document must have at least one Chunk
+
+### Workflow Context
+- A PipelineRun must reference a valid Pipeline
+- Artifacts must have valid S3 URIs
+- Run status transitions: pending → running → (succeeded|failed)
+
+---
+
+## Value Objects
+
+```python
+# Immutable value objects
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class MessageContent:
+    text: str
+    tokens: int
+
+@dataclass(frozen=True)
+class AudioData:
+    data: bytes
+    format: str
+    duration_ms: int
+    sample_rate: int
+
+@dataclass(frozen=True)
+class EmbeddingVector:
+    values: tuple[float, ...]
+    model: str
+    dimensions: int
+
+@dataclass(frozen=True)
+class RAGContext:
+    chunks: tuple[str, ...] 
+ scores: tuple[float, ...] + query: str +``` + +--- + +## Related Documents + +- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture +- [GLOSSARY.md](GLOSSARY.md) - Term definitions +- [decisions/0004-use-messagepack-for-nats.md](decisions/0004-use-messagepack-for-nats.md) - Message format decision diff --git a/GLOSSARY.md b/GLOSSARY.md new file mode 100644 index 0000000..ce5efa2 --- /dev/null +++ b/GLOSSARY.md @@ -0,0 +1,242 @@ +# ๐Ÿ“– Glossary + +> **Terminology and abbreviations used in the DaviesTechLabs homelab** + +## A + +**ADR (Architecture Decision Record)** +: A document that captures an important architectural decision, including context, decision, and consequences. + +**Argo Events** +: Event-driven automation for Kubernetes that triggers workflows based on events from various sources. + +**Argo Workflows** +: A container-native workflow engine for orchestrating parallel jobs on Kubernetes. + +**Authentik** +: Self-hosted identity provider supporting SAML, OIDC, and other protocols. + +## B + +**BGE (BAAI General Embedding)** +: A family of embedding models from BAAI used for semantic search and RAG. + +**Bounded Context** +: A DDD concept defining a boundary within which a particular domain model applies. + +## C + +**C4 Model** +: A hierarchical approach to software architecture diagrams: Context, Container, Component, Code. + +**Cilium** +: eBPF-based networking, security, and observability for Kubernetes. + +**CloudNative-PG** +: Kubernetes operator for PostgreSQL databases. + +**CNI (Container Network Interface)** +: Standard for configuring network interfaces in Linux containers. + +## D + +**DDD (Domain-Driven Design)** +: Software design approach focusing on the core domain and domain logic. + +## E + +**Embedding** +: A vector representation of text, used for semantic similarity and search. + +**Envoy Gateway** +: Kubernetes Gateway API implementation using Envoy proxy. 
+ +**External Secrets Operator (ESO)** +: Kubernetes operator that syncs secrets from external stores (Vault, etc.). + +## F + +**Falco** +: Runtime security tool that detects anomalous activity in containers. + +**Flux CD** +: GitOps toolkit for Kubernetes, continuously reconciling cluster state with Git. + +## G + +**GitOps** +: Operational practice using Git as the single source of truth for declarative infrastructure. + +**GPU Device Plugin** +: Kubernetes plugin that exposes GPU resources to containers. + +## H + +**HelmRelease** +: Flux CRD for managing Helm chart releases declaratively. + +**HTTPRoute** +: Kubernetes Gateway API resource for HTTP routing rules. + +## I + +**InferenceService** +: KServe CRD for deploying ML models with autoscaling and traffic management. + +## J + +**JetStream** +: NATS persistence layer providing streaming, key-value, and object stores. + +## K + +**KServe** +: Kubernetes-native platform for deploying and serving ML models. + +**Kubeflow** +: ML toolkit for Kubernetes, including pipelines, training operators, and more. + +**Kustomization** +: Flux CRD for applying Kustomize overlays from Git sources. + +## L + +**LLM (Large Language Model)** +: AI model trained on vast text data, capable of generating human-like text. + +**Longhorn** +: Cloud-native distributed storage for Kubernetes. + +## M + +**MessagePack (msgpack)** +: Binary serialization format, more compact than JSON. + +**Milvus** +: Open-source vector database for similarity search and AI applications. + +**MLflow** +: Platform for managing the ML lifecycle: experiments, models, deployment. + +**MinIO** +: S3-compatible object storage. + +## N + +**NATS** +: Cloud-native messaging system for microservices, IoT, and serverless. + +**Node Feature Discovery (NFD)** +: Kubernetes add-on for detecting hardware features on nodes. + +## P + +**Pipeline** +: In ML context, a DAG of components that process data and train/serve models. 
+ +**Premium User** +: User tier with enhanced features (more RAG docs, priority routing). + +## R + +**RAG (Retrieval-Augmented Generation)** +: AI technique combining document retrieval with LLM generation for grounded responses. + +**Reranker** +: Model that rescores retrieved documents based on relevance to a query. + +**ROCm** +: AMD's open-source GPU computing platform (alternative to CUDA). + +## S + +**Schematic** +: Talos Linux concept for defining system extensions and configurations. + +**SOPS (Secrets OPerationS)** +: Tool for encrypting secrets in Git repositories. + +**STT (Speech-to-Text)** +: Converting spoken audio to text (e.g., Whisper). + +**Strix Halo** +: AMD's unified memory architecture for APUs with large GPU memory. + +## T + +**Talos Linux** +: Minimal, immutable Linux distribution designed specifically for Kubernetes. + +**TTS (Text-to-Speech)** +: Converting text to spoken audio (e.g., XTTS/Coqui). + +## V + +**Valkey** +: Redis-compatible in-memory data store (Redis fork). + +**vLLM** +: High-throughput LLM serving engine with PagedAttention. + +**VIP (Virtual IP)** +: IP address shared among multiple hosts for high availability. + +**Volcano** +: Kubernetes batch scheduler for high-performance workloads (ML, HPC). + +**VRM** +: File format for 3D humanoid avatars. + +## W + +**Whisper** +: OpenAI's speech recognition model. + +## X + +**XTTS** +: Coqui's multi-language text-to-speech model with voice cloning. 
+
+---
+
+## Acronyms Quick Reference
+
+| Acronym | Full Form |
+|---------|-----------|
+| ADR | Architecture Decision Record |
+| API | Application Programming Interface |
+| BGE | BAAI General Embedding |
+| CI/CD | Continuous Integration/Continuous Deployment |
+| CRD | Custom Resource Definition |
+| DAG | Directed Acyclic Graph |
+| DDD | Domain-Driven Design |
+| ESO | External Secrets Operator |
+| GPU | Graphics Processing Unit |
+| HA | High Availability |
+| HPA | Horizontal Pod Autoscaler |
+| LLM | Large Language Model |
+| ML | Machine Learning |
+| NATS | Neural Autonomic Transport System (historical name) |
+| NFD | Node Feature Discovery |
+| OIDC | OpenID Connect |
+| RAG | Retrieval-Augmented Generation |
+| RBAC | Role-Based Access Control |
+| ROCm | Radeon Open Compute |
+| S3 | Simple Storage Service |
+| SAML | Security Assertion Markup Language |
+| SOPS | Secrets OPerationS |
+| SSO | Single Sign-On |
+| STT | Speech-to-Text |
+| TLS | Transport Layer Security |
+| TTS | Text-to-Speech |
+| UUID | Universally Unique Identifier |
+| VIP | Virtual IP |
+| VRAM | Video Random Access Memory |
+
+---
+
+## Related Documents
+
+- [ARCHITECTURE.md](ARCHITECTURE.md) - System overview
+- [TECH-STACK.md](TECH-STACK.md) - Technology details
+- [DOMAIN-MODEL.md](DOMAIN-MODEL.md) - Entity definitions
diff --git a/README.md b/README.md
index a408e86..35449c6 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,105 @@
-# homelab-design
+# 🏠 DaviesTechLabs Homelab Architecture
-homelab design process goes here. 
\ No newline at end of file +> **Production-grade AI/ML platform running on bare-metal Kubernetes** + +[![Talos](https://img.shields.io/badge/Talos-v1.12.1-blue?logo=linux)](https://talos.dev) +[![Kubernetes](https://img.shields.io/badge/Kubernetes-v1.35.0-326CE5?logo=kubernetes)](https://kubernetes.io) +[![Flux](https://img.shields.io/badge/GitOps-Flux-blue?logo=flux)](https://fluxcd.io) +[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE) + +## ๐Ÿ“– Quick Navigation + +| Document | Purpose | +|----------|---------| +| [AGENT-ONBOARDING.md](AGENT-ONBOARDING.md) | **Start here if you're an AI agent** | +| [ARCHITECTURE.md](ARCHITECTURE.md) | High-level system overview | +| [TECH-STACK.md](TECH-STACK.md) | Complete technology stack | +| [DOMAIN-MODEL.md](DOMAIN-MODEL.md) | Core entities and bounded contexts | +| [GLOSSARY.md](GLOSSARY.md) | Terminology reference | +| [decisions/](decisions/) | Architecture Decision Records (ADRs) | + +## ๐ŸŽฏ What This Is + +A comprehensive architecture documentation repository for the DaviesTechLabs homelab Kubernetes cluster, featuring: + +- **AI/ML Platform**: KServe inference services, RAG pipelines, voice assistants +- **Multi-GPU Support**: AMD ROCm (RDNA3/Strix Halo), NVIDIA CUDA, Intel Arc +- **GitOps**: Flux CD with SOPS encryption +- **Event-Driven**: NATS JetStream for real-time messaging +- **ML Workflows**: Kubeflow Pipelines + Argo Workflows + +## ๐Ÿ–ฅ๏ธ Cluster Overview + +| Node | Role | Hardware | GPU | +|------|------|----------|-----| +| storm | Control Plane | Intel 13th Gen | Integrated | +| bruenor | Control Plane | Intel 13th Gen | Integrated | +| catti | Control Plane | Intel 13th Gen | Integrated | +| elminster | Worker | NVIDIA RTX 2070 | 8GB CUDA | +| khelben | Worker (vLLM) | AMD Strix Halo | 64GB Unified | +| drizzt | Worker | AMD Radeon 680M | 12GB RDNA2 | +| danilo | Worker | Intel Core Ultra 9 | Intel Arc | + +## ๐Ÿš€ Quick Start + +### View Current Cluster State + +```bash +# Get 
node status +kubectl get nodes -o wide + +# View AI/ML workloads +kubectl get pods -n ai-ml + +# Check KServe inference services +kubectl get inferenceservices -n ai-ml +``` + +### Key Endpoints + +| Service | URL | Purpose | +|---------|-----|---------| +| Kubeflow | `kubeflow.lab.daviestechlabs.io` | ML Pipeline UI | +| Companions | `companions-chat.lab.daviestechlabs.io` | AI Chat Interface | +| Voice | `voice.lab.daviestechlabs.io` | Voice Assistant | +| Gitea | `git.daviestechlabs.io` | Self-hosted Git | + +## ๐Ÿ“‚ Repository Structure + +``` +homelab-design/ +โ”œโ”€โ”€ README.md # This file +โ”œโ”€โ”€ AGENT-ONBOARDING.md # AI agent quick-start +โ”œโ”€โ”€ ARCHITECTURE.md # High-level system overview +โ”œโ”€โ”€ CONTEXT-DIAGRAM.mmd # C4 Level 1 (Mermaid) +โ”œโ”€โ”€ CONTAINER-DIAGRAM.mmd # C4 Level 2 +โ”œโ”€โ”€ TECH-STACK.md # Complete tech stack +โ”œโ”€โ”€ DOMAIN-MODEL.md # Core entities +โ”œโ”€โ”€ CODING-CONVENTIONS.md # Patterns & practices +โ”œโ”€โ”€ GLOSSARY.md # Terminology +โ”œโ”€โ”€ decisions/ # ADRs +โ”‚ โ”œโ”€โ”€ 0000-template.md +โ”‚ โ”œโ”€โ”€ 0001-record-architecture-decisions.md +โ”‚ โ”œโ”€โ”€ 0002-use-talos-linux.md +โ”‚ โ””โ”€โ”€ ... +โ”œโ”€โ”€ specs/ # Feature specifications +โ””โ”€โ”€ diagrams/ # Additional diagrams +``` + +## ๐Ÿ”— Related Repositories + +| Repository | Purpose | +|------------|---------| +| [homelab-k8s2](https://github.com/Billy-Davies-2/homelab-k8s2) | Kubernetes manifests, Flux GitOps | +| [llm-workflows](https://github.com/Billy-Davies-2/llm-workflows) | NATS handlers, Argo/KFP workflows | +| [companions-frontend](https://github.com/Billy-Davies-2/companions-frontend) | Go web server, HTMX frontend | + +## ๐Ÿ“ Contributing + +1. For architecture changes, create an ADR in `decisions/` +2. Update relevant documentation +3. 
Submit a PR with context + +--- + +*Last updated: 2026-02-01* diff --git a/TECH-STACK.md b/TECH-STACK.md new file mode 100644 index 0000000..bbb1de8 --- /dev/null +++ b/TECH-STACK.md @@ -0,0 +1,271 @@ +# ๐Ÿ› ๏ธ Technology Stack + +> **Complete inventory of technologies used in the DaviesTechLabs homelab** + +## Platform Layer + +### Operating System + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Talos Linux](https://talos.dev) | v1.12.1 | Immutable, API-driven Kubernetes OS | +| Kernel | 6.18.2-talos | Linux kernel with GPU drivers | + +### Container Orchestration + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Kubernetes](https://kubernetes.io) | v1.35.0 | Container orchestration | +| [containerd](https://containerd.io) | 2.1.6 | Container runtime | +| [Cilium](https://cilium.io) | Latest | CNI, network policies, eBPF | + +### GitOps + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Flux CD](https://fluxcd.io) | v2 | GitOps continuous delivery | +| [SOPS](https://github.com/getsops/sops) | Latest | Secret encryption | +| [Age](https://github.com/FiloSottile/age) | Latest | Encryption key management | + +--- + +## AI/ML Layer + +### Inference Engines + +| Service | Framework | GPU | Model Type | +|---------|-----------|-----|------------| +| [vLLM](https://vllm.ai) | ROCm | AMD Strix Halo | Large Language Models | +| [faster-whisper](https://github.com/guillaumekln/faster-whisper) | CUDA | NVIDIA RTX 2070 | Speech-to-Text | +| [XTTS](https://github.com/coqui-ai/TTS) | CUDA | NVIDIA RTX 2070 | Text-to-Speech | +| [BGE Embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) | ROCm | AMD Radeon 680M | Text Embeddings | +| [BGE Reranker](https://huggingface.co/BAAI/bge-reranker-large) | Intel | Intel Arc | Document Reranking | + +### ML Serving + +| Component | Version | Purpose | +|-----------|---------|---------| +| [KServe](https://kserve.github.io) | v0.12+ | Model serving 
framework | +| [Ray Serve](https://ray.io/serve) | 2.53.0 | Unified inference endpoints | + +### ML Workflows + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Kubeflow Pipelines](https://kubeflow.org) | 2.15.0 | ML pipeline orchestration | +| [Argo Workflows](https://argoproj.github.io/workflows) | v3.7.8 | DAG-based workflows | +| [Argo Events](https://argoproj.github.io/events) | Latest | Event-driven triggers | +| [MLflow](https://mlflow.org) | 3.7.0 | Experiment tracking, model registry | + +### GPU Scheduling + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Volcano](https://volcano.sh) | Latest | GPU-aware scheduling | +| AMD GPU Device Plugin | v1.4.1 | ROCm GPU allocation | +| NVIDIA Device Plugin | Latest | CUDA GPU allocation | +| [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | v0.18.2 | Hardware detection | + +--- + +## Data Layer + +### Databases + +| Component | Version | Purpose | +|-----------|---------|---------| +| [CloudNative-PG](https://cloudnative-pg.io) | 16.11 | PostgreSQL for metadata | +| [Milvus](https://milvus.io) | Latest | Vector database for RAG | +| [ClickHouse](https://clickhouse.com) | Latest | Analytics, access logs | +| [Valkey](https://valkey.io) | Latest | Redis-compatible cache | + +### Object Storage + +| Component | Version | Purpose | +|-----------|---------|---------| +| [MinIO](https://min.io) | Latest | S3-compatible storage | +| [Longhorn](https://longhorn.io) | v1.10.1 | Distributed block storage | +| NFS CSI Driver | Latest | Shared filesystem | + +### Messaging + +| Component | Version | Purpose | +|-----------|---------|---------| +| [NATS](https://nats.io) | Latest | Message bus | +| NATS JetStream | Built-in | Persistent streaming | + +### Data Processing + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Apache Spark](https://spark.apache.org) | Latest | Batch analytics | +| [Apache 
Flink](https://flink.apache.org) | Latest | Stream processing | +| [Apache Iceberg](https://iceberg.apache.org) | Latest | Table format | +| [Nessie](https://projectnessie.org) | Latest | Data catalog | +| [Trino](https://trino.io) | 479 | SQL query engine | + +--- + +## Application Layer + +### Web Frameworks + +| Application | Language | Framework | Purpose | +|-------------|----------|-----------|---------| +| Companions | Go | net/http + HTMX | AI chat interface | +| Voice WebApp | Python | Gradio | Voice assistant UI | +| Various handlers | Python | asyncio + nats.py | NATS event handlers | + +### Frontend + +| Technology | Purpose | +|------------|---------| +| [HTMX](https://htmx.org) | Dynamic HTML updates | +| [Alpine.js](https://alpinejs.dev) | Lightweight reactivity | +| [VRM](https://vrm.dev) | 3D avatar rendering | + +--- + +## Networking Layer + +### Ingress + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Envoy Gateway](https://gateway.envoyproxy.io) | v1.6.3 | Gateway API implementation | +| [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-apps) | Latest | Cloudflare tunnel | + +### DNS & Certificates + +| Component | Version | Purpose | +|-----------|---------|---------| +| [external-dns](https://github.com/kubernetes-sigs/external-dns) | Latest | Automatic DNS management | +| [cert-manager](https://cert-manager.io) | Latest | TLS certificate automation | + +### Service Mesh + +| Component | Purpose | +|-----------|---------| +| [Spegel](https://github.com/spegel-org/spegel) | P2P container image distribution | + +--- + +## Security Layer + +### Identity & Access + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Authentik](https://goauthentik.io) | 2025.12.1 | Identity provider, SSO | +| [Vault](https://vaultproject.io) | 1.21.2 | Secret management | +| [External Secrets Operator](https://external-secrets.io) | v1.3.1 | Kubernetes secrets sync | + +### Runtime 
Security + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Falco](https://falco.org) | 0.42.1 | Runtime threat detection | +| Cilium Network Policies | Built-in | Network segmentation | + +### Backup + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Velero](https://velero.io) | v1.17.1 | Cluster backup/restore | + +--- + +## Observability Layer + +### Metrics + +| Component | Purpose | +|-----------|---------| +| [Prometheus](https://prometheus.io) | Metrics collection | +| [Grafana](https://grafana.com) | Dashboards & visualization | + +### Logging + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Grafana Alloy](https://grafana.com/oss/alloy) | v1.12.0 | Log collection | +| [Loki](https://grafana.com/oss/loki) | Latest | Log aggregation | + +### Tracing + +| Component | Purpose | +|-----------|---------| +| [OpenTelemetry Collector](https://opentelemetry.io) | Trace collection | +| Tempo/Jaeger | Trace storage & query | + +--- + +## Development Tools + +### Local Development + +| Tool | Purpose | +|------|---------| +| [mise](https://mise.jdx.dev) | Tool version management | +| [Task](https://taskfile.dev) | Task runner (Taskfile.yaml) | +| [flux-local](https://github.com/allenporter/flux-local) | Local Flux testing | + +### CI/CD + +| Tool | Purpose | +|------|---------| +| GitHub Actions | CI/CD pipelines | +| [Renovate](https://renovatebot.com) | Dependency updates | + +### Image Building + +| Tool | Purpose | +|------|---------| +| Docker | Container builds | +| GHCR | Container registry | + +--- + +## Media & Entertainment + +| Component | Version | Purpose | +|-----------|---------|---------| +| [Jellyfin](https://jellyfin.org) | 10.11.5 | Media server | +| [Nextcloud](https://nextcloud.com) | 32.0.5 | File sync & share | +| Prowlarr, Bazarr, etc. 
| Various | *arr stack | +| [Kasm](https://kasmweb.com) | 1.18.1 | Browser isolation | + +--- + +## Python Dependencies (llm-workflows) + +```text +# Core +nats-py>=2.7.0 # NATS client +msgpack>=1.0.0 # Binary serialization +aiohttp>=3.9.0 # HTTP client + +# ML/AI +pymilvus>=2.4.0 # Milvus client +sentence-transformers # Embeddings +openai>=1.0.0 # vLLM OpenAI API + +# Kubeflow +kfp>=2.12.1 # Pipeline SDK +``` + +--- + +## Version Pinning Strategy + +| Component Type | Strategy | +|----------------|----------| +| Base images | Pin major.minor | +| Helm charts | Pin exact version | +| Python packages | Pin minimum version | +| System extensions | Pin via Talos schematic | + +## Related Documents + +- [ARCHITECTURE.md](ARCHITECTURE.md) - How components connect +- [decisions/](decisions/) - Why we chose specific technologies diff --git a/decisions/0000-template.md b/decisions/0000-template.md new file mode 100644 index 0000000..0566a72 --- /dev/null +++ b/decisions/0000-template.md @@ -0,0 +1,71 @@ +# [short title of solved problem and solution] + +* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)] +* Date: YYYY-MM-DD +* Deciders: [list of people involved in decision] +* Technical Story: [description | ticket/issue URL] + +## Context and Problem Statement + +[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.] + +## Decision Drivers + +* [driver 1, e.g., a force, facing concern, โ€ฆ] +* [driver 2, e.g., a force, facing concern, โ€ฆ] +* โ€ฆ + +## Considered Options + +* [option 1] +* [option 2] +* [option 3] +* โ€ฆ + +## Decision Outcome + +Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | โ€ฆ | comes out best (see below)]. 
+ +### Positive Consequences + +* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, โ€ฆ] +* โ€ฆ + +### Negative Consequences + +* [e.g., compromising quality attribute, follow-up decisions required, โ€ฆ] +* โ€ฆ + +## Pros and Cons of the Options + +### [option 1] + +[example | description | pointer to more information | โ€ฆ] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* โ€ฆ + +### [option 2] + +[example | description | pointer to more information | โ€ฆ] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* โ€ฆ + +### [option 3] + +[example | description | pointer to more information | โ€ฆ] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* โ€ฆ + +## Links + +* [Link type] [Link to ADR] +* โ€ฆ diff --git a/decisions/0001-record-architecture-decisions.md b/decisions/0001-record-architecture-decisions.md new file mode 100644 index 0000000..6a48c48 --- /dev/null +++ b/decisions/0001-record-architecture-decisions.md @@ -0,0 +1,79 @@ +# Record Architecture Decisions + +* Status: accepted +* Date: 2025-11-30 +* Deciders: Billy Davies +* Technical Story: Initial setup of homelab documentation + +## Context and Problem Statement + +As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult. 
+ +## Decision Drivers + +* Need to preserve context for why decisions were made +* Enable future maintainers (including AI agents) to understand the system +* Provide a structured way to evaluate alternatives +* Support the wiki/design process for iterative improvements + +## Considered Options + +* Informal documentation in README files +* Wiki pages without structure +* Architecture Decision Records (ADRs) +* No documentation (rely on code) + +## Decision Outcome + +Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions. + +### Positive Consequences + +* Clear historical record of decisions +* Structured format makes decisions searchable +* Forces consideration of alternatives +* Git-versioned alongside code +* AI agents can parse and understand decisions + +### Negative Consequences + +* Requires discipline to create ADRs +* May accumulate outdated decisions over time +* Additional overhead for simple decisions + +## Pros and Cons of the Options + +### Informal README documentation + +* Good, because low friction +* Good, because close to code +* Bad, because no structure for alternatives +* Bad, because decisions get buried in prose + +### Wiki pages + +* Good, because easy to edit +* Good, because supports rich formatting +* Bad, because separate from code repository +* Bad, because no enforced structure + +### ADRs + +* Good, because structured format +* Good, because version controlled +* Good, because captures alternatives considered +* Good, because industry-standard practice +* Bad, because requires creating new files +* Bad, because may seem bureaucratic for small decisions + +### No documentation + +* Good, because no overhead +* Bad, because context is lost +* Bad, because makes onboarding difficult +* Bad, because risky for future changes + +## Links + +* Based on [MADR 
template](https://adr.github.io/madr/) +* [ADR GitHub organization](https://adr.github.io/) diff --git a/decisions/0002-use-talos-linux.md b/decisions/0002-use-talos-linux.md new file mode 100644 index 0000000..f5bd437 --- /dev/null +++ b/decisions/0002-use-talos-linux.md @@ -0,0 +1,97 @@ +# Use Talos Linux for Kubernetes Nodes + +* Status: accepted +* Date: 2025-11-30 +* Deciders: Billy Davies +* Technical Story: Selecting OS for bare-metal Kubernetes cluster + +## Context and Problem Statement + +We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, Intel). + +## Decision Drivers + +* Security-first design (immutable, minimal) +* API-driven management (no SSH) +* Support for various GPU drivers +* Kubernetes-native focus +* Community support and updates +* Ease of upgrades + +## Considered Options + +* Ubuntu Server with kubeadm +* Flatcar Container Linux +* Talos Linux +* k3OS (discontinued) +* Rocky Linux with RKE2 + +## Decision Outcome + +Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations. 
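
For illustration, the schematic mechanism looks roughly like the fragment below. This is a hedged sketch, not this cluster's actual schematic; extension names vary by Talos release and should be verified against factory.talos.dev before use.

```yaml
# Illustrative Image Factory schematic (NOT the real one for this cluster).
# System extensions are baked into the immutable image at build time,
# rather than installed on a running node. Extension names are examples;
# verify against factory.talos.dev for the Talos version in use.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/amd-ucode
      - siderolabs/nonfree-kmod-nvidia-production
```

Uploading this to the Image Factory yields a schematic ID, which is then referenced in the node install image so every upgrade carries the same drivers.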
+ +### Positive Consequences + +* Immutable root filesystem prevents drift +* No SSH reduces attack vectors +* API-driven management integrates well with GitOps +* Schematic system allows custom kernel modules (GPU drivers) +* Consistent configuration across all nodes +* Automatic updates with minimal disruption + +### Negative Consequences + +* Learning curve for API-driven management +* Debugging requires different approaches (no SSH) +* Custom extensions require schematic IDs +* Less flexibility for non-Kubernetes workloads + +## Pros and Cons of the Options + +### Ubuntu Server with kubeadm + +* Good, because familiar +* Good, because extensive package availability +* Good, because easy debugging via SSH +* Bad, because mutable system leads to drift +* Bad, because large attack surface +* Bad, because manual package management + +### Flatcar Container Linux + +* Good, because immutable +* Good, because auto-updates +* Good, because container-focused +* Bad, because less Kubernetes-specific +* Bad, because smaller community than Talos +* Bad, because GPU driver setup more complex + +### Talos Linux + +* Good, because purpose-built for Kubernetes +* Good, because immutable and minimal +* Good, because API-driven (no SSH) +* Good, because excellent Kubernetes integration +* Good, because active development and community +* Good, because schematic system for GPU drivers +* Bad, because learning curve +* Bad, because no traditional debugging + +### k3OS + +* Good, because simple +* Bad, because discontinued + +### Rocky Linux with RKE2 + +* Good, because enterprise-like +* Good, because familiar Linux experience +* Bad, because mutable system +* Bad, because more operational overhead +* Bad, because larger attack surface + +## Links + +* [Talos Linux](https://talos.dev) +* [Talos Image Factory](https://factory.talos.dev) +* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics diff --git a/decisions/0003-use-nats-for-messaging.md 
b/decisions/0003-use-nats-for-messaging.md new file mode 100644 index 0000000..7ad495a --- /dev/null +++ b/decisions/0003-use-nats-for-messaging.md @@ -0,0 +1,112 @@ +# Use NATS for AI/ML Messaging + +* Status: accepted +* Date: 2025-12-01 +* Deciders: Billy Davies +* Technical Story: Selecting message bus for AI service orchestration + +## Context and Problem Statement + +The AI/ML platform requires a messaging system for: +- Real-time chat message routing +- Voice request/response streaming +- Pipeline triggers and status updates +- Event-driven workflow orchestration + +We need a messaging system that handles both ephemeral real-time messages and persistent streams. + +## Decision Drivers + +* Low latency for real-time chat/voice +* Persistence for audit and replay +* Simple operations for homelab +* Support for request-reply pattern +* Wildcard subscriptions for routing +* Binary message support (audio data) + +## Considered Options + +* Apache Kafka +* RabbitMQ +* Redis Pub/Sub + Streams +* NATS with JetStream +* Apache Pulsar + +## Decision Outcome + +Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives. 
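
The wildcard-subscription driver is worth making concrete. The sketch below is a toy model of NATS subject matching (`*` matches exactly one token, `>` matches one or more trailing tokens) for intuition only; it is not the client library's implementation, and real handlers subscribe via nats-py or nats.go.

```python
# Toy model of NATS subject matching, for intuition only (the real rules
# live in the NATS server). '*' matches exactly one dot-separated token;
# '>' matches one or more trailing tokens.
def subject_matches(pattern: str, subject: str) -> bool:
    pat = pattern.split(".")
    sub = subject.split(".")
    for i, token in enumerate(pat):
        if token == ">":
            return len(sub) > i          # '>' needs at least one more token
        if i >= len(sub):
            return False                 # subject ran out of tokens
        if token != "*" and token != sub[i]:
            return False
    return len(pat) == len(sub)          # no unmatched trailing tokens

# A handler bound to "ai.chat.*" sees chat requests but not voice traffic:
print(subject_matches("ai.chat.*", "ai.chat.requests"))      # True
print(subject_matches("ai.chat.*", "ai.voice.stt"))          # False
print(subject_matches("ai.>", "ai.pipeline.status.run-1"))   # True
```

This is why the `ai.chat.*` / `ai.voice.*` / `ai.pipeline.*` subject layout lets each handler subscribe to exactly its slice of traffic without central routing config.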
+ +### Positive Consequences + +* Sub-millisecond latency for real-time messages +* JetStream provides persistence when needed +* Simple deployment (single binary) +* Excellent Kubernetes integration +* Request-reply pattern built-in +* Wildcard subscriptions for flexible routing +* Low resource footprint + +### Negative Consequences + +* Smaller ecosystem than Kafka +* JetStream less mature than Kafka Streams +* No built-in schema registry +* Smaller community than RabbitMQ + +## Pros and Cons of the Options + +### Apache Kafka + +* Good, because industry standard for streaming +* Good, because rich ecosystem (Kafka Streams, Connect) +* Good, because schema registry +* Good, because excellent for high throughput +* Bad, because operationally complex (ZooKeeper/KRaft) +* Bad, because high resource requirements +* Bad, because overkill for homelab scale +* Bad, because higher latency for real-time messages + +### RabbitMQ + +* Good, because mature and stable +* Good, because flexible routing +* Good, because good management UI +* Bad, because AMQP protocol overhead +* Bad, because not designed for streaming +* Bad, because more complex clustering + +### Redis Pub/Sub + Streams + +* Good, because simple +* Good, because Redis may already be in use +* Good, because low latency +* Bad, because pub/sub not persistent +* Bad, because streams API less intuitive +* Bad, because not primary purpose of Redis + +### NATS with JetStream + +* Good, because extremely low latency +* Good, because simple operations +* Good, because both pub/sub and persistence +* Good, because request-reply built-in +* Good, because wildcard subscriptions +* Good, because low resource usage +* Good, because excellent Go/Python clients +* Bad, because smaller ecosystem +* Bad, because JetStream newer than Kafka + +### Apache Pulsar + +* Good, because unified messaging + streaming +* Good, because multi-tenancy +* Good, because geo-replication +* Bad, because complex architecture +* Bad, because high 
resource requirements +* Bad, because smaller community + +## Links + +* [NATS.io](https://nats.io) +* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream) +* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format diff --git a/decisions/0004-use-messagepack-for-nats.md b/decisions/0004-use-messagepack-for-nats.md new file mode 100644 index 0000000..09b4cf9 --- /dev/null +++ b/decisions/0004-use-messagepack-for-nats.md @@ -0,0 +1,137 @@ +# Use MessagePack for NATS Messages + +* Status: accepted +* Date: 2025-12-01 +* Deciders: Billy Davies +* Technical Story: Selecting serialization format for NATS messages + +## Context and Problem Statement + +NATS messages in the AI platform carry various payloads: +- Text chat messages (small) +- Voice audio data (potentially large, base64 or binary) +- Streaming response chunks +- Pipeline parameters + +We need a serialization format that handles both text and binary efficiently. + +## Decision Drivers + +* Efficient binary data handling (audio) +* Compact message size +* Fast serialization/deserialization +* Cross-language support (Python, Go) +* Debugging ability +* Schema flexibility + +## Considered Options + +* JSON +* Protocol Buffers (protobuf) +* MessagePack (msgpack) +* CBOR +* Avro + +## Decision Outcome + +Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility. 
+ +### Positive Consequences + +* Native binary support (no base64 overhead for audio) +* 20-50% smaller than JSON for typical messages +* Faster serialization than JSON +* No schema compilation step +* Easy debugging (can pretty-print like JSON) +* Excellent Python and Go libraries + +### Negative Consequences + +* Less human-readable than JSON when raw +* No built-in schema validation +* Slightly less common than JSON + +## Pros and Cons of the Options + +### JSON + +* Good, because human-readable +* Good, because universal support +* Good, because no setup required +* Bad, because binary data requires base64 (33% overhead) +* Bad, because larger message sizes +* Bad, because slower parsing + +### Protocol Buffers + +* Good, because very compact +* Good, because fast +* Good, because schema validation +* Good, because cross-language +* Bad, because requires schema definition +* Bad, because compilation step +* Bad, because less flexible for evolving schemas +* Bad, because overkill for simple messages + +### MessagePack + +* Good, because binary-efficient +* Good, because JSON-like simplicity +* Good, because no schema required +* Good, because excellent library support +* Good, because can include raw bytes +* Bad, because not human-readable raw +* Bad, because no schema validation + +### CBOR + +* Good, because binary-efficient +* Good, because IETF standard +* Good, because schema-less +* Bad, because less common libraries +* Bad, because smaller community +* Bad, because similar to msgpack with less adoption + +### Avro + +* Good, because schema evolution +* Good, because compact +* Good, because schema registry integration +* Bad, because requires schema +* Bad, because more complex setup +* Bad, because Java-centric ecosystem + +## Implementation Notes + +```python +# Python usage +import msgpack + +# Serialize +data = { + "user_id": "user-123", + "audio": audio_bytes, # Raw bytes, no base64 + "premium": True +} +payload = msgpack.packb(data) + +# 
Deserialize +data = msgpack.unpackb(payload, raw=False) +``` + +```go +// Go usage +import "github.com/vmihailenco/msgpack/v5" + +type Message struct { + UserID string `msgpack:"user_id"` + Audio []byte `msgpack:"audio"` +} +``` + +## Links + +* [MessagePack Specification](https://msgpack.org) +* [msgpack-python](https://github.com/msgpack/msgpack-python) +* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice +* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md) diff --git a/decisions/0005-multi-gpu-strategy.md b/decisions/0005-multi-gpu-strategy.md new file mode 100644 index 0000000..54c782e --- /dev/null +++ b/decisions/0005-multi-gpu-strategy.md @@ -0,0 +1,145 @@ +# Multi-GPU Heterogeneous Strategy + +* Status: accepted +* Date: 2025-12-01 +* Deciders: Billy Davies +* Technical Story: GPU allocation strategy for AI workloads + +## Context and Problem Statement + +The homelab has diverse GPU hardware: +- AMD Strix Halo (64GB unified memory) - khelben +- NVIDIA RTX 2070 (8GB VRAM) - elminster +- AMD Radeon 680M (12GB VRAM) - drizzt +- Intel Arc (integrated) - danilo + +Different AI workloads have different requirements. How do we allocate GPUs effectively? + +## Decision Drivers + +* Maximize utilization of all GPUs +* Match workloads to appropriate hardware +* Support concurrent inference services +* Enable fractional GPU sharing where appropriate +* Minimize cross-vendor complexity + +## Considered Options + +* Single GPU vendor only +* All workloads on largest GPU +* Workload-specific GPU allocation +* Dynamic GPU scheduling (MIG/fractional) + +## Decision Outcome + +Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements. 
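
The pinning decisions in the allocation table can be rationalized as a best-fit exercise: give each model the smallest GPU whose memory it fits, reserving the 64GB node for the LLM. The sketch below only illustrates that reasoning; memory figures are approximate (danilo's is a placeholder), and actual placement is static, via the taints/tolerations shown under Implementation, not code.

```python
# Illustration of the "smallest GPU that fits" reasoning behind the static
# allocation table. Memory figures are approximate (danilo's is a
# placeholder); real placement uses node taints/tolerations, not a scheduler
# like this.
GPU_MEM_GB = {"khelben": 64, "elminster": 8, "drizzt": 12, "danilo": 4}

def best_fit(model_mem_gb: float) -> str:
    fits = {node: mem for node, mem in GPU_MEM_GB.items() if mem >= model_mem_gb}
    if not fits:
        raise ValueError("no GPU large enough")
    return min(fits, key=fits.get)  # tightest fit, leaving big GPUs free

print(best_fit(40))  # khelben: only node with enough unified memory
print(best_fit(6))   # elminster: smallest GPU that still fits
```

Applied across the workloads, this recovers the table: vLLM lands on khelben, the mid-size STT/TTS models on elminster, and lighter batch jobs on the remaining GPUs.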
+ +### Allocation Strategy + +| Workload | GPU | Node | Rationale | +|----------|-----|------|-----------| +| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory | +| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory | +| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper | +| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing | +| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization | + +### Positive Consequences + +* Each workload gets optimal hardware +* No GPU memory contention for LLM +* NVIDIA services can share via time-slicing +* Cost-effective use of varied hardware +* Clear ownership and debugging + +### Negative Consequences + +* More complex scheduling (node taints/tolerations) +* Less flexibility for workload migration +* Must maintain multiple GPU driver stacks +* Some GPUs underutilized at times + +## Implementation + +### Node Taints + +```yaml +# khelben - dedicated vLLM node +nodeTaints: + dedicated: "vllm:NoSchedule" +``` + +### Pod Tolerations and Node Affinity + +```yaml +# vLLM deployment +spec: + tolerations: + - key: "dedicated" + operator: "Equal" + value: "vllm" + effect: "NoSchedule" + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: ["khelben"] +``` + +### Resource Limits + +```yaml +# NVIDIA GPU (elminster) +resources: + limits: + nvidia.com/gpu: 1 + +# AMD GPU (drizzt, khelben) +resources: + limits: + amd.com/gpu: 1 +``` + +## Pros and Cons of the Options + +### Single GPU vendor only + +* Good, because simpler driver management +* Good, because consistent tooling +* Bad, because wastes existing hardware +* Bad, because higher cost for new hardware + +### All workloads on largest GPU + +* Good, because simple scheduling +* Good, because unified memory benefits +* 
Bad, because memory contention +* Bad, because single point of failure +* Bad, because wastes other GPUs + +### Workload-specific allocation (chosen) + +* Good, because optimal hardware matching +* Good, because uses all available GPUs +* Good, because clear resource boundaries +* Good, because parallel inference +* Bad, because more complex configuration +* Bad, because multiple driver stacks + +### Dynamic GPU scheduling + +* Good, because flexible +* Good, because maximizes utilization +* Bad, because complex to implement +* Bad, because MIG not available on consumer GPUs +* Bad, because cross-vendor scheduling immature + +## Links + +* [Volcano Scheduler](https://volcano.sh) +* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin) +* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) +* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics diff --git a/decisions/0006-gitops-with-flux.md b/decisions/0006-gitops-with-flux.md new file mode 100644 index 0000000..4c55e39 --- /dev/null +++ b/decisions/0006-gitops-with-flux.md @@ -0,0 +1,140 @@ +# GitOps with Flux CD + +* Status: accepted +* Date: 2025-11-30 +* Deciders: Billy Davies +* Technical Story: Implementing GitOps for cluster management + +## Context and Problem Statement + +Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time. 
+ +## Decision Drivers + +* Infrastructure as Code (IaC) principles +* Audit trail for all changes +* Self-healing cluster state +* Multi-repository support +* Secret encryption integration +* Active community and maintenance + +## Considered Options + +* Manual kubectl apply +* ArgoCD +* Flux CD +* Rancher Fleet +* Pulumi/Terraform for Kubernetes + +## Decision Outcome + +Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem. + +### Positive Consequences + +* Git is single source of truth +* Automatic drift detection and correction +* Native SOPS/Age secret encryption +* Multi-repository support (homelab-k8s2 + llm-workflows) +* Helm and Kustomize native support +* Webhook-free sync (pull-based) + +### Negative Consequences + +* No built-in UI (use CLI or third-party) +* Learning curve for CRD-based configuration +* Debugging requires understanding Flux controllers + +## Configuration + +### Repository Structure + +``` +homelab-k8s2/ +โ”œโ”€โ”€ kubernetes/ +โ”‚ โ”œโ”€โ”€ flux/ # Flux system config +โ”‚ โ”‚ โ”œโ”€โ”€ config/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ cluster.yaml +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ secrets.yaml # SOPS encrypted +โ”‚ โ”‚ โ””โ”€โ”€ repositories/ +โ”‚ โ”‚ โ”œโ”€โ”€ helm/ # HelmRepositories +โ”‚ โ”‚ โ””โ”€โ”€ git/ # GitRepositories +โ”‚ โ””โ”€โ”€ apps/ # Application Kustomizations +``` + +### Multi-Repository Sync + +```yaml +# GitRepository for llm-workflows +apiVersion: source.toolkit.fluxcd.io/v1 +kind: GitRepository +metadata: + name: llm-workflows + namespace: flux-system +spec: + url: ssh://git@github.com/Billy-Davies-2/llm-workflows + ref: + branch: main + secretRef: + name: github-deploy-key +``` + +### SOPS Integration + +```yaml +# .sops.yaml +creation_rules: + - path_regex: .*\.sops\.yaml$ + age: >- + age1... 
# Public key +``` + +## Pros and Cons of the Options + +### Manual kubectl apply + +* Good, because simple +* Good, because no setup +* Bad, because no audit trail +* Bad, because no drift detection +* Bad, because not reproducible + +### ArgoCD + +* Good, because great UI +* Good, because app-of-apps pattern +* Good, because large community +* Bad, because heavier resource usage +* Bad, because webhook-dependent sync +* Bad, because SOPS requires plugins + +### Flux CD + +* Good, because lightweight +* Good, because pull-based (no webhooks) +* Good, because native SOPS support +* Good, because multi-source/multi-tenant +* Good, because Kubernetes-native CRDs +* Bad, because no built-in UI +* Bad, because CRD learning curve + +### Rancher Fleet + +* Good, because integrated with Rancher +* Good, because multi-cluster +* Bad, because Rancher ecosystem lock-in +* Bad, because smaller community + +### Pulumi/Terraform + +* Good, because familiar IaC tools +* Good, because drift detection +* Bad, because not Kubernetes-native +* Bad, because requires state management +* Bad, because not continuous reconciliation + +## Links + +* [Flux CD](https://fluxcd.io) +* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/) +* [flux-local](https://github.com/allenporter/flux-local) - Local testing diff --git a/decisions/0007-use-kserve-for-inference.md b/decisions/0007-use-kserve-for-inference.md new file mode 100644 index 0000000..da9598f --- /dev/null +++ b/decisions/0007-use-kserve-for-inference.md @@ -0,0 +1,115 @@ +# Use KServe for ML Model Serving + +* Status: accepted +* Date: 2025-12-15 +* Deciders: Billy Davies +* Technical Story: Selecting model serving platform for inference services + +## Context and Problem Statement + +We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation. 
+ +## Decision Drivers + +* Standardized inference protocol (V2) +* Autoscaling based on load +* Traffic splitting for canary deployments +* Integration with Kubeflow ecosystem +* GPU resource management +* Health checks and readiness + +## Considered Options + +* Raw Kubernetes Deployments + Services +* KServe InferenceService +* Seldon Core +* BentoML +* Ray Serve only + +## Decision Outcome + +Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management. + +### Positive Consequences + +* Standardized V2 inference protocol +* Automatic scale-to-zero capability +* Canary/blue-green deployments +* Integration with Kubeflow UI +* Transformer/Explainer components +* GPU resource abstraction + +### Negative Consequences + +* Additional CRDs and operators +* Learning curve for InferenceService spec +* Some overhead for simple deployments +* Knative Serving dependency (optional) + +## Pros and Cons of the Options + +### Raw Kubernetes Deployments + +* Good, because simple +* Good, because full control +* Bad, because no autoscaling logic +* Bad, because manual service mesh +* Bad, because repetitive configuration + +### KServe InferenceService + +* Good, because standardized API +* Good, because autoscaling +* Good, because traffic management +* Good, because Kubeflow integration +* Bad, because operator complexity +* Bad, because Knative optional dependency + +### Seldon Core + +* Good, because mature +* Good, because A/B testing +* Good, because explainability +* Bad, because more complex than KServe +* Bad, because heavier resource usage + +### BentoML + +* Good, because developer-friendly +* Good, because packaging focused +* Bad, because less Kubernetes-native +* Bad, because smaller community + +### Ray Serve + +* Good, because unified compute +* Good, because Python-native +* Good, because fractional GPU +* Bad, because less standardized API +* Bad, 
because Ray cluster overhead + +## Current Configuration + +```yaml +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + name: whisper + namespace: ai-ml +spec: + predictor: + minReplicas: 1 + maxReplicas: 3 + containers: + - name: whisper + image: ghcr.io/org/whisper:latest + resources: + limits: + nvidia.com/gpu: 1 +``` + +## Links + +* [KServe](https://kserve.github.io) +* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/) +* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation diff --git a/decisions/0008-use-milvus-for-vectors.md b/decisions/0008-use-milvus-for-vectors.md new file mode 100644 index 0000000..5fab389 --- /dev/null +++ b/decisions/0008-use-milvus-for-vectors.md @@ -0,0 +1,107 @@ +# Use Milvus for Vector Storage + +* Status: accepted +* Date: 2025-12-15 +* Deciders: Billy Davies +* Technical Story: Selecting vector database for RAG system + +## Context and Problem Statement + +The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency. + +## Decision Drivers + +* Query performance (< 100ms for top-k search) +* Scalability to millions of vectors +* Kubernetes-native deployment +* Active development and community +* Support for metadata filtering +* Backup and restore capabilities + +## Considered Options + +* Milvus +* Pinecone (managed) +* Qdrant +* Weaviate +* pgvector (PostgreSQL extension) +* Chroma + +## Decision Outcome + +Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development. 
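
For intuition, the core operation Milvus accelerates is top-k similarity search. A naive brute-force baseline looks like the sketch below (illustrative only; Milvus replaces the linear scan with ANN indexes such as HNSW or IVF, which is what keeps millions of vectors queryable within the latency budget, and real queries go through pymilvus against a collection).

```python
import math

# Brute-force top-k cosine search: the O(n) baseline that a vector
# database replaces with approximate indexes (HNSW, IVF). Illustrative
# only; chunk IDs and vectors are made up.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, vectors, k=2):
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]

chunks = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [0.9, 0.1]}
print(top_k([1.0, 0.0], chunks))  # ['chunk-a', 'chunk-c']
```

RAG retrieval is exactly this shape: embed the query, fetch the top-k chunk IDs, then hand the chunk text to the LLM as context.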
+ +### Positive Consequences + +* High-performance similarity search +* Horizontal scalability +* Rich filtering and hybrid search +* Helm chart for Kubernetes +* Active CNCF sandbox project +* GPU acceleration available + +### Negative Consequences + +* Complex architecture (multiple components) +* Higher resource usage than simpler alternatives +* Requires object storage (MinIO) +* Learning curve for optimization + +## Pros and Cons of the Options + +### Milvus + +* Good, because production-proven at scale +* Good, because rich query API +* Good, because Kubernetes-native +* Good, because hybrid search (vector + scalar) +* Good, because CNCF project +* Bad, because complex architecture +* Bad, because higher resource usage + +### Pinecone + +* Good, because fully managed +* Good, because simple API +* Good, because reliable +* Bad, because external dependency +* Bad, because cost at scale +* Bad, because data sovereignty concerns + +### Qdrant + +* Good, because simpler than Milvus +* Good, because Rust performance +* Good, because good filtering +* Bad, because smaller community +* Bad, because fewer enterprise features + +### Weaviate + +* Good, because built-in vectorization +* Good, because GraphQL API +* Good, because modules system +* Bad, because more opinionated +* Bad, because schema requirements + +### pgvector + +* Good, because familiar PostgreSQL +* Good, because simple deployment +* Good, because ACID transactions +* Bad, because limited scale +* Bad, because slower for large datasets +* Bad, because no specialized optimizations + +### Chroma + +* Good, because simple +* Good, because embedded option +* Bad, because not production-ready at scale +* Bad, because limited features + +## Links + +* [Milvus](https://milvus.io) +* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm) +* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities diff --git a/decisions/0009-dual-workflow-engines.md 
b/decisions/0009-dual-workflow-engines.md new file mode 100644 index 0000000..94be24b --- /dev/null +++ b/decisions/0009-dual-workflow-engines.md @@ -0,0 +1,124 @@ +# Dual Workflow Engine Strategy (Argo + Kubeflow) + +* Status: accepted +* Date: 2026-01-15 +* Deciders: Billy Davies +* Technical Story: Selecting workflow orchestration for ML pipelines + +## Context and Problem Statement + +The AI platform needs workflow orchestration for: +- ML training pipelines with caching +- Document ingestion (batch) +- Complex DAG workflows (training โ†’ evaluation โ†’ deployment) +- Hybrid scenarios combining both + +Should we use one engine or leverage strengths of multiple? + +## Decision Drivers + +* ML-specific features (caching, lineage) +* Complex DAG support +* Kubernetes-native execution +* Visibility and debugging +* Community and ecosystem +* Integration with existing tools + +## Considered Options + +* Kubeflow Pipelines only +* Argo Workflows only +* Both engines with clear use cases +* Airflow on Kubernetes +* Prefect/Dagster + +## Decision Outcome + +Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration. 
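The routing rule can be stated as code. This helper is purely illustrative (the feature names and engine strings are hypothetical, not taken from `llm-workflows`): workflows that need ML-specific features go to Kubeflow Pipelines, everything else to Argo.

```python
# Illustrative sketch of the dual-engine routing rule; names are hypothetical.
ML_FEATURES = {"component-caching", "experiment-tracking", "metric-collection"}

def choose_engine(required_features: set) -> str:
    """Return the workflow engine for a pipeline's required features."""
    if required_features & ML_FEATURES:
        return "kubeflow"  # caching, lineage, experiment tracking
    return "argo"          # plain DAGs, fan-out, retries

print(choose_engine({"component-caching"}))   # kubeflow
print(choose_engine({"fan-out", "retries"}))  # argo
```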
+ +### Decision Matrix + +| Use Case | Engine | Reason | +|----------|--------|--------| +| ML training with caching | Kubeflow | Component caching, experiment tracking | +| Model evaluation | Kubeflow | Metric collection, comparison | +| Document ingestion | Argo | Simple DAG, no ML features needed | +| Batch inference | Argo | Parallelization, retries | +| Complex DAG with branching | Argo | Superior control flow | +| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps | + +### Positive Consequences + +* Best tool for each job +* ML pipelines get proper caching +* Complex workflows get better DAG support +* Can integrate via Argo Events +* Gradual migration possible + +### Negative Consequences + +* Two systems to maintain +* Team needs to learn both +* More complex debugging +* Integration overhead + +## Integration Architecture + +``` +NATS Event โ”€โ”€โ–บ Argo Events โ”€โ”€โ–บ Sensor โ”€โ”€โ”ฌโ”€โ”€โ–บ Argo Workflow + โ”‚ + โ””โ”€โ”€โ–บ Kubeflow Pipeline (via API) + + OR + +Argo Workflow โ”€โ”€โ–บ Step: kfp-trigger โ”€โ”€โ–บ Kubeflow Pipeline + (WorkflowTemplate) +``` + +## Pros and Cons of the Options + +### Kubeflow Pipelines only + +* Good, because ML-focused +* Good, because caching +* Good, because experiment tracking +* Bad, because limited DAG features +* Bad, because less flexible control flow + +### Argo Workflows only + +* Good, because powerful DAG +* Good, because flexible +* Good, because great debugging +* Bad, because no ML caching +* Bad, because no experiment tracking + +### Both engines (chosen) + +* Good, because best of both +* Good, because appropriate tool per job +* Good, because can integrate +* Bad, because operational complexity +* Bad, because learning two systems + +### Airflow + +* Good, because mature +* Good, because large community +* Bad, because Python-centric +* Bad, because not Kubernetes-native +* Bad, because no ML features + +### Prefect/Dagster + +* Good, because modern design +* Good, because Python-native 
+* Bad, because less Kubernetes-native +* Bad, because newer/less proven + +## Links + +* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/) +* [Argo Workflows](https://argoproj.github.io/argo-workflows/) +* [Argo Events](https://argoproj.github.io/argo-events/) +* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml) diff --git a/decisions/0010-use-envoy-gateway.md b/decisions/0010-use-envoy-gateway.md new file mode 100644 index 0000000..77ccdde --- /dev/null +++ b/decisions/0010-use-envoy-gateway.md @@ -0,0 +1,120 @@ +# Use Envoy Gateway for Ingress + +* Status: accepted +* Date: 2025-12-01 +* Deciders: Billy Davies +* Technical Story: Selecting ingress controller for cluster + +## Context and Problem Statement + +We need an ingress solution that supports: +- Gateway API (modern Kubernetes standard) +- gRPC for ML inference +- WebSocket for real-time chat/voice +- Header-based routing for A/B testing +- TLS termination + +## Decision Drivers + +* Gateway API support (HTTPRoute, GRPCRoute) +* WebSocket support +* gRPC support +* Performance at edge +* Active development +* Envoy ecosystem familiarity + +## Considered Options + +* NGINX Ingress Controller +* Traefik +* Envoy Gateway +* Istio Gateway +* Contour + +## Decision Outcome + +Chosen option: "Envoy Gateway", because it is the reference implementation of the Gateway API with the full Envoy feature set. 
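gRPC inference routing, one of the drivers above, takes the form of a GRPCRoute under Gateway API. A sketch only — the hostname and backend are placeholders, the gateway reference mirrors the HTTPRoute conventions used in this cluster, and `inference.GRPCInferenceService` is the KServe v2 gRPC service name:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: kserve-grpc
  namespace: ai-ml
spec:
  parentRefs:
  - name: eg-gateway
    namespace: network
  hostnames:
  - inference.lab.daviestechlabs.io   # placeholder hostname
  rules:
  - matches:
    - method:
        service: inference.GRPCInferenceService   # KServe v2 gRPC service
    backendRefs:
    - name: whisper-predictor   # placeholder backend
      port: 8081
```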
+ +### Positive Consequences + +* Native Gateway API support +* Full Envoy feature set +* WebSocket and gRPC native +* No Istio complexity +* CNCF graduated project (Envoy) +* Easy integration with observability + +### Negative Consequences + +* Newer than alternatives +* Less documentation than NGINX +* Envoy configuration learning curve + +## Pros and Cons of the Options + +### NGINX Ingress + +* Good, because mature +* Good, because well-documented +* Good, because familiar +* Bad, because limited Gateway API +* Bad, because commercial features gated + +### Traefik + +* Good, because auto-discovery +* Good, because good UI +* Good, because Let's Encrypt +* Bad, because Gateway API experimental +* Bad, because less gRPC focus + +### Envoy Gateway + +* Good, because Gateway API native +* Good, because full Envoy features +* Good, because extensible +* Good, because gRPC/WebSocket native +* Bad, because newer project +* Bad, because less community content + +### Istio Gateway + +* Good, because full mesh features +* Good, because Gateway API +* Bad, because overkill without mesh +* Bad, because resource heavy + +### Contour + +* Good, because Envoy-based +* Good, because lightweight +* Bad, because Gateway API evolving +* Bad, because smaller community + +## Configuration Example + +```yaml +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: companions-chat +spec: + parentRefs: + - name: eg-gateway + namespace: network + hostnames: + - companions-chat.lab.daviestechlabs.io + rules: + - matches: + - path: + type: PathPrefix + value: / + backendRefs: + - name: companions-chat + port: 8080 +``` + +## Links + +* [Envoy Gateway](https://gateway.envoyproxy.io) +* [Gateway API](https://gateway-api.sigs.k8s.io) diff --git a/diagrams/README.md b/diagrams/README.md new file mode 100644 index 0000000..e5e219f --- /dev/null +++ b/diagrams/README.md @@ -0,0 +1,35 @@ +# Diagrams + +This directory contains additional architecture diagrams beyond the main 
C4 diagrams. + +## Available Diagrams + +| File | Description | +|------|-------------| +| [gpu-allocation.mmd](gpu-allocation.mmd) | GPU workload distribution | +| [data-flow-chat.mmd](data-flow-chat.mmd) | Chat request data flow | +| [data-flow-voice.mmd](data-flow-voice.mmd) | Voice request data flow | + +## Rendering Diagrams + +### VS Code + +Install the "Markdown Preview Mermaid Support" extension. + +### CLI + +```bash +# Using mmdc (Mermaid CLI) +npx @mermaid-js/mermaid-cli -i diagram.mmd -o diagram.png +``` + +### Online + +Use the [Mermaid Live Editor](https://mermaid.live). + +## Diagram Conventions + +1. Use `.mmd` extension for Mermaid diagrams +2. Include title as comment at top of file +3. Use consistent styling classes +4. Keep diagrams focused (one concept per diagram) diff --git a/diagrams/data-flow-chat.mmd b/diagrams/data-flow-chat.mmd new file mode 100644 index 0000000..67fd4ea --- /dev/null +++ b/diagrams/data-flow-chat.mmd @@ -0,0 +1,51 @@ +%% Chat Request Data Flow +%% Sequence diagram showing chat message processing + +sequenceDiagram + autonumber + participant U as User + participant W as WebApp<br/>(companions) + participant N as NATS + participant C as Chat Handler + participant V as Valkey<br/>(Cache) + participant E as BGE Embeddings + participant M as Milvus + participant R as Reranker + participant L as vLLM + + U->>W: Send message + W->>N: Publish ai.chat.user.{id}.message + N->>C: Deliver message + + C->>V: Get session history + V-->>C: Previous messages + + alt RAG Enabled + C->>E: Generate query embedding + E-->>C: Query vector + C->>M: Search similar chunks + M-->>C: Top-K chunks + + opt Reranker Enabled + C->>R: Rerank chunks + R-->>C: Reordered chunks + end + end + + C->>L: LLM inference (context + query) + + alt Streaming Enabled + loop For each token + L-->>C: Token + C->>N: Publish ai.chat.response.stream.{id} + N-->>W: Deliver chunk + W-->>U: Display token + end + else Non-streaming + L-->>C: Full response + C->>N: Publish ai.chat.response.{id} + N-->>W: Deliver response + W-->>U: Display response + end + + C->>V: Save to session history diff --git a/diagrams/data-flow-voice.mmd b/diagrams/data-flow-voice.mmd new file mode 100644 index 0000000..872b4f9 --- /dev/null +++ b/diagrams/data-flow-voice.mmd @@ -0,0 +1,46 @@ +%% Voice Request Data Flow +%% Sequence diagram showing voice assistant processing + +sequenceDiagram + autonumber + participant U as User + participant W as Voice WebApp + participant N as NATS + participant VA as Voice Assistant + participant STT as Whisper<br/>(STT) + participant E as BGE Embeddings + participant M as Milvus + participant R as Reranker + participant L as vLLM + participant TTS as XTTS<br/>(TTS) + + U->>W: Record audio + W->>N: Publish ai.voice.user.{id}.request<br/>(msgpack with audio bytes) + N->>VA: Deliver voice request + + VA->>STT: Transcribe audio + STT-->>VA: Transcription text + + alt RAG Enabled + VA->>E: Generate query embedding + E-->>VA: Query vector + VA->>M: Search similar chunks + M-->>VA: Top-K chunks + + opt Reranker Enabled + VA->>R: Rerank chunks + R-->>VA: Reordered chunks + end + end + + VA->>L: LLM inference + L-->>VA: Response text + + VA->>TTS: Synthesize speech + TTS-->>VA: Audio bytes + + VA->>N: Publish ai.voice.response.{id}<br/>(text + audio) + N-->>W: Deliver response + W-->>U: Play audio + show text + + Note over VA,TTS: Total latency target: < 3s diff --git a/diagrams/gpu-allocation.mmd b/diagrams/gpu-allocation.mmd new file mode 100644 index 0000000..82e4f08 --- /dev/null +++ b/diagrams/gpu-allocation.mmd @@ -0,0 +1,47 @@ +%% GPU Allocation Diagram +%% Shows how AI workloads are distributed across GPU nodes + +flowchart TB + subgraph khelben["๐Ÿ–ฅ๏ธ khelben (AMD Strix Halo 64GB)"] + direction TB + vllm["๐Ÿง  vLLM<br/>LLM Inference<br/>100% GPU"] + end + + subgraph elminster["๐Ÿ–ฅ๏ธ elminster (NVIDIA RTX 2070 8GB)"] + direction TB + whisper["๐ŸŽค Whisper<br/>STT<br/>~50% GPU"] + xtts["๐Ÿ”Š XTTS<br/>TTS<br/>~50% GPU"] + end + + subgraph drizzt["๐Ÿ–ฅ๏ธ drizzt (AMD Radeon 680M 12GB)"] + direction TB + embeddings["๐Ÿ“Š BGE Embeddings<br/>Vector Encoding<br/>~80% GPU"] + end + + subgraph danilo["๐Ÿ–ฅ๏ธ danilo (Intel Arc)"] + direction TB + reranker["๐Ÿ“‹ BGE Reranker<br/>Document Ranking<br/>~80% GPU"] + end + + subgraph workloads["Workload Routing"] + chat["๐Ÿ’ฌ Chat Request"] + voice["๐ŸŽค Voice Request"] + end + + chat --> embeddings + chat --> reranker + chat --> vllm + + voice --> whisper + voice --> embeddings + voice --> reranker + voice --> vllm + voice --> xtts + + classDef nvidia fill:#76B900,color:white + classDef amd fill:#ED1C24,color:white + classDef intel fill:#0071C5,color:white + + class whisper,xtts nvidia + class vllm,embeddings amd + class reranker intel diff --git a/specs/BINARY_MESSAGES_AND_JETSTREAM.md b/specs/BINARY_MESSAGES_AND_JETSTREAM.md new file mode 100644 index 0000000..85c7ca9 --- /dev/null +++ b/specs/BINARY_MESSAGES_AND_JETSTREAM.md @@ -0,0 +1,287 @@ +# Binary Messages and JetStream Configuration + +> Technical specification for NATS message handling in the AI platform + +## Overview + +The AI platform uses NATS with JetStream for message persistence. All messages use MessagePack (msgpack) binary format for efficiency, especially when handling audio data. + +## Message Format + +### Why MessagePack? + +1. **Binary efficiency**: Audio data embedded directly without base64 overhead +2. **Compact**: 20-50% smaller than equivalent JSON +3. **Fast**: Lower serialization/deserialization overhead +4. **Compatible**: JSON-like structure, easy debugging + +### Schema + +All messages follow this general structure: + +```python +{ + "request_id": str, # UUID for correlation + "user_id": str, # User identifier + "timestamp": float, # Unix timestamp + "payload": Any, # Type-specific data + "metadata": dict # Optional metadata +} +``` + +### Chat Message + +```python +{ + "request_id": "uuid-here", + "user_id": "user-123", + "username": "john_doe", + "message": "Hello, how are you?", + "premium": False, + "enable_streaming": True, + "enable_rag": True, + "enable_reranker": True, + "top_k": 5, + "session_id": "session-abc" +} +``` + +### Voice Message + +```python +{ + "request_id": "uuid-here", + "user_id": "user-123", + "audio": b"...", # Raw bytes, not base64! + "format": "wav", + "sample_rate": 16000, + "premium": False, + "enable_rag": True, + "language": "en" +} +``` + +### Streaming Response Chunk + +```python +{ + "request_id": "uuid-here", + "type": "chunk", # "chunk", "done", "error" + "content": "token", + "done": False, + "timestamp": 1706000000.0 +} +``` + +## JetStream Configuration + +### Streams + +| Stream | Subjects | Retention | Max Age | Storage | Replicas | +|--------|----------|-----------|---------|---------|----------| +| `COMPANIONS_LOGINS` | `ai.chat.user.*.login` | Limits | 7 days | File | 1 | +| `COMPANIONS_CHAT` | `ai.chat.user.*.message`, `ai.chat.user.*.greeting.*` | Limits | 30 days | File | 1 | +| `AI_CHAT_STREAM` | `ai.chat.response.stream.>` | Limits | 5 min | Memory | 1 | +| `AI_VOICE_STREAM` | `ai.voice.>` | Limits | 1 hour | File | 1 | +| `AI_VOICE_RESPONSE_STREAM` | `ai.voice.response.stream.>` | Limits | 5 min | Memory | 1 | +| `AI_PIPELINE` | `ai.pipeline.>` | Limits | 24 hours | File | 1 | +
+### Consumer Configuration + +```yaml +# Durable consumer for chat handler +consumer: + name: chat-handler + durable_name: chat-handler + filter_subjects: + - "ai.chat.user.*.message" + ack_policy: explicit + ack_wait: 30s + max_deliver: 3 
+ deliver_policy: new +``` + +### Stream Creation (CLI) + +```bash +# Create chat stream +nats stream add COMPANIONS_CHAT \ + --subjects "ai.chat.user.*.message,ai.chat.user.*.greeting.*" \ + --retention limits \ + --max-age 30d \ + --storage file \ + --replicas 1 + +# Create ephemeral stream +nats stream add AI_CHAT_STREAM \ + --subjects "ai.chat.response.stream.>" \ + --retention limits \ + --max-age 5m \ + --storage memory \ + --replicas 1 +``` + +## Python Implementation + +### Publisher + +```python +import uuid +import nats +import msgpack +from datetime import datetime, timezone + +async def publish_chat_message(nc: nats.NATS, user_id: str, message: str): + data = { + "request_id": str(uuid.uuid4()), + "user_id": user_id, + "message": message, + "timestamp": datetime.now(timezone.utc).timestamp(), + "enable_streaming": True, + "enable_rag": True, + } + + subject = f"ai.chat.user.{user_id}.message" + await nc.publish(subject, msgpack.packb(data)) +``` + +### Subscriber (JetStream) + +```python +async def message_handler(msg): + try: + data = msgpack.unpackb(msg.data, raw=False) + + # Process message + result = await process_chat(data) + + # Publish response + response_subject = f"ai.chat.response.{data['request_id']}" + await nc.publish(response_subject, msgpack.packb(result)) + + # Acknowledge + await msg.ack() + + except Exception as e: + logger.error(f"Handler error: {e}") + await msg.nak(delay=5) # Retry after 5s + +# Subscribe with JetStream +js = nc.jetstream() +sub = await js.subscribe( + "ai.chat.user.*.message", + cb=message_handler, + durable="chat-handler", + manual_ack=True +) +``` + +### Streaming Response + +```python +async def stream_response(nc, request_id: str, response_generator): + subject = f"ai.chat.response.stream.{request_id}" + + async for token in response_generator: + chunk = { + "request_id": request_id, + "type": "chunk", + "content": token, + "done": False + } + await nc.publish(subject, msgpack.packb(chunk)) + + # Send done marker + done = { + "request_id": 
request_id, + "type": "done", + "content": "", + "done": True + } + await nc.publish(subject, msgpack.packb(done)) +``` + +## Go Implementation + +### Publisher + +```go +import ( + "fmt" + + "github.com/google/uuid" + "github.com/nats-io/nats.go" + "github.com/vmihailenco/msgpack/v5" +) + +type ChatMessage struct { + RequestID string `msgpack:"request_id"` + UserID string `msgpack:"user_id"` + Message string `msgpack:"message"` +} + +func PublishChat(nc *nats.Conn, userID, message string) error { + msg := ChatMessage{ + RequestID: uuid.New().String(), + UserID: userID, + Message: message, + } + + data, err := msgpack.Marshal(msg) + if err != nil { + return err + } + + subject := fmt.Sprintf("ai.chat.user.%s.message", userID) + return nc.Publish(subject, data) +} +``` + +## Error Handling + +### NAK with Delay + +```python +# Temporary failure - retry later +await msg.nak(delay=5) # 5 second delay + +# Permanent failure - move to dead letter +if attempt >= max_retries: + await nc.publish("ai.dlq.chat", msg.data) + await msg.term() # Terminate delivery +``` + +### Dead Letter Queue + +```yaml +stream: + name: AI_DLQ + subjects: + - "ai.dlq.>" + retention: limits + max_age: 7d + storage: file +``` + +## Monitoring + +### Key Metrics + +```bash +# Stream info +nats stream info COMPANIONS_CHAT + +# Consumer info +nats consumer info COMPANIONS_CHAT chat-handler + +# Message rate +nats stream report +``` + +### Prometheus Metrics + +- `nats_stream_messages_total` +- `nats_consumer_pending_messages` +- `nats_consumer_ack_pending` + +## Related + +- [ADR-0003: Use NATS for Messaging](../decisions/0003-use-nats-for-messaging.md) +- [ADR-0004: Use MessagePack](../decisions/0004-use-messagepack-for-nats.md) +- [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) diff --git a/specs/README.md b/specs/README.md new file mode 100644 index 0000000..1b65dfc --- /dev/null +++ b/specs/README.md @@ -0,0 +1,36 @@ +# Specifications + +This directory contains feature-level specifications and technical designs. 
+ +## Contents + +- [BINARY_MESSAGES_AND_JETSTREAM.md](BINARY_MESSAGES_AND_JETSTREAM.md) - MessagePack format and JetStream configuration +- Future specs will be added here + +## Spec Template + +```markdown +# Feature Name + +## Overview +Brief description of the feature + +## Requirements +- Requirement 1 +- Requirement 2 + +## Design +Technical design details + +## API +Interface definitions + +## Implementation Notes +Key implementation considerations + +## Testing +Test strategy + +## Open Questions +Unresolved items +```