feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions


@@ -0,0 +1,71 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | superseded by [ADR-NNNN](NNNN-example.md)]
* Date: YYYY-MM-DD
* Deciders: [list of people involved in decision]
* Technical Story: [description | ticket/issue URL]
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in the form of a question.]
## Decision Drivers
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option N]", because [justification. e.g., only option which meets k.o. criterion decision driver | which resolves force | … | comes out best (see below)].
### Positive Consequences
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->
### Negative Consequences
* [e.g., compromising quality attribute, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->
## Pros and Cons of the Options
### [option 1]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …]
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->


@@ -0,0 +1,79 @@
# Record Architecture Decisions
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Initial setup of homelab documentation
## Context and Problem Statement
As the homelab infrastructure grows in complexity with AI/ML services, multi-GPU configurations, and event-driven architectures, we need a way to document and communicate significant architectural decisions. Without documentation, the rationale behind choices gets lost, making future changes risky and onboarding difficult.
## Decision Drivers
* Need to preserve context for why decisions were made
* Enable future maintainers (including AI agents) to understand the system
* Provide a structured way to evaluate alternatives
* Support the wiki/design process for iterative improvements
## Considered Options
* Informal documentation in README files
* Wiki pages without structure
* Architecture Decision Records (ADRs)
* No documentation (rely on code)
## Decision Outcome
Chosen option: "Architecture Decision Records (ADRs)", because they provide a structured format that captures context, alternatives, and consequences. They're lightweight, version-controlled, and well-suited for technical decisions.
### Positive Consequences
* Clear historical record of decisions
* Structured format makes decisions searchable
* Forces consideration of alternatives
* Git-versioned alongside code
* AI agents can parse and understand decisions
### Negative Consequences
* Requires discipline to create ADRs
* May accumulate outdated decisions over time
* Additional overhead for simple decisions
## Pros and Cons of the Options
### Informal README documentation
* Good, because low friction
* Good, because close to code
* Bad, because no structure for alternatives
* Bad, because decisions get buried in prose
### Wiki pages
* Good, because easy to edit
* Good, because supports rich formatting
* Bad, because separate from code repository
* Bad, because no enforced structure
### ADRs
* Good, because structured format
* Good, because version controlled
* Good, because captures alternatives considered
* Good, because industry-standard practice
* Bad, because requires creating new files
* Bad, because may seem bureaucratic for small decisions
### No documentation
* Good, because no overhead
* Bad, because context is lost
* Bad, because makes onboarding difficult
* Bad, because risky for future changes
## Links
* Based on [MADR template](https://adr.github.io/madr/)
* [ADR GitHub organization](https://adr.github.io/)


@@ -0,0 +1,97 @@
# Use Talos Linux for Kubernetes Nodes
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Selecting OS for bare-metal Kubernetes cluster
## Context and Problem Statement
We need a reliable, secure operating system for running Kubernetes on bare-metal homelab nodes. The OS should minimize attack surface, be easy to manage at scale, and support our GPU requirements (AMD ROCm, NVIDIA CUDA, and Intel GPUs).
## Decision Drivers
* Security-first design (immutable, minimal)
* API-driven management (no SSH)
* Support for various GPU drivers
* Kubernetes-native focus
* Community support and updates
* Ease of upgrades
## Considered Options
* Ubuntu Server with kubeadm
* Flatcar Container Linux
* Talos Linux
* k3OS (discontinued)
* Rocky Linux with RKE2
## Decision Outcome
Chosen option: "Talos Linux", because it provides an immutable, API-driven, Kubernetes-focused OS that minimizes attack surface and simplifies operations.
### Positive Consequences
* Immutable root filesystem prevents drift
* No SSH reduces attack vectors
* API-driven management integrates well with GitOps
* Schematic system allows custom kernel modules (GPU drivers)
* Consistent configuration across all nodes
* Automatic updates with minimal disruption
### Negative Consequences
* Learning curve for API-driven management
* Debugging requires different approaches (no SSH)
* Custom extensions require schematic IDs
* Less flexibility for non-Kubernetes workloads
## Pros and Cons of the Options
### Ubuntu Server with kubeadm
* Good, because familiar
* Good, because extensive package availability
* Good, because easy debugging via SSH
* Bad, because mutable system leads to drift
* Bad, because large attack surface
* Bad, because manual package management
### Flatcar Container Linux
* Good, because immutable
* Good, because auto-updates
* Good, because container-focused
* Bad, because less Kubernetes-specific
* Bad, because smaller community than Talos
* Bad, because GPU driver setup more complex
### Talos Linux
* Good, because purpose-built for Kubernetes
* Good, because immutable and minimal
* Good, because API-driven (no SSH)
* Good, because excellent Kubernetes integration
* Good, because active development and community
* Good, because schematic system for GPU drivers
* Bad, because learning curve
* Bad, because no traditional debugging
### k3OS
* Good, because simple
* Bad, because discontinued
### Rocky Linux with RKE2
* Good, because enterprise-like
* Good, because familiar Linux experience
* Bad, because mutable system
* Bad, because more operational overhead
* Bad, because larger attack surface
## Links
* [Talos Linux](https://talos.dev)
* [Talos Image Factory](https://factory.talos.dev)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU driver integration via schematics


@@ -0,0 +1,112 @@
# Use NATS for AI/ML Messaging
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting message bus for AI service orchestration
## Context and Problem Statement
The AI/ML platform requires a messaging system for:
- Real-time chat message routing
- Voice request/response streaming
- Pipeline triggers and status updates
- Event-driven workflow orchestration
We need a messaging system that handles both ephemeral real-time messages and persistent streams.
## Decision Drivers
* Low latency for real-time chat/voice
* Persistence for audit and replay
* Simple operations for homelab
* Support for request-reply pattern
* Wildcard subscriptions for routing
* Binary message support (audio data)
## Considered Options
* Apache Kafka
* RabbitMQ
* Redis Pub/Sub + Streams
* NATS with JetStream
* Apache Pulsar
## Decision Outcome
Chosen option: "NATS with JetStream", because it provides both fire-and-forget messaging and persistent streams with significantly simpler operations than alternatives.
### Positive Consequences
* Sub-millisecond latency for real-time messages
* JetStream provides persistence when needed
* Simple deployment (single binary)
* Excellent Kubernetes integration
* Request-reply pattern built-in
* Wildcard subscriptions for flexible routing
* Low resource footprint
### Negative Consequences
* Less ecosystem than Kafka
* JetStream less mature than Kafka Streams
* No built-in schema registry
* Smaller community than RabbitMQ
## Pros and Cons of the Options
### Apache Kafka
* Good, because industry standard for streaming
* Good, because rich ecosystem (Kafka Streams, Connect)
* Good, because schema registry
* Good, because excellent for high throughput
* Bad, because operationally complex (ZooKeeper/KRaft)
* Bad, because high resource requirements
* Bad, because overkill for homelab scale
* Bad, because higher latency for real-time messages
### RabbitMQ
* Good, because mature and stable
* Good, because flexible routing
* Good, because good management UI
* Bad, because AMQP protocol overhead
* Bad, because not designed for streaming
* Bad, because more complex clustering
### Redis Pub/Sub + Streams
* Good, because simple
* Good, because Redis may already be in the stack
* Good, because low latency
* Bad, because pub/sub not persistent
* Bad, because streams API less intuitive
* Bad, because not Redis's primary purpose
### NATS with JetStream
* Good, because extremely low latency
* Good, because simple operations
* Good, because both pub/sub and persistence
* Good, because request-reply built-in
* Good, because wildcard subscriptions
* Good, because low resource usage
* Good, because excellent Go/Python clients
* Bad, because smaller ecosystem
* Bad, because JetStream newer than Kafka
### Apache Pulsar
* Good, because unified messaging + streaming
* Good, because multi-tenancy
* Good, because geo-replication
* Bad, because complex architecture
* Bad, because high resource requirements
* Bad, because smaller community
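## Implementation Notes
A minimal sketch of the request-reply and wildcard-subscription patterns named in the decision drivers, using the nats-py client; the server URL and subject names are illustrative assumptions, not the platform's actual schema.
```python
# Sketch only: URL and subjects are assumed for illustration.
import asyncio

import nats

async def main():
    nc = await nats.connect("nats://nats.nats.svc.cluster.local:4222")

    # Wildcard subscription: one handler routes chat messages for any user
    async def on_chat(msg):
        print(msg.subject, len(msg.data), "bytes")

    await nc.subscribe("chat.*.messages", cb=on_chat)

    # Built-in request-reply: send audio bytes, await the STT response
    reply = await nc.request("voice.stt.request", b"<raw audio>", timeout=2.0)
    print(reply.data)

    await nc.drain()

asyncio.run(main())
```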
## Links
* [NATS.io](https://nats.io)
* [JetStream Documentation](https://docs.nats.io/nats-concepts/jetstream)
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) - Message format


@@ -0,0 +1,137 @@
# Use MessagePack for NATS Messages
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages
## Context and Problem Statement
NATS messages in the AI platform carry various payloads:
- Text chat messages (small)
- Voice audio data (potentially large, base64 or binary)
- Streaming response chunks
- Pipeline parameters
We need a serialization format that handles both text and binary efficiently.
## Decision Drivers
* Efficient binary data handling (audio)
* Compact message size
* Fast serialization/deserialization
* Cross-language support (Python, Go)
* Debugging ability
* Schema flexibility
## Considered Options
* JSON
* Protocol Buffers (protobuf)
* MessagePack (msgpack)
* CBOR
* Avro
## Decision Outcome
Chosen option: "MessagePack (msgpack)", because it provides binary efficiency with JSON-like simplicity and schema-less flexibility.
### Positive Consequences
* Native binary support (no base64 overhead for audio)
* 20-50% smaller than JSON for typical messages
* Faster serialization than JSON
* No schema compilation step
* Easy debugging (can pretty-print like JSON)
* Excellent Python and Go libraries
### Negative Consequences
* Less human-readable than JSON when raw
* No built-in schema validation
* Slightly less common than JSON
## Pros and Cons of the Options
### JSON
* Good, because human-readable
* Good, because universal support
* Good, because no setup required
* Bad, because binary data requires base64 (33% overhead)
* Bad, because larger message sizes
* Bad, because slower parsing
### Protocol Buffers
* Good, because very compact
* Good, because fast
* Good, because schema validation
* Good, because cross-language
* Bad, because requires schema definition
* Bad, because compilation step
* Bad, because less flexible for evolving schemas
* Bad, because overkill for simple messages
### MessagePack
* Good, because binary-efficient
* Good, because JSON-like simplicity
* Good, because no schema required
* Good, because excellent library support
* Good, because can include raw bytes
* Bad, because not human-readable raw
* Bad, because no schema validation
### CBOR
* Good, because binary-efficient
* Good, because IETF standard
* Good, because schema-less
* Bad, because less common libraries
* Bad, because smaller community
* Bad, because similar to msgpack with less adoption
### Avro
* Good, because schema evolution
* Good, because compact
* Good, because schema registry integration
* Bad, because requires schema
* Bad, because more complex setup
* Bad, because Java-centric ecosystem
## Implementation Notes
```python
# Python usage
import msgpack

# Serialize
audio_bytes = b"\x52\x49\x46\x46"  # stand-in for raw audio bytes
data = {
    "user_id": "user-123",
    "audio": audio_bytes,  # Raw bytes, no base64
    "premium": True,
}
payload = msgpack.packb(data)

# Deserialize
data = msgpack.unpackb(payload, raw=False)
```
```go
// Go usage
import "github.com/vmihailenco/msgpack/v5"

type Message struct {
    UserID string `msgpack:"user_id"`
    Audio  []byte `msgpack:"audio"`
}

// Round-trip with msgpack.Marshal(&msg) / msgpack.Unmarshal(b, &msg).
```
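Since ADR-0003 pairs this format with NATS, a hedged sketch of the two together; the subject name and server URL are illustrative assumptions.
```python
# Sketch: msgpack-encoded payload published over NATS (names assumed)
import asyncio

import msgpack
import nats

async def publish_chat():
    nc = await nats.connect("nats://nats.nats.svc.cluster.local:4222")
    payload = msgpack.packb({"user_id": "user-123", "text": "hello"})
    await nc.publish("chat.user-123.messages", payload)
    await nc.drain()

asyncio.run(publish_chat())
```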
## Links
* [MessagePack Specification](https://msgpack.org)
* [msgpack-python](https://github.com/msgpack/msgpack-python)
* Related: [ADR-0003](0003-use-nats-for-messaging.md) - Message bus choice
* See: [BINARY_MESSAGES_AND_JETSTREAM.md](../specs/BINARY_MESSAGES_AND_JETSTREAM.md)


@@ -0,0 +1,145 @@
# Multi-GPU Heterogeneous Strategy
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: GPU allocation strategy for AI workloads
## Context and Problem Statement
The homelab has diverse GPU hardware:
- AMD Strix Halo (64GB unified memory) - khelben
- NVIDIA RTX 2070 (8GB VRAM) - elminster
- AMD Radeon 680M (12GB VRAM) - drizzt
- Intel Arc (integrated) - danilo
Different AI workloads have different requirements. How do we allocate GPUs effectively?
## Decision Drivers
* Maximize utilization of all GPUs
* Match workloads to appropriate hardware
* Support concurrent inference services
* Enable fractional GPU sharing where appropriate
* Minimize cross-vendor complexity
## Considered Options
* Single GPU vendor only
* All workloads on largest GPU
* Workload-specific GPU allocation
* Dynamic GPU scheduling (MIG/fractional)
## Decision Outcome
Chosen option: "Workload-specific GPU allocation with dedicated nodes", where each AI service is pinned to the most appropriate GPU based on requirements.
### Allocation Strategy
| Workload | GPU | Node | Rationale |
|----------|-----|------|-----------|
| vLLM (LLM inference) | AMD Strix Halo (64GB) | khelben (dedicated) | Large models need unified memory |
| Whisper (STT) | NVIDIA RTX 2070 (8GB) | elminster | CUDA optimized, medium memory |
| XTTS (TTS) | NVIDIA RTX 2070 (8GB) | elminster | Shares with Whisper |
| BGE Embeddings | AMD Radeon 680M (12GB) | drizzt | ROCm support, batch processing |
| BGE Reranker | Intel Arc | danilo | Light workload, Intel optimization |
### Positive Consequences
* Each workload gets optimal hardware
* No GPU memory contention for LLM
* NVIDIA services can share via time-slicing
* Cost-effective use of varied hardware
* Clear ownership and debugging
### Negative Consequences
* More complex scheduling (node taints/tolerations)
* Less flexibility for workload migration
* Must maintain multiple GPU driver stacks
* Some GPUs underutilized at times
## Implementation
### Node Taints
```yaml
# khelben - dedicated vLLM node
nodeTaints:
  dedicated: "vllm:NoSchedule"
```
### Pod Tolerations and Node Affinity
```yaml
# vLLM deployment
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "vllm"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["khelben"]
```
### Resource Limits
```yaml
# NVIDIA GPU (elminster)
resources:
  limits:
    nvidia.com/gpu: 1

# AMD GPU (drizzt, khelben)
resources:
  limits:
    amd.com/gpu: 1
```
## Pros and Cons of the Options
### Single GPU vendor only
* Good, because simpler driver management
* Good, because consistent tooling
* Bad, because wastes existing hardware
* Bad, because higher cost for new hardware
### All workloads on largest GPU
* Good, because simple scheduling
* Good, because unified memory benefits
* Bad, because memory contention
* Bad, because single point of failure
* Bad, because wastes other GPUs
### Workload-specific allocation (chosen)
* Good, because optimal hardware matching
* Good, because uses all available GPUs
* Good, because clear resource boundaries
* Good, because parallel inference
* Bad, because more complex configuration
* Bad, because multiple driver stacks
### Dynamic GPU scheduling
* Good, because flexible
* Good, because maximizes utilization
* Bad, because complex to implement
* Bad, because MIG not available on consumer GPUs
* Bad, because cross-vendor scheduling immature
## Links
* [Volcano Scheduler](https://volcano.sh)
* [AMD GPU Device Plugin](https://github.com/ROCm/k8s-device-plugin)
* [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* Related: [ADR-0002](0002-use-talos-linux.md) - GPU drivers via Talos schematics


@@ -0,0 +1,140 @@
# GitOps with Flux CD
* Status: accepted
* Date: 2025-11-30
* Deciders: Billy Davies
* Technical Story: Implementing GitOps for cluster management
## Context and Problem Statement
Managing a Kubernetes cluster with numerous applications, configurations, and secrets requires a reliable, auditable, and reproducible approach. Manual `kubectl apply` is error-prone and doesn't track state over time.
## Decision Drivers
* Infrastructure as Code (IaC) principles
* Audit trail for all changes
* Self-healing cluster state
* Multi-repository support
* Secret encryption integration
* Active community and maintenance
## Considered Options
* Manual kubectl apply
* ArgoCD
* Flux CD
* Rancher Fleet
* Pulumi/Terraform for Kubernetes
## Decision Outcome
Chosen option: "Flux CD", because it provides a mature GitOps implementation with excellent multi-source support, SOPS integration, and aligns well with the Kubernetes ecosystem.
### Positive Consequences
* Git is single source of truth
* Automatic drift detection and correction
* Native SOPS/Age secret encryption
* Multi-repository support (homelab-k8s2 + llm-workflows)
* Helm and Kustomize native support
* Webhook-free sync (pull-based)
### Negative Consequences
* No built-in UI (use CLI or third-party)
* Learning curve for CRD-based configuration
* Debugging requires understanding Flux controllers
## Configuration
### Repository Structure
```
homelab-k8s2/
├── kubernetes/
│   ├── flux/                  # Flux system config
│   │   ├── config/
│   │   │   ├── cluster.yaml
│   │   │   └── secrets.yaml   # SOPS encrypted
│   │   └── repositories/
│   │       ├── helm/          # HelmRepositories
│   │       └── git/           # GitRepositories
│   └── apps/                  # Application Kustomizations
```
### Multi-Repository Sync
```yaml
# GitRepository for llm-workflows
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: llm-workflows
  namespace: flux-system
spec:
  url: ssh://git@github.com/Billy-Davies-2/llm-workflows
  ref:
    branch: main
  secretRef:
    name: github-deploy-key
```
### SOPS Integration
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.sops\.yaml$
    age: >-
      age1...  # Public key
```
## Pros and Cons of the Options
### Manual kubectl apply
* Good, because simple
* Good, because no setup
* Bad, because no audit trail
* Bad, because no drift detection
* Bad, because not reproducible
### ArgoCD
* Good, because great UI
* Good, because app-of-apps pattern
* Good, because large community
* Bad, because heavier resource usage
* Bad, because webhook-dependent sync
* Bad, because SOPS requires plugins
### Flux CD
* Good, because lightweight
* Good, because pull-based (no webhooks)
* Good, because native SOPS support
* Good, because multi-source/multi-tenant
* Good, because Kubernetes-native CRDs
* Bad, because no built-in UI
* Bad, because CRD learning curve
### Rancher Fleet
* Good, because integrated with Rancher
* Good, because multi-cluster
* Bad, because Rancher ecosystem lock-in
* Bad, because smaller community
### Pulumi/Terraform
* Good, because familiar IaC tools
* Good, because drift detection
* Bad, because not Kubernetes-native
* Bad, because requires state management
* Bad, because no continuous reconciliation
## Links
* [Flux CD](https://fluxcd.io)
* [SOPS Integration](https://fluxcd.io/flux/guides/mozilla-sops/)
* [flux-local](https://github.com/allenporter/flux-local) - Local testing


@@ -0,0 +1,115 @@
# Use KServe for ML Model Serving
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting model serving platform for inference services
## Context and Problem Statement
We need to deploy multiple ML models (Whisper, XTTS, BGE, vLLM) as inference endpoints. Each model has different requirements for scaling, protocols (HTTP/gRPC), and GPU allocation.
## Decision Drivers
* Standardized inference protocol (V2)
* Autoscaling based on load
* Traffic splitting for canary deployments
* Integration with Kubeflow ecosystem
* GPU resource management
* Health checks and readiness
## Considered Options
* Raw Kubernetes Deployments + Services
* KServe InferenceService
* Seldon Core
* BentoML
* Ray Serve only
## Decision Outcome
Chosen option: "KServe InferenceService", because it provides a standardized, Kubernetes-native approach to model serving with built-in autoscaling and traffic management.
### Positive Consequences
* Standardized V2 inference protocol
* Automatic scale-to-zero capability
* Canary/blue-green deployments
* Integration with Kubeflow UI
* Transformer/Explainer components
* GPU resource abstraction
### Negative Consequences
* Additional CRDs and operators
* Learning curve for InferenceService spec
* Some overhead for simple deployments
* Knative Serving dependency (optional)
## Pros and Cons of the Options
### Raw Kubernetes Deployments
* Good, because simple
* Good, because full control
* Bad, because no autoscaling logic
* Bad, because manual service mesh
* Bad, because repetitive configuration
### KServe InferenceService
* Good, because standardized API
* Good, because autoscaling
* Good, because traffic management
* Good, because Kubeflow integration
* Bad, because operator complexity
* Bad, because Knative optional dependency
### Seldon Core
* Good, because mature
* Good, because A/B testing
* Good, because explainability
* Bad, because more complex than KServe
* Bad, because heavier resource usage
### BentoML
* Good, because developer-friendly
* Good, because packaging focused
* Bad, because less Kubernetes-native
* Bad, because smaller community
### Ray Serve
* Good, because unified compute
* Good, because Python-native
* Good, because fractional GPU
* Bad, because less standardized API
* Bad, because Ray cluster overhead
## Current Configuration
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: whisper
  namespace: ai-ml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    containers:
      - name: whisper
        image: ghcr.io/org/whisper:latest
        resources:
          limits:
            nvidia.com/gpu: 1
```
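A client-side sketch against the V2 inference protocol this service exposes; the in-cluster hostname, input tensor name, and payload shape are illustrative assumptions rather than the actual Whisper contract.
```python
# Sketch: V2 inference protocol call (host/tensor names are assumed)
import base64

import requests

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

body = {
    "inputs": [
        {"name": "audio", "shape": [1], "datatype": "BYTES", "data": [audio_b64]}
    ]
}
resp = requests.post(
    "http://whisper.ai-ml.svc.cluster.local/v2/models/whisper/infer",
    json=body,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["outputs"])
```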
## Links
* [KServe](https://kserve.github.io)
* [V2 Inference Protocol](https://kserve.github.io/website/latest/modelserving/data_plane/v2_protocol/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) - GPU allocation


@@ -0,0 +1,107 @@
# Use Milvus for Vector Storage
* Status: accepted
* Date: 2025-12-15
* Deciders: Billy Davies
* Technical Story: Selecting vector database for RAG system
## Context and Problem Statement
The RAG (Retrieval-Augmented Generation) system requires a vector database to store document embeddings and perform similarity search. We need to store millions of embeddings and query them with low latency.
## Decision Drivers
* Query performance (< 100ms for top-k search)
* Scalability to millions of vectors
* Kubernetes-native deployment
* Active development and community
* Support for metadata filtering
* Backup and restore capabilities
## Considered Options
* Milvus
* Pinecone (managed)
* Qdrant
* Weaviate
* pgvector (PostgreSQL extension)
* Chroma
## Decision Outcome
Chosen option: "Milvus", because it provides production-grade vector search with excellent Kubernetes support, scalability, and active development.
### Positive Consequences
* High-performance similarity search
* Horizontal scalability
* Rich filtering and hybrid search
* Helm chart for Kubernetes
* Active CNCF sandbox project
* GPU acceleration available
### Negative Consequences
* Complex architecture (multiple components)
* Higher resource usage than simpler alternatives
* Requires object storage (MinIO)
* Learning curve for optimization
## Pros and Cons of the Options
### Milvus
* Good, because production-proven at scale
* Good, because rich query API
* Good, because Kubernetes-native
* Good, because hybrid search (vector + scalar)
* Good, because CNCF project
* Bad, because complex architecture
* Bad, because higher resource usage
### Pinecone
* Good, because fully managed
* Good, because simple API
* Good, because reliable
* Bad, because external dependency
* Bad, because cost at scale
* Bad, because data sovereignty concerns
### Qdrant
* Good, because simpler than Milvus
* Good, because Rust performance
* Good, because good filtering
* Bad, because smaller community
* Bad, because fewer enterprise features
### Weaviate
* Good, because built-in vectorization
* Good, because GraphQL API
* Good, because modules system
* Bad, because more opinionated
* Bad, because schema requirements
### pgvector
* Good, because familiar PostgreSQL
* Good, because simple deployment
* Good, because ACID transactions
* Bad, because limited scale
* Bad, because slower for large datasets
* Bad, because no specialized optimizations
### Chroma
* Good, because simple
* Good, because embedded option
* Bad, because not production-ready at scale
* Bad, because limited features
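## Implementation Notes
A minimal sketch of the top-k search with metadata filtering called out in the decision drivers, using the pymilvus MilvusClient; the collection name, field names, and embedding dimension are illustrative assumptions.
```python
# Sketch: top-k vector search with scalar filtering (names assumed)
from pymilvus import MilvusClient

client = MilvusClient(uri="http://milvus.ai-ml.svc.cluster.local:19530")

query_embedding = [0.1] * 768  # stand-in for a BGE embedding

hits = client.search(
    collection_name="documents",
    data=[query_embedding],           # one query vector
    limit=5,                          # top-k
    filter='source == "wiki"',        # hybrid search: vector + scalar
    output_fields=["chunk_id", "source"],
)
print(hits[0])
```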
## Links
* [Milvus](https://milvus.io)
* [Milvus Helm Chart](https://github.com/milvus-io/milvus-helm)
* Related: [DOMAIN-MODEL.md](../DOMAIN-MODEL.md) - Chunk/Embedding entities


@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we use one engine or leverage the strengths of multiple engines?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

                 OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
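A sketch of the second path, an Argo step calling the Kubeflow Pipelines API via the kfp SDK; the API host, pipeline package, and arguments are illustrative assumptions.
```python
# Sketch: trigger a KFP run from an Argo step (host/paths are assumed)
import kfp

client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")
run = client.create_run_from_pipeline_package(
    "train_pipeline.yaml",
    arguments={"dataset_uri": "s3://datasets/docs"},
)
print("started run:", run.run_id)
```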
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)


@@ -0,0 +1,120 @@
# Use Envoy Gateway for Ingress
* Status: accepted
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting ingress controller for cluster
## Context and Problem Statement
We need an ingress solution that supports:
- Gateway API (modern Kubernetes standard)
- gRPC for ML inference
- WebSocket for real-time chat/voice
- Header-based routing for A/B testing
- TLS termination
## Decision Drivers
* Gateway API support (HTTPRoute, GRPCRoute)
* WebSocket support
* gRPC support
* Performance at edge
* Active development
* Envoy ecosystem familiarity
## Considered Options
* NGINX Ingress Controller
* Traefik
* Envoy Gateway
* Istio Gateway
* Contour
## Decision Outcome
Chosen option: "Envoy Gateway", because it's the reference implementation of Gateway API with full Envoy feature set.
### Positive Consequences
* Native Gateway API support
* Full Envoy feature set
* WebSocket and gRPC native
* No Istio complexity
* CNCF graduated project (Envoy)
* Easy integration with observability
### Negative Consequences
* Newer than alternatives
* Less documentation than NGINX
* Envoy configuration learning curve
## Pros and Cons of the Options
### NGINX Ingress
* Good, because mature
* Good, because well-documented
* Good, because familiar
* Bad, because limited Gateway API
* Bad, because commercial features gated
### Traefik
* Good, because auto-discovery
* Good, because good UI
* Good, because Let's Encrypt
* Bad, because Gateway API experimental
* Bad, because less gRPC focus
### Envoy Gateway
* Good, because Gateway API native
* Good, because full Envoy features
* Good, because extensible
* Good, because gRPC/WebSocket native
* Bad, because newer project
* Bad, because less community content
### Istio Gateway
* Good, because full mesh features
* Good, because Gateway API
* Bad, because overkill without mesh
* Bad, because resource heavy
### Contour
* Good, because Envoy-based
* Good, because lightweight
* Bad, because Gateway API evolving
* Bad, because smaller community
## Configuration Example
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: companions-chat
spec:
  parentRefs:
    - name: eg-gateway
      namespace: network
  hostnames:
    - companions-chat.lab.daviestechlabs.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: companions-chat
          port: 8080
```
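WebSocket pass-through (one of the decision drivers) can be smoke-tested through this route; a sketch using the websockets library, with an assumed /ws path.
```python
# Sketch: WebSocket probe through the gateway (the /ws path is assumed)
import asyncio

import websockets

async def probe():
    uri = "wss://companions-chat.lab.daviestechlabs.io/ws"
    async with websockets.connect(uri) as ws:
        await ws.send("ping")
        print(await ws.recv())

asyncio.run(probe())
```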
## Links
* [Envoy Gateway](https://gateway.envoyproxy.io)
* [Gateway API](https://gateway-api.sigs.k8s.io)