feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions
--- a/decisions/0009-dual-workflow-engines.md
+++ b/decisions/0009-dual-workflow-engines.md
@@ -0,0 +1,124 @@
+# Dual Workflow Engine Strategy (Argo + Kubeflow)
+
+* Status: accepted
+* Date: 2026-01-15
+* Deciders: Billy Davies
+* Technical Story: Selecting workflow orchestration for ML pipelines
+
+## Context and Problem Statement
+
+The AI platform needs workflow orchestration for:
+- ML training pipelines with caching
+- Document ingestion (batch)
+- Complex DAG workflows (training → evaluation → deployment)
+- Hybrid scenarios that combine ML pipelines with DAG orchestration
+
+Should we use a single engine, or combine the strengths of several?
+
+## Decision Drivers
+
+* ML-specific features (caching, lineage)
+* Complex DAG support
+* Kubernetes-native execution
+* Visibility and debugging
+* Community and ecosystem
+* Integration with existing tools
+
+## Considered Options
+
+* Kubeflow Pipelines only
+* Argo Workflows only
+* Both engines with clear use cases
+* Airflow on Kubernetes
+* Prefect/Dagster
+
+## Decision Outcome
+
+Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
+
+### Decision Matrix
+
+| Use Case | Engine | Reason |
+|----------|--------|--------|
+| ML training with caching | Kubeflow | Component caching, experiment tracking |
+| Model evaluation | Kubeflow | Metric collection, comparison |
+| Document ingestion | Argo | Simple DAG, no ML features needed |
+| Batch inference | Argo | Parallelization, retries |
+| Complex DAG with branching | Argo | Superior control flow |
+| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
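
The decision matrix above can be read as a lookup from use case to engine. The following is a minimal sketch of that reading, not code from the repositories: the use-case slugs, the `Engine` enum, and `engines_for` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Engine(Enum):
    KUBEFLOW = "kubeflow"
    ARGO = "argo"


# One entry per row of the decision matrix.
# The slug keys are hypothetical; the ADR only gives prose use-case names.
DECISION_MATRIX = {
    "ml-training-with-caching": (Engine.KUBEFLOW,),
    "model-evaluation": (Engine.KUBEFLOW,),
    "document-ingestion": (Engine.ARGO,),
    "batch-inference": (Engine.ARGO,),
    "complex-dag-with-branching": (Engine.ARGO,),
    # Hybrid: Argo orchestrates, Kubeflow Pipelines runs the ML steps.
    "hybrid-ml-training": (Engine.ARGO, Engine.KUBEFLOW),
}


def engines_for(use_case: str) -> tuple[Engine, ...]:
    """Return the engine(s) the matrix assigns to a use case."""
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        # Fail loudly so a new use case forces an explicit decision.
        raise ValueError(f"no engine assigned for use case: {use_case!r}") from None
```

Raising on an unknown use case, rather than defaulting to one engine, keeps the "clear use cases" part of the decision explicit.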
+
+### Positive Consequences
+
+* Best tool for each job
+* ML pipelines get proper caching
+* Complex workflows get better DAG support
+* The two engines can integrate via Argo Events
+* Gradual migration is possible
+
+### Negative Consequences
+
+* Two systems to maintain
+* Team needs to learn both
+* More complex debugging
+* Integration overhead
+
+## Integration Architecture
+
+```
+NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
+                                        │
+                                        └──► Kubeflow Pipeline (via API)
+
+                    OR
+
+Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
+                 (WorkflowTemplate)
+```
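
The second path in the diagram (an Argo Workflow step that hands off to Kubeflow) might look roughly like the sketch below, which builds a Workflow manifest whose single step references a WorkflowTemplate through `templateRef`. It assumes the WorkflowTemplate is named `kfp-trigger`, as the diagram suggests; the inner template name `trigger`, the `pipeline-name` parameter, and the function name are hypothetical and may not match `kfp-integration.yaml`.

```python
import json


def build_kfp_trigger_workflow(pipeline_name: str) -> dict:
    """Build an Argo Workflow manifest with one step that calls kfp-trigger."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "kfp-trigger-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    # Outer list: sequential groups; inner list: parallel steps.
                    "steps": [
                        [
                            {
                                "name": "run-kubeflow-pipeline",
                                "templateRef": {
                                    "name": "kfp-trigger",  # assumed WorkflowTemplate name
                                    "template": "trigger",  # hypothetical template name
                                },
                                "arguments": {
                                    "parameters": [
                                        # Hypothetical parameter name.
                                        {"name": "pipeline-name", "value": pipeline_name}
                                    ]
                                },
                            }
                        ]
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Emit JSON, which can be submitted to the cluster like a YAML manifest.
    print(json.dumps(build_kfp_trigger_workflow("example-pipeline"), indent=2))
```

Keeping the Kubeflow call behind a shared WorkflowTemplate means individual workflows only pass parameters and do not need to know how the Kubeflow API is reached.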
+
+## Pros and Cons of the Options
+
+### Kubeflow Pipelines only
+
+* Good, because it is built for ML workloads
+* Good, because it provides component caching
+* Good, because it provides experiment tracking
+* Bad, because its DAG features are limited
+* Bad, because its control flow is less flexible
+
+### Argo Workflows only
+
+* Good, because it has powerful DAG support
+* Good, because it is flexible
+* Good, because it offers strong debugging support
+* Bad, because it has no ML caching
+* Bad, because it has no experiment tracking
+
+### Both engines (chosen)
+
+* Good, because it combines the strengths of both engines
+* Good, because each job gets the appropriate tool
+* Good, because the two engines can integrate
+* Bad, because it adds operational complexity
+* Bad, because the team must learn two systems
+
+### Airflow
+
+* Good, because it is mature
+* Good, because it has a large community
+* Bad, because it is Python-centric
+* Bad, because it is not Kubernetes-native
+* Bad, because it has no ML features
+
+### Prefect/Dagster
+
+* Good, because they have a modern design
+* Good, because they are Python-native
+* Bad, because they are less Kubernetes-native
+* Bad, because they are newer and less proven
+
+## Links
+
+* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
+* [Argo Workflows](https://argoproj.github.io/workflows/)
+* [Argo Events](https://argoproj.github.io/events/)
+* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)