feat: add comprehensive architecture documentation
- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories and kubectl cluster-info dump data.
124
decisions/0009-dual-workflow-engines.md
Normal file
@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios that combine ML pipeline steps with general DAG orchestration

Should we standardize on one engine or leverage the strengths of several?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |

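The matrix above could be encoded as a small lookup so that submission tooling routes each workload consistently. This is a minimal sketch, not part of the repository: the use-case keys, the `Engine` enum, and `select_engine` are hypothetical names chosen for illustration; only the engine assignments come from the table.

```python
from enum import Enum


class Engine(Enum):
    KUBEFLOW = "kubeflow"
    ARGO = "argo"
    BOTH = "both"  # Argo orchestrates, KFP runs the ML steps


# Hypothetical use-case keys; the engine assignments mirror the decision matrix.
DECISION_MATRIX = {
    "ml-training-cached": Engine.KUBEFLOW,
    "model-evaluation": Engine.KUBEFLOW,
    "document-ingestion": Engine.ARGO,
    "batch-inference": Engine.ARGO,
    "complex-dag": Engine.ARGO,
    "hybrid-ml-training": Engine.BOTH,
}


def select_engine(use_case: str) -> Engine:
    """Return the engine the matrix assigns to a use case; fail loudly on unknown ones."""
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        raise ValueError(f"no engine decided for use case: {use_case!r}") from None
```

Raising on unknown use cases, rather than defaulting to one engine, keeps new workload types from silently bypassing the decision recorded here.
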
### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

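The caching benefit listed above rests on a simple idea: a step whose definition and inputs are unchanged does not need to run again. The sketch below illustrates that idea only. It is not how Kubeflow Pipelines implements caching, and `cache_key`, `run_cached`, and the in-memory `_cache` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple

# In-memory stand-in for a persistent cache store (assumption for illustration).
_cache: Dict[str, Any] = {}


def cache_key(component: str, version: str, inputs: Dict[str, Any]) -> str:
    """Derive a deterministic key from the step identity and its inputs."""
    payload = json.dumps(
        {"component": component, "version": version, "inputs": inputs},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    component: str,
    version: str,
    inputs: Dict[str, Any],
    fn: Callable[..., Any],
) -> Tuple[Any, bool]:
    """Run fn(**inputs) unless an identical invocation was seen; return (result, cache_hit)."""
    key = cache_key(component, version, inputs)
    if key in _cache:
        return _cache[key], True
    result = fn(**inputs)
    _cache[key] = result
    return result, False
```

Including a version in the key means a changed step definition invalidates earlier results even when the inputs are identical.
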
### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```

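The second path in the diagram (an Argo Workflow step that triggers a Kubeflow Pipeline through a WorkflowTemplate) might be expressed as a manifest like the one this sketch builds. The `apiVersion`, `kind`, `dag.tasks`, and `templateRef` fields are standard Argo Workflows fields; the WorkflowTemplate name `kfp-trigger`, its template `trigger-pipeline`, and the parameter names are assumptions for illustration and may not match the actual `kfp-integration.yaml`.

```python
import json
from typing import Any, Dict


def build_kfp_trigger_workflow(pipeline_name: str, params: Dict[str, str]) -> Dict[str, Any]:
    """Build an Argo Workflow manifest with one DAG task that delegates to a WorkflowTemplate."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "kfp-trigger-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "dag": {
                        "tasks": [
                            {
                                "name": "kfp-trigger",
                                # Hypothetical WorkflowTemplate and template names.
                                "templateRef": {
                                    "name": "kfp-trigger",
                                    "template": "trigger-pipeline",
                                },
                                "arguments": {
                                    "parameters": [
                                        # Hypothetical parameter names.
                                        {"name": "pipeline-name", "value": pipeline_name},
                                        {
                                            "name": "pipeline-params",
                                            "value": json.dumps(params, sort_keys=True),
                                        },
                                    ]
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_kfp_trigger_workflow("document-embedding", {"batch_size": "32"})
    print(json.dumps(manifest, indent=2))
```

Because the manifest uses `generateName`, it would be submitted with a create operation (for example `kubectl create -f -`) rather than an apply. The pipeline name in the usage example is a placeholder.
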
## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

### Both engines (chosen)

* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)