# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios that combine ML pipeline steps with general DAG orchestration

Should we standardize on one engine or combine the strengths of several?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |

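The matrix can also be expressed as a lookup, which may help when a submission script or CI check needs to pick an engine. The following is a minimal sketch, not part of the platform code: the `Engine` enum, the snake_case use-case keys, and `choose_engine` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Engine(Enum):
    KUBEFLOW = "Kubeflow"
    ARGO = "Argo"
    BOTH = "Both"  # Argo orchestrates, KFP runs the ML steps


# Mirrors the rows of the decision matrix.
# The snake_case keys are illustrative, not identifiers used by the platform.
DECISION_MATRIX = {
    "ml_training_with_caching": Engine.KUBEFLOW,
    "model_evaluation": Engine.KUBEFLOW,
    "document_ingestion": Engine.ARGO,
    "batch_inference": Engine.ARGO,
    "complex_dag_with_branching": Engine.ARGO,
    "hybrid_ml_training": Engine.BOTH,
}


def choose_engine(use_case: str) -> Engine:
    """Return the engine the matrix assigns to a use case.

    Unknown use cases raise instead of silently defaulting, so a new
    kind of workload forces an explicit decision.
    """
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        raise ValueError(f"no engine decided for use case: {use_case!r}") from None


if __name__ == "__main__":
    for case in DECISION_MATRIX:
        print(f"{case}: {choose_engine(case).value}")
```

Raising on an unknown use case is a deliberate choice in this sketch: a workload that is not in the matrix should prompt an update to this ADR rather than fall through to a default engine.
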
### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```

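The second path in the diagram, an Argo Workflow step that calls the `kfp-trigger` WorkflowTemplate, could look roughly like the manifest built below. This is a sketch under stated assumptions: only the `kfp-trigger` name comes from this ADR. The inner template name `trigger`, the parameter names `pipeline-name` and `pipeline-params`, and the `hybrid-ml-training-` prefix are hypothetical and would need to match the real WorkflowTemplate.

```python
import json


def build_kfp_trigger_workflow(pipeline_name: str, params: dict) -> dict:
    """Build an Argo Workflow manifest with one step that calls kfp-trigger."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets Kubernetes append a unique suffix per run.
        "metadata": {"generateName": "hybrid-ml-training-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "steps": [
                        [
                            {
                                "name": "kfp-trigger",
                                # References a cluster-installed WorkflowTemplate.
                                # "trigger" is an assumed template name.
                                "templateRef": {
                                    "name": "kfp-trigger",
                                    "template": "trigger",
                                },
                                "arguments": {
                                    "parameters": [
                                        # Assumed parameter names.
                                        {"name": "pipeline-name", "value": pipeline_name},
                                        # Argo parameters are strings, so the
                                        # pipeline arguments are JSON-encoded.
                                        {"name": "pipeline-params", "value": json.dumps(params)},
                                    ]
                                },
                            }
                        ]
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_kfp_trigger_workflow("train-model", {"epochs": 3})
    print(json.dumps(manifest, indent=2))
```

Because the manifest uses `generateName`, the printed JSON would be submitted with `kubectl create -f -` rather than `kubectl apply`.
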
## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

### Both engines (chosen)

* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea)