# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we standardize on a single engine, or leverage the strengths of multiple engines?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
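The hybrid row above can be sketched as an Argo WorkflowTemplate whose step submits a Kubeflow run through the KFP SDK. This is a minimal illustration, not the actual kfp-integration.yaml: the template name, image, KFP endpoint, and parameter names are all placeholders.

```yaml
# Hypothetical sketch: an Argo step that triggers a Kubeflow Pipelines run.
# Endpoint, image, and names are assumptions, not values from the real repo.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: kfp-trigger
spec:
  entrypoint: trigger
  templates:
    - name: trigger
      inputs:
        parameters:
          - name: pipeline-package   # path to a compiled KFP pipeline package
          - name: experiment         # KFP experiment to file the run under
      container:
        image: python:3.11-slim
        command: [sh, -c]
        args:
          - |
            pip install kfp >/dev/null &&
            python - <<'EOF'
            import kfp
            # Assumed in-cluster KFP API endpoint.
            client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")
            run = client.create_run_from_pipeline_package(
                "{{inputs.parameters.pipeline-package}}",
                arguments={},
                experiment_name="{{inputs.parameters.experiment}}",
            )
            print("submitted KFP run:", run.run_id)
            EOF
```

Wrapping the KFP call in a WorkflowTemplate keeps the "Argo orchestrates, KFP executes ML steps" split explicit: Argo owns retries and sequencing, while caching and experiment tracking stay inside Kubeflow.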
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

        OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
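The event-driven path could take the shape of the following Argo Events Sensor. This is a sketch under assumptions: the EventSource name, subject, and WorkflowTemplate reference are illustrative, not taken from the homelab repos.

```yaml
# Hypothetical sketch: fire a Workflow when a NATS message arrives.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: doc-ingest-sensor
spec:
  dependencies:
    - name: doc-event
      eventSourceName: nats-events      # assumed NATS EventSource
      eventName: documents              # assumed event name
  triggers:
    - template:
        name: run-ingest
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: doc-ingest-
              spec:
                workflowTemplateRef:
                  name: document-ingestion   # assumed WorkflowTemplate
```

A second trigger on the same Sensor could hit the KFP API instead, which is the fork shown in the diagram above.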
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/argo-workflows/)
* [Argo Events](https://argoproj.github.io/argo-events/)
* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea)