# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we standardize on a single engine, or leverage the strengths of multiple engines?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
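The hybrid row above can be sketched as an Argo WorkflowTemplate whose step submits a Kubeflow run through the KFP SDK. This is a minimal illustration, not the actual kfp-integration.yaml: the template name, image, KFP endpoint, and parameter names are all placeholders.

```yaml
# Hypothetical sketch: an Argo step that triggers a Kubeflow Pipelines run.
# Endpoint, image, and names are assumptions, not values from the real repo.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: kfp-trigger
spec:
  entrypoint: trigger
  templates:
    - name: trigger
      inputs:
        parameters:
          - name: pipeline-package   # path to a compiled KFP pipeline package
          - name: experiment         # KFP experiment to file the run under
      container:
        image: python:3.11-slim
        command: [sh, -c]
        args:
          - |
            pip install kfp >/dev/null &&
            python - <<'EOF'
            import kfp
            # Assumed in-cluster KFP API endpoint.
            client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")
            run = client.create_run_from_pipeline_package(
                "{{inputs.parameters.pipeline-package}}",
                arguments={},
                experiment_name="{{inputs.parameters.experiment}}",
            )
            print("submitted KFP run:", run.run_id)
            EOF
```

Wrapping the KFP call in a WorkflowTemplate keeps the "Argo orchestrates, KFP executes ML steps" split explicit: Argo owns retries and sequencing, while caching and experiment tracking stay inside Kubeflow.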
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

        OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
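The event-driven path could take the shape of the following Argo Events Sensor. This is a sketch under assumptions: the EventSource name, subject, and WorkflowTemplate reference are illustrative, not taken from the homelab repos.

```yaml
# Hypothetical sketch: fire a Workflow when a NATS message arrives.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: doc-ingest-sensor
spec:
  dependencies:
    - name: doc-event
      eventSourceName: nats-events      # assumed NATS EventSource
      eventName: documents              # assumed event name
  triggers:
    - template:
        name: run-ingest
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: doc-ingest-
              spec:
                workflowTemplateRef:
                  name: document-ingestion   # assumed WorkflowTemplate
```

A second trigger on the same Sensor could hit the KFP API instead, which is the fork shown in the diagram above.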
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/argo-workflows/)
* [Argo Events](https://argoproj.github.io/argo-events/)
* Related: kfp-integration.yaml (formerly in llm-workflows, now in the `argo` repo on Gitea)