# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:

- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both

Should we standardize on a single engine, or use multiple engines and play to each one's strengths?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
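The routing rule implied by this decision can be sketched in code. The following is an illustrative Python sketch, not production code; `WorkflowSpec` and `select_engine` are hypothetical names invented here to make the rule concrete:

```python
from dataclasses import dataclass


@dataclass
class WorkflowSpec:
    """Hypothetical descriptor for an incoming pipeline request."""
    needs_ml_caching: bool = False          # reuse cached component outputs
    needs_experiment_tracking: bool = False  # metrics, run comparison
    complex_control_flow: bool = False       # branching, retries, fan-out


def select_engine(spec: WorkflowSpec) -> str:
    """Route a workflow to an engine following the decision above."""
    ml_centric = spec.needs_ml_caching or spec.needs_experiment_tracking
    if ml_centric and spec.complex_control_flow:
        return "both"      # Argo orchestrates, KFP runs the ML steps
    if ml_centric:
        return "kubeflow"
    return "argo"


# Examples mirroring the decision matrix below:
print(select_engine(WorkflowSpec(needs_ml_caching=True)))      # kubeflow
print(select_engine(WorkflowSpec(complex_control_flow=True)))  # argo
print(select_engine(WorkflowSpec(needs_ml_caching=True,
                                 complex_control_flow=True)))  # both
```

Encoding the rule in one place keeps "which engine?" from being re-litigated per pipeline.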
### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |

### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
(WorkflowTemplate)
```

## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG support
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

### Both engines (chosen)

* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)
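For reference, the `kfp-trigger` step in the integration diagram boils down to a call against the Kubeflow Pipelines REST API. A minimal Python sketch, assuming an in-cluster service address of `ml-pipeline.kubeflow.svc:8888` and the v1beta1 run endpoint (both assumptions of this sketch; the actual step is defined in `kfp-integration.yaml`):

```python
import json
import urllib.request

# Assumed in-cluster KFP API server address; adjust for your deployment.
KFP_API = "http://ml-pipeline.kubeflow.svc:8888/apis/v1beta1/runs"


def build_run_request(run_name: str, pipeline_id: str, params: dict) -> dict:
    """Build a run-creation payload for the KFP v1beta1 runs endpoint."""
    return {
        "name": run_name,
        "pipeline_spec": {
            "pipeline_id": pipeline_id,
            "parameters": [
                {"name": k, "value": str(v)} for k, v in params.items()
            ],
        },
    }


def trigger_run(payload: dict) -> dict:
    """POST the payload to the KFP API server (requires cluster access)."""
    req = urllib.request.Request(
        KFP_API,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_run_request("nightly-train", "abc-123", {"epochs": 5})
print(json.dumps(payload, indent=2))
```

The same payload works whether the caller is an Argo Events sensor or a step inside an Argo `WorkflowTemplate`, which is what makes the two integration paths in the diagram interchangeable.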