Dual Workflow Engine Strategy (Argo + Kubeflow)

  • Status: accepted
  • Date: 2026-01-15
  • Deciders: Billy Davies
  • Technical Story: Selecting workflow orchestration for ML pipelines

Context and Problem Statement

The AI platform needs workflow orchestration for:

  • ML training pipelines with caching
  • Document ingestion (batch)
  • Complex DAG workflows (training → evaluation → deployment)
  • Hybrid scenarios combining both

Should we standardize on a single engine, or use multiple engines and leverage the strengths of each?

Decision Drivers

  • ML-specific features (caching, lineage)
  • Complex DAG support
  • Kubernetes-native execution
  • Visibility and debugging
  • Community and ecosystem
  • Integration with existing tools

Considered Options

  • Kubeflow Pipelines only
  • Argo Workflows only
  • Both engines with clear use cases
  • Airflow on Kubernetes
  • Prefect/Dagster

Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

Decision Matrix

Use Case                    Engine    Reason
ML training with caching    Kubeflow  Component caching, experiment tracking
Model evaluation            Kubeflow  Metric collection, comparison
Document ingestion          Argo      Simple DAG, no ML features needed
Batch inference             Argo      Parallelization, retries
Complex DAG with branching  Argo      Superior control flow
Hybrid ML training          Both      Argo orchestrates, KFP for ML steps
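The routing rule above can be captured in code so that pipeline tooling picks the engine consistently. The following is a minimal sketch; the helper name and use-case keys are illustrative assumptions, not part of any existing repo.

```python
# Hypothetical engine-routing helper mirroring the ADR-0009 decision matrix.
# Keys and function names are illustrative, not from an existing codebase.

DECISION_MATRIX = {
    "ml-training": ("kubeflow", "component caching, experiment tracking"),
    "model-evaluation": ("kubeflow", "metric collection, comparison"),
    "document-ingestion": ("argo", "simple DAG, no ML features needed"),
    "batch-inference": ("argo", "parallelization, retries"),
    "complex-dag": ("argo", "superior control flow"),
    "hybrid-ml-training": ("both", "Argo orchestrates, KFP runs ML steps"),
}


def pick_engine(use_case: str) -> str:
    """Return the workflow engine recorded for a use case."""
    try:
        return DECISION_MATRIX[use_case][0]
    except KeyError:
        raise ValueError(f"no engine decision recorded for {use_case!r}")
```

Encoding the matrix in one place keeps the "which engine do I use?" question answerable by tooling rather than tribal knowledge.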

Positive Consequences

  • Best tool for each job
  • ML pipelines get proper caching
  • Complex workflows get better DAG support
  • Can integrate via Argo Events
  • Gradual migration possible

Negative Consequences

  • Two systems to maintain
  • Team needs to learn both
  • More complex debugging
  • Integration overhead

Integration Architecture

NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                 (WorkflowTemplate)
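The kfp-trigger step in the second path can be sketched as a plain container step that POSTs a run request to the Kubeflow Pipelines REST API. This is a hedged sketch: the in-cluster service hostname, API path (the KFP v2beta1 runs endpoint), image tag, and pipeline ID are assumptions, not values from this cluster.

```python
import json

# Assumed in-cluster KFP v2 REST endpoint; adjust host/port for your deployment.
KFP_API = "http://ml-pipeline.kubeflow.svc:8888/apis/v2beta1/runs"


def kfp_trigger_step(pipeline_id: str, run_name: str) -> dict:
    """Build an Argo Workflow step spec whose container POSTs a KFP run request.

    Returns the step as a plain dict, ready to embed in a WorkflowTemplate's
    templates list. The request body shape follows the KFP v2beta1 run schema.
    """
    body = json.dumps({
        "display_name": run_name,
        "pipeline_version_reference": {"pipeline_id": pipeline_id},
    })
    return {
        "name": "kfp-trigger",
        "container": {
            "image": "curlimages/curl:8.5.0",  # illustrative image choice
            "args": [
                "-sf", "-X", "POST", KFP_API,
                "-H", "Content-Type: application/json",
                "-d", body,
            ],
        },
    }
```

Keeping the trigger as an ordinary container step means Argo retains retry and visibility semantics for the handoff, while KFP owns caching and tracking once the run starts.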

Pros and Cons of the Options

Kubeflow Pipelines only

  • Good, because it is ML-focused
  • Good, because of component-level caching
  • Good, because of built-in experiment tracking
  • Bad, because of limited DAG features
  • Bad, because of less flexible control flow

Argo Workflows only

  • Good, because of powerful DAG support
  • Good, because of flexible control flow
  • Good, because of strong debugging and visibility
  • Bad, because it has no ML-aware caching
  • Bad, because it has no experiment tracking

Both engines (chosen)

  • Good, because it combines the best of both engines
  • Good, because each job gets the appropriate tool
  • Good, because the engines can be integrated (e.g. via Argo Events)
  • Bad, because of added operational complexity
  • Bad, because the team must learn two systems

Airflow

  • Good, because it is mature
  • Good, because of its large community
  • Bad, because it is Python-centric
  • Bad, because it is not Kubernetes-native
  • Bad, because it has no ML-specific features

Prefect/Dagster

  • Good, because of their modern design
  • Good, because they are Python-native
  • Bad, because they are less Kubernetes-native
  • Bad, because they are newer and less battle-tested