Dual Workflow Engine Strategy (Argo + Kubeflow)
- Status: accepted
- Date: 2026-01-15
- Deciders: Billy Davies
- Technical Story: Selecting workflow orchestration for ML pipelines
Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both
Should we standardize on a single engine, or use multiple engines and leverage the strengths of each?
Decision Drivers
- ML-specific features (caching, lineage)
- Complex DAG support
- Kubernetes-native execution
- Visibility and debugging
- Community and ecosystem
- Integration with existing tools
Considered Options
- Kubeflow Pipelines only
- Argo Workflows only
- Both engines with clear use cases
- Airflow on Kubernetes
- Prefect/Dagster
Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
Decision Matrix
| Use Case | Engine | Reason |
|---|---|---|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
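To illustrate, the matrix above can be collapsed into a simple routing rule: ML-specific needs (caching, experiment tracking) point to Kubeflow, complex control flow points to Argo, and both together indicate the hybrid pattern. The sketch below encodes that rule; the function and parameter names are hypothetical, not part of the platform.

```python
# Hypothetical sketch: the decision matrix as a routing helper.
# route_workflow and its parameters are illustrative names only.

def route_workflow(needs_caching: bool,
                   needs_experiment_tracking: bool,
                   complex_control_flow: bool) -> str:
    """Pick an engine per the decision matrix: Kubeflow for ML-centric
    work, Argo for complex DAGs, both for hybrid ML training."""
    ml_centric = needs_caching or needs_experiment_tracking
    if ml_centric and complex_control_flow:
        return "both"      # Argo orchestrates, KFP runs the ML steps
    if ml_centric:
        return "kubeflow"  # component caching, experiment tracking
    return "argo"          # parallelization, retries, branching

# Rows of the matrix expressed as calls:
print(route_workflow(True, True, False))   # ML training with caching
print(route_workflow(False, False, True))  # complex DAG with branching
print(route_workflow(True, False, True))   # hybrid ML training
```

This is a mental model, not production code: in practice the routing decision is made per pipeline at design time, not dynamically.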
Positive Consequences
- Best tool for each job
- ML pipelines get proper caching
- Complex workflows get better DAG support
- Can integrate via Argo Events
- Gradual migration possible
Negative Consequences
- Two systems to maintain
- Team needs to learn both
- More complex debugging
- Integration overhead
Integration Architecture
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

OR

Argo Workflow ──► Step: kfp-trigger (WorkflowTemplate) ──► Kubeflow Pipeline
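To make the first path concrete, a minimal Argo Events Sensor could fan out to both engines: one trigger submits an Argo Workflow, the other calls the Kubeflow Pipelines REST API. This is a sketch only; the resource names, event source, WorkflowTemplate, and KFP endpoint below are illustrative assumptions, not the platform's actual manifests.

```yaml
# Hypothetical sketch of the Sensor in the diagram above.
# Names, namespaces, and the KFP endpoint are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: doc-ingest-router
spec:
  dependencies:
    - name: nats-msg
      eventSourceName: nats-events        # assumed NATS EventSource
      eventName: doc-ingest
  triggers:
    - template:
        name: argo-workflow
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: doc-ingest-
              spec:
                workflowTemplateRef:
                  name: document-ingestion   # assumed WorkflowTemplate
    - template:
        name: kfp-run
        http:   # call the Kubeflow Pipelines REST API directly
          url: http://ml-pipeline.kubeflow.svc:8888/apis/v2beta1/runs
          method: POST
          payload:
            - src:
                dependencyName: nats-msg
                dataKey: body
              dest: runtime_config
```

In the second path, the same HTTP call would instead live in a reusable `kfp-trigger` step inside an Argo WorkflowTemplate, so an Argo DAG can launch a Kubeflow run as one of its nodes.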
Pros and Cons of the Options
Kubeflow Pipelines only
- Good, because ML-focused
- Good, because of built-in component caching
- Good, because experiment tracking
- Bad, because limited DAG features
- Bad, because less flexible control flow
Argo Workflows only
- Good, because of powerful DAG support
- Good, because flexible
- Good, because great debugging
- Bad, because no ML caching
- Bad, because no experiment tracking
Both engines (chosen)
- Good, because best of both
- Good, because appropriate tool per job
- Good, because can integrate
- Bad, because operational complexity
- Bad, because learning two systems
Airflow
- Good, because mature
- Good, because large community
- Bad, because Python-centric
- Bad, because not Kubernetes-native
- Bad, because no ML features
Prefect/Dagster
- Good, because modern design
- Good, because Python-native
- Bad, because less Kubernetes-native
- Bad, because newer/less proven
Links
- Kubeflow Pipelines
- Argo Workflows
- Argo Events
- Related: kfp-integration.yaml (formerly in llm-workflows, now in the argo repo on Gitea)