Dual Workflow Engine Strategy (Argo + Kubeflow)

  • Status: accepted
  • Date: 2026-01-15
  • Deciders: Billy Davies
  • Technical Story: Selecting workflow orchestration for ML pipelines

Context and Problem Statement

The AI platform needs workflow orchestration for:

  • ML training pipelines with caching
  • Document ingestion (batch)
  • Complex DAG workflows (training → evaluation → deployment)
  • Hybrid scenarios combining both

Should we standardize on a single engine, or use multiple engines and leverage the strengths of each?

Decision Drivers

  • ML-specific features (caching, lineage)
  • Complex DAG support
  • Kubernetes-native execution
  • Visibility and debugging
  • Community and ecosystem
  • Integration with existing tools

Considered Options

  • Kubeflow Pipelines only
  • Argo Workflows only
  • Both engines with clear use cases
  • Airflow on Kubernetes
  • Prefect/Dagster

Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

Decision Matrix

Use Case                    Engine    Reason
ML training with caching    Kubeflow  Component caching, experiment tracking
Model evaluation            Kubeflow  Metric collection, comparison
Document ingestion          Argo      Simple DAG, no ML features needed
Batch inference             Argo      Parallelization, retries
Complex DAG with branching  Argo      Superior control flow
Hybrid ML training          Both      Argo orchestrates, KFP for ML steps
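The routing rule above can be captured in code so that pipeline tooling picks the engine consistently. The following is a minimal sketch; the helper name and use-case keys are illustrative assumptions, not part of any existing repo.

```python
# Hypothetical engine-routing helper mirroring the ADR-0009 decision matrix.
# Keys and function names are illustrative, not from an existing codebase.

DECISION_MATRIX = {
    "ml-training": ("kubeflow", "component caching, experiment tracking"),
    "model-evaluation": ("kubeflow", "metric collection, comparison"),
    "document-ingestion": ("argo", "simple DAG, no ML features needed"),
    "batch-inference": ("argo", "parallelization, retries"),
    "complex-dag": ("argo", "superior control flow"),
    "hybrid-ml-training": ("both", "Argo orchestrates, KFP runs ML steps"),
}


def pick_engine(use_case: str) -> str:
    """Return the workflow engine recorded for a use case."""
    try:
        return DECISION_MATRIX[use_case][0]
    except KeyError:
        raise ValueError(f"no engine decision recorded for {use_case!r}")
```

Encoding the matrix in one place keeps the "which engine do I use?" question answerable by tooling rather than tribal knowledge.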

Positive Consequences

  • Best tool for each job
  • ML pipelines get proper caching
  • Complex workflows get better DAG support
  • Can integrate via Argo Events
  • Gradual migration possible

Negative Consequences

  • Two systems to maintain
  • Team needs to learn both
  • More complex debugging
  • Integration overhead

Integration Architecture

NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                 (WorkflowTemplate)
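The kfp-trigger step in the second path can be sketched as a plain container step that POSTs a run request to the Kubeflow Pipelines REST API. This is a hedged sketch: the in-cluster service hostname, API path (the KFP v2beta1 runs endpoint), image tag, and pipeline ID are assumptions, not values from this cluster.

```python
import json

# Assumed in-cluster KFP v2 REST endpoint; adjust host/port for your deployment.
KFP_API = "http://ml-pipeline.kubeflow.svc:8888/apis/v2beta1/runs"


def kfp_trigger_step(pipeline_id: str, run_name: str) -> dict:
    """Build an Argo Workflow step spec whose container POSTs a KFP run request.

    Returns the step as a plain dict, ready to embed in a WorkflowTemplate's
    templates list. The request body shape follows the KFP v2beta1 run schema.
    """
    body = json.dumps({
        "display_name": run_name,
        "pipeline_version_reference": {"pipeline_id": pipeline_id},
    })
    return {
        "name": "kfp-trigger",
        "container": {
            "image": "curlimages/curl:8.5.0",  # illustrative image choice
            "args": [
                "-sf", "-X", "POST", KFP_API,
                "-H", "Content-Type: application/json",
                "-d", body,
            ],
        },
    }
```

Keeping the trigger as an ordinary container step means Argo retains retry and visibility semantics for the handoff, while KFP owns caching and tracking once the run starts.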

Pros and Cons of the Options

Kubeflow Pipelines only

  • Good, because it is ML-focused
  • Good, because of component-level caching
  • Good, because of built-in experiment tracking
  • Bad, because of limited DAG features
  • Bad, because of less flexible control flow

Argo Workflows only

  • Good, because of powerful DAG support
  • Good, because of flexible control flow
  • Good, because of strong debugging and visibility
  • Bad, because it has no ML-aware caching
  • Bad, because it has no experiment tracking

Both engines (chosen)

  • Good, because it combines the best of both engines
  • Good, because each job gets the appropriate tool
  • Good, because the engines can be integrated (e.g. via Argo Events)
  • Bad, because of added operational complexity
  • Bad, because the team must learn two systems

Airflow

  • Good, because it is mature
  • Good, because of its large community
  • Bad, because it is Python-centric
  • Bad, because it is not Kubernetes-native
  • Bad, because it has no ML-specific features

Prefect/Dagster

  • Good, because of their modern design
  • Good, because they are Python-native
  • Bad, because they are less Kubernetes-native
  • Bad, because they are newer and less battle-tested