Dual Workflow Engine Strategy (Argo + Kubeflow)

  • Status: accepted
  • Date: 2026-01-15
  • Deciders: Billy Davies
  • Technical Story: Selecting workflow orchestration for ML pipelines

Context and Problem Statement

The AI platform needs workflow orchestration for:

  • ML training pipelines with caching
  • Document ingestion (batch)
  • Complex DAG workflows (training → evaluation → deployment)
  • Hybrid scenarios combining both

Should we standardize on a single engine, or leverage the strengths of multiple engines with clearly delineated use cases?

Decision Drivers

  • ML-specific features (caching, lineage)
  • Complex DAG support
  • Kubernetes-native execution
  • Visibility and debugging
  • Community and ecosystem
  • Integration with existing tools

Considered Options

  • Kubeflow Pipelines only
  • Argo Workflows only
  • Both engines with clear use cases
  • Airflow on Kubernetes
  • Prefect/Dagster

Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

Decision Matrix

| Use Case | Engine | Reason |
|---|---|---|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
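
The routing rules in the matrix can be captured as a small dispatch helper, which is one way to keep the decision enforceable in tooling. This is a hypothetical sketch; `choose_engine` and the use-case keys are illustrative names, not part of either project's API.

```python
# Hypothetical sketch of the routing rules in the decision matrix above.
# The use-case keys and choose_engine are illustrative, not real APIs.

ENGINE_BY_USE_CASE = {
    "ml-training": "kubeflow",       # component caching, experiment tracking
    "model-evaluation": "kubeflow",  # metric collection, comparison
    "document-ingestion": "argo",    # simple DAG, no ML features needed
    "batch-inference": "argo",       # parallelization, retries
    "complex-dag": "argo",           # superior control flow
    "hybrid-ml-training": "both",    # Argo orchestrates, KFP runs the ML steps
}

def choose_engine(use_case: str) -> str:
    """Return the workflow engine for a use case, defaulting to Argo."""
    return ENGINE_BY_USE_CASE.get(use_case, "argo")
```

Defaulting unknown use cases to Argo reflects the matrix's bias: Argo is the general-purpose orchestrator, Kubeflow is reserved for workflows that need its ML features.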

Positive Consequences

  • Best tool for each job
  • ML pipelines get proper caching
  • Complex workflows get better DAG support
  • Can integrate via Argo Events
  • Gradual migration possible
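
The caching benefit rests on content-addressed step memoization: a cached output is reused when the same component definition sees the same resolved inputs. A minimal sketch of that idea follows; it is not KFP's actual implementation, just the mechanism it relies on.

```python
import hashlib
import json

# Minimal sketch of content-addressed step caching, the idea behind
# Kubeflow Pipelines component caching. Not KFP's actual implementation.
_cache: dict = {}

def cache_key(component_id: str, inputs: dict) -> str:
    """Key = hash of the component identity plus canonicalized inputs."""
    payload = json.dumps({"component": component_id, "inputs": inputs},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_cached(component_id: str, inputs: dict, fn):
    """Reuse a prior result when the same component sees the same inputs."""
    key = cache_key(component_id, inputs)
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]
```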

Negative Consequences

  • Two systems to maintain
  • Team needs to learn both
  • More complex debugging
  • Integration overhead

Integration Architecture

NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                 (WorkflowTemplate)
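
In the second path, the kfp-trigger step would typically be a small container that calls the KFP REST API. The sketch below shows one way to build and send that request; the endpoint path, service address, and field names are assumptions based on the v2beta1 API shape and should be verified against the deployed KFP version.

```python
import json
import urllib.request

# Sketch of an Argo "kfp-trigger" step: build and POST a run-creation
# request to the Kubeflow Pipelines REST API. The endpoint and body
# fields are assumptions (v2beta1-style); verify against the deployed KFP.
KFP_RUNS_ENDPOINT = "http://ml-pipeline.kubeflow:8888/apis/v2beta1/runs"  # assumed address

def build_run_request(display_name: str, pipeline_id: str,
                      parameters: dict) -> dict:
    """Assemble the JSON body for creating a pipeline run."""
    return {
        "display_name": display_name,
        "pipeline_version_reference": {"pipeline_id": pipeline_id},
        "runtime_config": {"parameters": parameters},
    }

def trigger_run(body: dict) -> None:
    """POST the run request; runs inside the Argo step container."""
    req = urllib.request.Request(
        KFP_RUNS_ENDPOINT,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises HTTPError on non-2xx responses
```

Keeping the trigger as a plain HTTP call (rather than importing the KFP SDK) keeps the Argo step image small and the coupling between the two engines limited to the API boundary.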

Pros and Cons of the Options

Kubeflow Pipelines only

  • Good, because ML-focused
  • Good, because caching
  • Good, because experiment tracking
  • Bad, because limited DAG features
  • Bad, because less flexible control flow

Argo Workflows only

  • Good, because powerful DAG
  • Good, because flexible
  • Good, because great debugging
  • Bad, because no ML caching
  • Bad, because no experiment tracking

Both engines (chosen)

  • Good, because best of both
  • Good, because appropriate tool per job
  • Good, because can integrate
  • Bad, because operational complexity
  • Bad, because learning two systems

Airflow

  • Good, because mature
  • Good, because large community
  • Bad, because Python-centric
  • Bad, because not Kubernetes-native
  • Bad, because no ML features

Prefect/Dagster

  • Good, because modern design
  • Good, because Python-native
  • Bad, because less Kubernetes-native
  • Bad, because newer/less proven