# Dual Workflow Engine Strategy (Argo + Kubeflow)

* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines

## Context and Problem Statement

The AI platform needs workflow orchestration for:

- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios combining both

Should we standardize on one engine or leverage the strengths of multiple?

## Decision Drivers

* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools

## Considered Options

* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster

## Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

### Decision Matrix

| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |

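The "Hybrid ML training" row assumes an Argo step that submits a Kubeflow pipeline run. A minimal sketch of such a `WorkflowTemplate` step calling the KFP REST API from a container; the service address, `v2beta1` path, payload shape, and parameter name are illustrative assumptions and should be checked against the deployed KFP version:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: kfp-trigger
spec:
  templates:
    - name: submit-kfp-run
      inputs:
        parameters:
          - name: pipeline-id   # hypothetical input: KFP pipeline to run
      container:
        image: curlimages/curl:8.5.0
        command: [sh, -c]
        args:
          - >
            curl -sf -X POST
            http://ml-pipeline.kubeflow.svc:8888/apis/v2beta1/runs
            -H 'Content-Type: application/json'
            -d '{"display_name": "argo-triggered-run",
                 "pipeline_version_reference": {"pipeline_id": "{{inputs.parameters.pipeline-id}}"}}'
```

In practice a small client image (or the KFP SDK) would replace raw `curl` so the step can poll run status before completing.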
### Positive Consequences

* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* Can integrate via Argo Events
* Gradual migration possible

### Negative Consequences

* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead

## Integration Architecture

```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```

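The NATS → Argo Events path above can be sketched as a Sensor that submits a Workflow on each event. The event source, event name, and referenced template are illustrative assumptions, not taken from the repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: doc-ingest-sensor
spec:
  dependencies:
    - name: nats-dep
      eventSourceName: nats        # assumed NATS EventSource name
      eventName: doc-uploaded      # assumed event name
  triggers:
    - template:
        name: run-ingest
        argoWorkflow:
          operation: submit        # submit a new Workflow per event
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: doc-ingest-
              spec:
                workflowTemplateRef:
                  name: doc-ingestion   # assumed WorkflowTemplate
```

A `parameters` block on the trigger can map fields from the NATS message payload into the submitted Workflow's arguments.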
## Pros and Cons of the Options

### Kubeflow Pipelines only

* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow

### Argo Workflows only

* Good, because powerful DAG support
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking

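The DAG and retry strengths listed above can be illustrated with a minimal Workflow of the document-ingestion shape (all names and images here are illustrative, not from the repository):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: batch-ingest-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: fetch
            template: step
          - name: parse
            template: step
            dependencies: [fetch]      # runs after fetch succeeds
          - name: index
            template: step
            dependencies: [parse]
    - name: step
      retryStrategy:
        limit: "3"                     # retry transient failures up to 3 times
      container:
        image: alpine:3.19
        command: [sh, -c, "echo processing"]
```

Fan-out over a document list would use `withItems`/`withParam` on a task; branching uses `when` expressions, which is the control flow KFP lacks.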
### Both engines (chosen)

* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems

### Airflow

* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features

### Prefect/Dagster

* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven

## Links

* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)