Dual Workflow Engine Strategy (Argo + Kubeflow)

  • Status: accepted
  • Date: 2026-01-15
  • Deciders: Billy Davies
  • Technical Story: Selecting workflow orchestration for ML pipelines

Context and Problem Statement

The AI platform needs workflow orchestration for:

  • ML training pipelines with caching
  • Document ingestion (batch)
  • Complex DAG workflows (training → evaluation → deployment)
  • Hybrid scenarios combining both

Should we standardize on a single engine, or leverage the strengths of multiple engines with clearly delineated use cases?

Decision Drivers

  • ML-specific features (caching, lineage)
  • Complex DAG support
  • Kubernetes-native execution
  • Visibility and debugging
  • Community and ecosystem
  • Integration with existing tools

Considered Options

  • Kubeflow Pipelines only
  • Argo Workflows only
  • Both engines with clear use cases
  • Airflow on Kubernetes
  • Prefect/Dagster

Decision Outcome

Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.

Decision Matrix

| Use Case | Engine | Reason |
|---|---|---|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates, KFP for ML steps |
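
The routing rules in the matrix can be captured as a small dispatch helper, which is one way to keep the decision enforceable in tooling. This is a hypothetical sketch; `choose_engine` and the use-case keys are illustrative names, not part of either project's API.

```python
# Hypothetical sketch of the routing rules in the decision matrix above.
# The use-case keys and choose_engine are illustrative, not real APIs.

ENGINE_BY_USE_CASE = {
    "ml-training": "kubeflow",       # component caching, experiment tracking
    "model-evaluation": "kubeflow",  # metric collection, comparison
    "document-ingestion": "argo",    # simple DAG, no ML features needed
    "batch-inference": "argo",       # parallelization, retries
    "complex-dag": "argo",           # superior control flow
    "hybrid-ml-training": "both",    # Argo orchestrates, KFP runs the ML steps
}

def choose_engine(use_case: str) -> str:
    """Return the workflow engine for a use case, defaulting to Argo."""
    return ENGINE_BY_USE_CASE.get(use_case, "argo")
```

Defaulting unknown use cases to Argo reflects the matrix's bias: Argo is the general-purpose orchestrator, Kubeflow is reserved for workflows that need its ML features.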

Positive Consequences

  • Best tool for each job
  • ML pipelines get proper caching
  • Complex workflows get better DAG support
  • Can integrate via Argo Events
  • Gradual migration possible
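
The caching benefit rests on content-addressed step memoization: a cached output is reused when the same component definition sees the same resolved inputs. A minimal sketch of that idea follows; it is not KFP's actual implementation, just the mechanism it relies on.

```python
import hashlib
import json

# Minimal sketch of content-addressed step caching, the idea behind
# Kubeflow Pipelines component caching. Not KFP's actual implementation.
_cache: dict = {}

def cache_key(component_id: str, inputs: dict) -> str:
    """Key = hash of the component identity plus canonicalized inputs."""
    payload = json.dumps({"component": component_id, "inputs": inputs},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_cached(component_id: str, inputs: dict, fn):
    """Reuse a prior result when the same component sees the same inputs."""
    key = cache_key(component_id, inputs)
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]
```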

Negative Consequences

  • Two systems to maintain
  • Team needs to learn both
  • More complex debugging
  • Integration overhead

Integration Architecture

NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        │
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                 (WorkflowTemplate)
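
In the second path, the kfp-trigger step would typically be a small container that calls the KFP REST API. The sketch below shows one way to build and send that request; the endpoint path, service address, and field names are assumptions based on the v2beta1 API shape and should be verified against the deployed KFP version.

```python
import json
import urllib.request

# Sketch of an Argo "kfp-trigger" step: build and POST a run-creation
# request to the Kubeflow Pipelines REST API. The endpoint and body
# fields are assumptions (v2beta1-style); verify against the deployed KFP.
KFP_RUNS_ENDPOINT = "http://ml-pipeline.kubeflow:8888/apis/v2beta1/runs"  # assumed address

def build_run_request(display_name: str, pipeline_id: str,
                      parameters: dict) -> dict:
    """Assemble the JSON body for creating a pipeline run."""
    return {
        "display_name": display_name,
        "pipeline_version_reference": {"pipeline_id": pipeline_id},
        "runtime_config": {"parameters": parameters},
    }

def trigger_run(body: dict) -> None:
    """POST the run request; runs inside the Argo step container."""
    req = urllib.request.Request(
        KFP_RUNS_ENDPOINT,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises HTTPError on non-2xx responses
```

Keeping the trigger as a plain HTTP call (rather than importing the KFP SDK) keeps the Argo step image small and the coupling between the two engines limited to the API boundary.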

Pros and Cons of the Options

Kubeflow Pipelines only

  • Good, because ML-focused
  • Good, because caching
  • Good, because experiment tracking
  • Bad, because limited DAG features
  • Bad, because less flexible control flow

Argo Workflows only

  • Good, because powerful DAG
  • Good, because flexible
  • Good, because great debugging
  • Bad, because no ML caching
  • Bad, because no experiment tracking

Both engines (chosen)

  • Good, because best of both
  • Good, because appropriate tool per job
  • Good, because can integrate
  • Bad, because operational complexity
  • Bad, because learning two systems

Airflow

  • Good, because mature
  • Good, because large community
  • Bad, because Python-centric
  • Bad, because not Kubernetes-native
  • Bad, because no ML features

Prefect/Dagster

  • Good, because modern design
  • Good, because Python-native
  • Bad, because less Kubernetes-native
  • Bad, because newer/less proven