feat: add comprehensive architecture documentation

- Add AGENT-ONBOARDING.md for AI agents
- Add ARCHITECTURE.md with full system overview
- Add TECH-STACK.md with complete technology inventory
- Add DOMAIN-MODEL.md with entities and bounded contexts
- Add CODING-CONVENTIONS.md with patterns and practices
- Add GLOSSARY.md with terminology reference
- Add C4 diagrams (Context and Container levels)
- Add 10 ADRs documenting key decisions:
  - Talos Linux, NATS, MessagePack, Multi-GPU strategy
  - GitOps with Flux, KServe, Milvus, Dual workflow engines
  - Envoy Gateway
- Add specs directory with JetStream configuration
- Add diagrams for GPU allocation and data flows

Based on analysis of homelab-k8s2 and llm-workflows repositories
and kubectl cluster-info dump data.
2026-02-01 14:30:05 -05:00
parent 4d4f6f464c
commit 832cda34bd
26 changed files with 3805 additions and 2 deletions

@@ -0,0 +1,124 @@
# Dual Workflow Engine Strategy (Argo + Kubeflow)
* Status: accepted
* Date: 2026-01-15
* Deciders: Billy Davies
* Technical Story: Selecting workflow orchestration for ML pipelines
## Context and Problem Statement
The AI platform needs workflow orchestration for:
- ML training pipelines with caching
- Document ingestion (batch)
- Complex DAG workflows (training → evaluation → deployment)
- Hybrid scenarios that combine ML pipeline steps with general DAG orchestration
Should we standardize on a single engine, or combine the strengths of more than one?
## Decision Drivers
* ML-specific features (caching, lineage)
* Complex DAG support
* Kubernetes-native execution
* Visibility and debugging
* Community and ecosystem
* Integration with existing tools
## Considered Options
* Kubeflow Pipelines only
* Argo Workflows only
* Both engines with clear use cases
* Airflow on Kubernetes
* Prefect/Dagster
## Decision Outcome
Chosen option: "Both engines with clear use cases", using Kubeflow Pipelines for ML-centric workflows and Argo Workflows for complex DAG orchestration.
### Decision Matrix
| Use Case | Engine | Reason |
|----------|--------|--------|
| ML training with caching | Kubeflow | Component caching, experiment tracking |
| Model evaluation | Kubeflow | Metric collection, comparison |
| Document ingestion | Argo | Simple DAG, no ML features needed |
| Batch inference | Argo | Parallelization, retries |
| Complex DAG with branching | Argo | Superior control flow |
| Hybrid ML training | Both | Argo orchestrates; Kubeflow Pipelines (KFP) runs the ML steps |
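
The matrix above can be read as a lookup table. The following is a minimal sketch of that reading, not code from the repositories: the use-case keys (`"document-ingestion"`, and so on) and the `engines_for` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class Engine(Enum):
    KUBEFLOW = "kubeflow"
    ARGO = "argo"


# Mirrors the decision matrix row by row.
# The string keys are hypothetical identifiers, not names used by either engine.
ROUTING = {
    "ml-training-cached": (Engine.KUBEFLOW,),
    "model-evaluation": (Engine.KUBEFLOW,),
    "document-ingestion": (Engine.ARGO,),
    "batch-inference": (Engine.ARGO,),
    "complex-dag-branching": (Engine.ARGO,),
    # Hybrid: Argo orchestrates, Kubeflow Pipelines runs the ML steps.
    "hybrid-ml-training": (Engine.ARGO, Engine.KUBEFLOW),
}


def engines_for(use_case: str) -> tuple:
    """Return the engine(s) the matrix assigns to a use case."""
    try:
        return ROUTING[use_case]
    except KeyError:
        # Fail loudly so an unlisted use case triggers a deliberate decision.
        raise ValueError(f"no engine assigned for use case {use_case!r}") from None
```

Raising on an unknown use case, rather than defaulting to one engine, keeps the "clear use cases" part of the decision explicit: a new workload has to be added to the matrix before it can be routed.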
### Positive Consequences
* Best tool for each job
* ML pipelines get proper caching
* Complex workflows get better DAG support
* The two engines can be integrated via Argo Events
* Gradual migration possible
### Negative Consequences
* Two systems to maintain
* Team needs to learn both
* More complex debugging
* Integration overhead
## Integration Architecture
```
NATS Event ──► Argo Events ──► Sensor ──┬──► Argo Workflow
                                        └──► Kubeflow Pipeline (via API)

                    OR

Argo Workflow ──► Step: kfp-trigger ──► Kubeflow Pipeline
                  (WorkflowTemplate)
```
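
The second path in the diagram (an Argo Workflow step that delegates to a `kfp-trigger` WorkflowTemplate) might look roughly like the manifest built below. This is a sketch under stated assumptions: only the WorkflowTemplate name `kfp-trigger` comes from the diagram. The inner template name `trigger-pipeline`, the parameter names `pipeline-name` and `pipeline-params`, and the `generateName` prefix are hypothetical, and the actual `kfp-integration.yaml` may be structured differently.

```python
import json


def build_hybrid_workflow(pipeline_name: str, run_params: dict) -> dict:
    """Build an Argo Workflow manifest with one step that calls kfp-trigger."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # Hypothetical name prefix for generated workflow names.
        "metadata": {"generateName": "hybrid-ml-training-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {
                    "name": "main",
                    "steps": [
                        [
                            {
                                "name": "kfp-trigger",
                                "templateRef": {
                                    # WorkflowTemplate named in the diagram.
                                    "name": "kfp-trigger",
                                    # Hypothetical template inside it.
                                    "template": "trigger-pipeline",
                                },
                                "arguments": {
                                    "parameters": [
                                        # Hypothetical parameter names.
                                        {"name": "pipeline-name", "value": pipeline_name},
                                        # Argo parameters are strings, so the
                                        # pipeline arguments are JSON-encoded.
                                        {"name": "pipeline-params", "value": json.dumps(run_params)},
                                    ]
                                },
                            }
                        ]
                    ],
                }
            ],
        },
    }
```

Serializing the result with `json.dumps(build_hybrid_workflow("my-pipeline", {"epochs": 3}))` yields a document that could be submitted to the cluster, since Kubernetes accepts JSON manifests as well as YAML.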
## Pros and Cons of the Options
### Kubeflow Pipelines only
* Good, because ML-focused
* Good, because caching
* Good, because experiment tracking
* Bad, because limited DAG features
* Bad, because less flexible control flow
### Argo Workflows only
* Good, because powerful DAG
* Good, because flexible
* Good, because great debugging
* Bad, because no ML caching
* Bad, because no experiment tracking
### Both engines (chosen)
* Good, because best of both
* Good, because appropriate tool per job
* Good, because can integrate
* Bad, because operational complexity
* Bad, because learning two systems
### Airflow
* Good, because mature
* Good, because large community
* Bad, because Python-centric
* Bad, because not Kubernetes-native
* Bad, because no ML features
### Prefect/Dagster
* Good, because modern design
* Good, because Python-native
* Bad, because less Kubernetes-native
* Bad, because newer/less proven
## Links
* [Kubeflow Pipelines](https://kubeflow.org/docs/components/pipelines/)
* [Argo Workflows](https://argoproj.github.io/workflows/)
* [Argo Events](https://argoproj.github.io/events/)
* Related: [kfp-integration.yaml](../../llm-workflows/argo/kfp-integration.yaml)