feat: Add ML training and batch inference workflows

- batch-inference: LLM inference with optional RAG
- qlora-training: QLoRA adapter fine-tuning from Milvus
- hybrid-ml-training: Multi-GPU distributed training
- coqui-voice-training: XTTS voice cloning
- document-ingestion: Ingest documents to Milvus
- eventsource-kfp: Argo Events / Kubeflow integration
- kfp-integration: Bridge between Argo and Kubeflow
Commit 7104698eee (parent a8fc72dd0b), 2026-02-01 20:39:42 -05:00
8 changed files with 3365 additions and 1 deletion

README.md

# Argo Workflows
ML training and batch inference workflows for the DaviesTechLabs AI/ML platform.
## Workflows
| Workflow | Description | Trigger |
|----------|-------------|---------|
| `batch-inference` | Run LLM inference on batch inputs | `ai.pipeline.trigger` (pipeline="batch-inference") |
| `qlora-training` | Train QLoRA adapters from Milvus data | `ai.pipeline.trigger` (pipeline="qlora-training") |
| `hybrid-ml-training` | Multi-GPU distributed training | `ai.pipeline.trigger` (pipeline="hybrid-ml-training") |
| `coqui-voice-training` | XTTS voice cloning/training | `ai.pipeline.trigger` (pipeline="coqui-voice-training") |
| `document-ingestion` | Ingest documents into Milvus | `ai.pipeline.trigger` (pipeline="document-ingestion") |
## Integration
| File | Description |
|------|-------------|
| `eventsource-kfp.yaml` | Argo Events source for Kubeflow Pipelines integration |
| `kfp-integration.yaml` | Bridge workflows between Argo and Kubeflow |
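Wiring NATS into Argo Events typically takes the shape below. This is an illustrative sketch only — the resource name and the in-cluster NATS URL are assumptions, not taken from the actual `eventsource-kfp.yaml`:

```yaml
# Sketch of a NATS EventSource for the ai.pipeline.trigger subject.
# Name and URL are hypothetical; the real manifest may differ.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: ai-pipeline            # hypothetical name
spec:
  nats:
    pipeline-trigger:
      url: nats://nats.nats.svc:4222   # assumed cluster-internal NATS URL
      subject: ai.pipeline.trigger
      jsonBody: true                   # parse the message body as JSON
```

A Sensor then watches this event source and submits the matching WorkflowTemplate.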
## Architecture
```
NATS (ai.pipeline.trigger)
        │
        ▼
┌─────────────────┐
│  Argo Events    │
│  EventSource    │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│  Argo Sensor    │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ WorkflowTemplate│
│  (batch-inf,    │
│   qlora, etc.)  │
└─────────────────┘
        │
        ├──▶ GPU Pods (AMD ROCm / NVIDIA CUDA)
        ├──▶ Milvus Vector DB
        ├──▶ vLLM / Ray Serve
        └──▶ MLflow Tracking
```
## Workflow Details
### batch-inference
Run batch LLM inference with optional retrieval-augmented generation (RAG):
```bash
argo submit batch-inference.yaml \
-p input-url="s3://bucket/inputs.json" \
-p output-url="s3://bucket/outputs.json" \
-p use-rag="true" \
-p max-tokens="500"
```
### qlora-training
Fine-tune a QLoRA adapter on data pulled from Milvus collections:
```bash
argo submit qlora-training.yaml \
-p reference-model="mistralai/Mistral-7B-Instruct-v0.3" \
-p output-name="my-adapter" \
-p milvus-collections="docs,wiki" \
-p num-epochs="3"
```
### coqui-voice-training
Train XTTS voice models:
```bash
argo submit coqui-voice-training.yaml \
-p voice-name="my-voice" \
-p audio-samples-url="s3://bucket/samples/"
```
### document-ingestion
Ingest documents into Milvus:
```bash
argo submit document-ingestion.yaml \
-p source-url="s3://bucket/docs/" \
-p collection="knowledge_base" \
-p chunk-size="512"
```
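The `chunk-size` parameter controls how documents are split before embedding. A minimal character-based sketch of that splitting step (the real workflow's strategy — tokens vs. characters, whether it overlaps chunks — is an assumption here):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into fixed-size chunks before embedding.

    Hypothetical helper: the actual workflow may chunk by tokens
    rather than characters and may add overlap between chunks.
    """
    step = chunk_size - overlap
    # Slice the text at regular intervals; the final chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```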
## NATS Trigger Format
Workflows are triggered by publishing a JSON message to the NATS subject `ai.pipeline.trigger`:
```json
{
"pipeline": "qlora-training",
"parameters": {
"reference-model": "mistralai/Mistral-7B-Instruct-v0.3",
"output-name": "custom-adapter",
"num-epochs": "5"
}
}
```
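Building that payload programmatically can be sketched as follows. The `make_trigger` helper is hypothetical (not part of this repo); note that parameter values are strings, matching how Argo workflow parameters are passed:

```python
import json

def make_trigger(pipeline: str, parameters: dict[str, str]) -> bytes:
    """Encode an ai.pipeline.trigger message body (hypothetical helper)."""
    return json.dumps({"pipeline": pipeline, "parameters": parameters}).encode()

# The same qlora-training trigger shown above.
msg = make_trigger("qlora-training", {
    "reference-model": "mistralai/Mistral-7B-Instruct-v0.3",
    "output-name": "custom-adapter",
    "num-epochs": "5",
})
# Publish with any NATS client, e.g. nats-py's
# `await nc.publish("ai.pipeline.trigger", msg)`.
```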
## GPU Scheduling
Workflows use node affinity for GPU allocation:
| Node | GPU | Best For |
|------|-----|----------|
| khelben | AMD Strix Halo 64GB | Large model training, vLLM |
| elminster | NVIDIA RTX 2070 | Whisper, XTTS |
| drizzt | AMD Radeon 680M | Embeddings |
| danilo | Intel Arc | Reranker |
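Pinning a step to one of these nodes can be expressed with standard Kubernetes node affinity. The snippet below is a sketch using the hostname label; the real templates may instead (or additionally) use GPU resource requests, node labels, or tolerations:

```yaml
# Sketch: schedule a training step onto khelben (AMD Strix Halo).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - khelben
```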
## Related
- [homelab-design](https://git.daviestechlabs.io/daviestechlabs/homelab-design) - Architecture docs
- [kuberay-images](https://git.daviestechlabs.io/daviestechlabs/kuberay-images) - Ray worker images
- [handler-base](https://git.daviestechlabs.io/daviestechlabs/handler-base) - Handler library