feat: Add ML training and batch inference workflows
- batch-inference: LLM inference with optional RAG
- qlora-training: QLoRA adapter fine-tuning from Milvus
- hybrid-ml-training: Multi-GPU distributed training
- coqui-voice-training: XTTS voice cloning
- document-ingestion: Ingest documents to Milvus
- eventsource-kfp: Argo Events / Kubeflow integration
- kfp-integration: Bridge between Argo and Kubeflow
# Argo Workflows

ML training and batch inference workflows for the DaviesTechLabs AI/ML platform.

## Workflows

| Workflow | Description | Trigger |
|----------|-------------|---------|
| `batch-inference` | Run LLM inference on batch inputs | `ai.pipeline.trigger` (pipeline="batch-inference") |
| `qlora-training` | Train QLoRA adapters from Milvus data | `ai.pipeline.trigger` (pipeline="qlora-training") |
| `hybrid-ml-training` | Multi-GPU distributed training | `ai.pipeline.trigger` (pipeline="hybrid-ml-training") |
| `coqui-voice-training` | XTTS voice cloning/training | `ai.pipeline.trigger` (pipeline="coqui-voice-training") |
| `document-ingestion` | Ingest documents into Milvus | `ai.pipeline.trigger` (pipeline="document-ingestion") |

## Integration

| File | Description |
|------|-------------|
| `eventsource-kfp.yaml` | Argo Events source for Kubeflow Pipelines integration |
| `kfp-integration.yaml` | Bridge workflows between Argo and Kubeflow |

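The shape of the event source can be sketched as follows. This is an illustrative fragment, not the actual contents of `eventsource-kfp.yaml`: the metadata name, the in-cluster NATS URL, and the event key are assumptions.

```yaml
# Hypothetical Argo Events EventSource listening on the trigger subject.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: kfp-events                       # assumed name
spec:
  nats:
    pipeline-trigger:                    # assumed event key
      url: nats://nats.nats.svc:4222     # placeholder cluster URL
      subject: ai.pipeline.trigger
      jsonBody: true                     # parse the payload as JSON
```

A Sensor would then match on `pipeline` in the payload and submit the corresponding WorkflowTemplate.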
## Architecture

```
NATS (ai.pipeline.trigger)
        │
        ▼
┌─────────────────┐
│ Argo Events     │
│ EventSource     │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Argo Sensor     │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ WorkflowTemplate│
│ (batch-inf,     │
│  qlora, etc)    │
└─────────────────┘
        │
        ├──▶ GPU Pods (AMD ROCm / NVIDIA CUDA)
        ├──▶ Milvus Vector DB
        ├──▶ vLLM / Ray Serve
        └──▶ MLflow Tracking
```

## Workflow Details

### batch-inference

Batch LLM inference with optional RAG:

```bash
argo submit batch-inference.yaml \
  -p input-url="s3://bucket/inputs.json" \
  -p output-url="s3://bucket/outputs.json" \
  -p use-rag="true" \
  -p max-tokens="500"
```

### qlora-training

Fine-tune QLoRA adapters from Milvus knowledge:

```bash
argo submit qlora-training.yaml \
  -p reference-model="mistralai/Mistral-7B-Instruct-v0.3" \
  -p output-name="my-adapter" \
  -p milvus-collections="docs,wiki" \
  -p num-epochs="3"
```

### coqui-voice-training

Train XTTS voice models:

```bash
argo submit coqui-voice-training.yaml \
  -p voice-name="my-voice" \
  -p audio-samples-url="s3://bucket/samples/"
```

### document-ingestion

Ingest documents into Milvus:

```bash
argo submit document-ingestion.yaml \
  -p source-url="s3://bucket/docs/" \
  -p collection="knowledge_base" \
  -p chunk-size="512"
```

## NATS Trigger Format

Workflows are triggered by publishing a JSON message to the NATS subject `ai.pipeline.trigger`:

```json
{
  "pipeline": "qlora-training",
  "parameters": {
    "reference-model": "mistralai/Mistral-7B-Instruct-v0.3",
    "output-name": "custom-adapter",
    "num-epochs": "5"
  }
}
```

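A trigger can also be published from the command line. A minimal sketch, assuming the `nats` CLI is installed and the server URL is replaced with this cluster's actual NATS endpoint:

```shell
#!/bin/sh
# Build a trigger payload for the document-ingestion pipeline,
# using the same parameter names as the argo submit example above.
payload='{"pipeline":"document-ingestion","parameters":{"source-url":"s3://bucket/docs/","collection":"knowledge_base","chunk-size":"512"}}'

# Inspect the payload before sending.
echo "$payload"

# Publish to the subject the EventSource listens on (requires cluster access;
# the --server URL below is a placeholder):
# nats --server nats://nats.example.svc:4222 pub ai.pipeline.trigger "$payload"
```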
## GPU Scheduling

Workflows use node affinity for GPU allocation:

| Node | GPU | Best For |
|------|-----|----------|
| khelben | AMD Strix Halo 64GB | Large model training, vLLM |
| elminster | NVIDIA RTX 2070 | Whisper, XTTS |
| drizzt | AMD Radeon 680M | Embeddings |
| danilo | Intel Arc | Reranker |

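Inside a WorkflowTemplate, that affinity might look like the following fragment. Illustrative only: the hostname value matches the node table above, but the actual templates may select nodes by different labels (e.g. GPU vendor labels).

```yaml
# Hypothetical pod spec fragment pinning a training step to khelben.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - khelben   # AMD Strix Halo 64GB: large model training
```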
## Related

- [homelab-design](https://git.daviestechlabs.io/daviestechlabs/homelab-design) - Architecture docs
- [kuberay-images](https://git.daviestechlabs.io/daviestechlabs/kuberay-images) - Ray worker images
- [handler-base](https://git.daviestechlabs.io/daviestechlabs/handler-base) - Handler library