# Argo Workflows

ML training and batch inference workflows for the DaviesTechLabs AI/ML platform.

## Workflows

| Workflow | Description | Trigger |
| --- | --- | --- |
| `batch-inference` | Run LLM inference on batch inputs | `ai.pipeline.trigger` (`pipeline="batch-inference"`) |
| `qlora-training` | Train QLoRA adapters from Milvus data | `ai.pipeline.trigger` (`pipeline="qlora-training"`) |
| `hybrid-ml-training` | Multi-GPU distributed training | `ai.pipeline.trigger` (`pipeline="hybrid-ml-training"`) |
| `coqui-voice-training` | XTTS voice cloning/training | `ai.pipeline.trigger` (`pipeline="coqui-voice-training"`) |
| `document-ingestion` | Ingest documents into Milvus | `ai.pipeline.trigger` (`pipeline="document-ingestion"`) |

## Integration

| File | Description |
| --- | --- |
| `eventsource-kfp.yaml` | Argo Events source for Kubeflow Pipelines integration |
| `kfp-integration.yaml` | Bridge workflows between Argo and Kubeflow |

## Architecture

```
NATS (ai.pipeline.trigger)
         │
         ▼
┌─────────────────┐
│  Argo Events    │
│  EventSource    │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  Argo Sensor    │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│ WorkflowTemplate│
│  (batch-inf,    │
│   qlora, etc)   │
└─────────────────┘
         │
         ├──▶ GPU Pods (AMD ROCm / NVIDIA CUDA)
         ├──▶ Milvus Vector DB
         ├──▶ vLLM / Ray Serve
         └──▶ MLflow Tracking
```

## Workflow Details

### batch-inference

Run batch LLM inference with optional retrieval-augmented generation (RAG):

```shell
argo submit batch-inference.yaml \
  -p input-url="s3://bucket/inputs.json" \
  -p output-url="s3://bucket/outputs.json" \
  -p use-rag="true" \
  -p max-tokens="500"
```

### qlora-training

Fine-tune QLoRA adapters on data drawn from Milvus collections:

```shell
argo submit qlora-training.yaml \
  -p reference-model="mistralai/Mistral-7B-Instruct-v0.3" \
  -p output-name="my-adapter" \
  -p milvus-collections="docs,wiki" \
  -p num-epochs="3"
```

### coqui-voice-training

Train XTTS voice models:

```shell
argo submit coqui-voice-training.yaml \
  -p voice-name="my-voice" \
  -p audio-samples-url="s3://bucket/samples/"
```

### document-ingestion

Ingest documents into Milvus:

```shell
argo submit document-ingestion.yaml \
  -p source-url="s3://bucket/docs/" \
  -p collection="knowledge_base" \
  -p chunk-size="512"
```

## NATS Trigger Format

Workflows are triggered by publishing a message to the NATS subject `ai.pipeline.trigger`:

```json
{
  "pipeline": "qlora-training",
  "parameters": {
    "reference-model": "mistralai/Mistral-7B-Instruct-v0.3",
    "output-name": "custom-adapter",
    "num-epochs": "5"
  }
}
```
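The same trigger can be sent from a shell. The sketch below builds the payload, validates it as JSON, and shows the publish step with the NATS CLI; the `nats pub` line is commented out because it needs a reachable NATS server (server address and credentials are deployment-specific and not covered here).

```shell
# Build the trigger payload for qlora-training (values mirror the JSON above).
PAYLOAD='{
  "pipeline": "qlora-training",
  "parameters": {
    "reference-model": "mistralai/Mistral-7B-Instruct-v0.3",
    "output-name": "custom-adapter",
    "num-epochs": "5"
  }
}'

# Sanity-check that the payload is valid JSON before publishing.
echo "$PAYLOAD" | python3 -c 'import json, sys; json.load(sys.stdin)' && echo "payload ok"

# Publish to the subject (requires a reachable NATS server):
# nats pub ai.pipeline.trigger "$PAYLOAD"
```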

## GPU Scheduling

Workflows use node affinity for GPU allocation:

| Node | GPU | Best For |
| --- | --- | --- |
| khelben | AMD Strix Halo 64GB | Large model training, vLLM |
| elminster | NVIDIA RTX 2070 | Whisper, XTTS |
| drizzt | AMD Radeon 680M | Embeddings |
| danilo | Intel Arc | Reranker |
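To illustrate the scheduling intent, the lookup below mirrors the "Best For" column as a pipeline-to-node mapping. This mapping is an assumption derived from the table; the authoritative node-affinity rules live in the WorkflowTemplate specs themselves.

```shell
# Illustrative pipeline -> preferred-node lookup (assumption derived from the
# table above, not the actual affinity config in the workflow YAML).
preferred_node() {
  case "$1" in
    batch-inference|qlora-training|hybrid-ml-training) echo khelben ;;   # AMD Strix Halo 64GB
    coqui-voice-training)                              echo elminster ;; # NVIDIA RTX 2070 (XTTS)
    document-ingestion)                                echo drizzt ;;    # AMD Radeon 680M (embeddings)
    *)                                                 echo unknown ;;
  esac
}

preferred_node qlora-training   # khelben
```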
## License

MIT