daviestechlabs/homelab-design

Billy D. 35f17d6342

docs: add ADR-0054 Kubeflow Pipeline CI/CD and ADR-0055 Internal Python Package Publishing

2026-02-13 14:44:45 -05:00

Kubeflow Pipeline CI/CD

Status: accepted
Date: 2026-02-13
Deciders: Billy
Technical Story: Automate compilation and upload of Kubeflow Pipelines on git push

Context and Problem Statement

Kubeflow Pipelines are defined as Python scripts (*_pipeline.py) that compile to YAML IR documents. These must be compiled with kfp and then uploaded to the Kubeflow Pipelines API. Doing this manually is error-prone and easy to forget — a push to main should automatically make pipelines available in the Kubeflow UI.

How do we automate the compile-and-upload lifecycle for Kubeflow Pipelines using the existing Gitea Actions CI infrastructure?

Decision Drivers

Pipeline definitions change frequently as new ML workflows are added
Manual kfp pipeline upload is tedious and easy to forget
Kubeflow Pipelines API is accessible within the cluster
Gitea Actions runners already exist (ADR-0031)
Notifications via ntfy are established (ADR-0015)

Considered Options

1. Gitea Actions workflow with in-cluster KFP API access
2. Argo Events watching git repo, triggering Argo Workflow to upload
3. CronJob polling for changes
4. Manual upload via CLI

Decision Outcome

Chosen option: Option 1 — Gitea Actions workflow, because the runners are already in-cluster, the pattern is consistent with other CI workflows (ADR-0031), and it provides immediate feedback via ntfy.

Positive Consequences

Zero-touch pipeline deployment — push to main and pipelines appear in Kubeflow
Consistent CI pattern across all repositories
Version tracking with timestamped tags (v20260213-143022)
Existing pipelines get new versions; new pipelines are auto-created
ntfy notifications on success/failure

Negative Consequences

Requires NetworkPolicy to allow cross-namespace traffic (gitea → kubeflow)
Pipeline compilation happens in CI, not locally — compilation errors only surface in CI
KFP SDK version must be pinned in CI to match the cluster

Implementation

Workflow Structure

The workflow (.gitea/workflows/compile-upload.yaml) has two jobs:

| Job | Purpose |
| --- | --- |
| `compile-and-upload` | Find `*_pipeline.py`, compile each with KFP, upload YAML to Kubeflow |
| `notify` | Send ntfy notification with compile/upload summary |
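The compile half of the `compile-and-upload` job could look roughly like the sketch below. It assumes that running a pipeline script compiles it (for example through a `kfp.compiler.Compiler().compile(...)` call under `if __name__ == "__main__"`); the ADR does not say how the workflow invokes the compiler, and the function name `compile_pipelines` is hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Tuple


def compile_pipelines(scripts: List[Path]) -> Tuple[List[str], List[str]]:
    """Run each pipeline script; return (compiled, failed) script names."""
    compiled: List[str] = []
    failed: List[str] = []
    for script in scripts:
        # Assumption: executing the script writes its YAML next to it.
        result = subprocess.run(
            [sys.executable, script.name],
            cwd=script.parent,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            compiled.append(script.name)
        else:
            failed.append(script.name)
            # Surface the compiler error in the CI log.
            print(f"{script.name} failed:\n{result.stderr}", file=sys.stderr)
    return compiled, failed
```

Collecting failures instead of stopping at the first one lets the `notify` job report every broken pipeline in a single run.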

Pipeline Discovery

on:
  push:
    branches: [main]
    paths:
      - "**/*_pipeline.py"
      - "**/*pipeline*.py"
  workflow_dispatch:

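Pipelines are discovered at runtime with find . -maxdepth 1 -name '*_pipeline.py', avoiding shell issues with glob expansion in CI variables. Because of -maxdepth 1, only scripts in the repository root are picked up, even though the path filters above also trigger the workflow for matching files in subdirectories. The workflow_dispatch trigger allows manual re-runs.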
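For readers who prefer to see the discovery rule in Python, here is a minimal sketch of the same non-recursive lookup. The function name `discover_pipelines` is hypothetical; the workflow itself uses `find`, as described above.

```python
from pathlib import Path
from typing import List


def discover_pipelines(root: Path = Path(".")) -> List[Path]:
    """Return *_pipeline.py files directly under root, sorted by name."""
    # A glob pattern without '**' does not recurse, mirroring `find -maxdepth 1`.
    return sorted(p for p in root.glob("*_pipeline.py") if p.is_file())
```

Calling `discover_pipelines()` from the repository root returns only top-level scripts; a file such as `nested/foo_pipeline.py` is ignored.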

Upload Strategy

The upload step uses an inline Python script with the KFP client:

1. Connect to ml-pipeline.kubeflow.svc.cluster.local:8888
2. For each compiled YAML:
   - Check if a pipeline with that name already exists
   - Exists → upload as a new version with timestamp tag
   - New → create the pipeline
3. Report uploaded/failed counts as job outputs
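The steps above can be sketched as follows. This is not the workflow's actual inline script: the function names `upload_all` and `make_version_tag` are hypothetical, the pipeline name is assumed to be the YAML file name without its extension, and the tag format is inferred from the example `v20260213-143022`. The `client` argument is expected to behave like `kfp.Client`, whose `get_pipeline_id`, `upload_pipeline`, and `upload_pipeline_version` methods are used here.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Iterable, List


def make_version_tag(now: datetime) -> str:
    """Format a timestamp as a tag such as 'v20260213-143022'."""
    return now.strftime("v%Y%m%d-%H%M%S")


def upload_all(client: Any, packages: Iterable[Path], version_tag: str) -> Dict[str, List[str]]:
    """Upload each compiled YAML; return pipeline names grouped by outcome."""
    results: Dict[str, List[str]] = {"uploaded": [], "failed": []}
    for package in packages:
        name = package.stem  # assumption: pipeline name equals the file stem
        try:
            pipeline_id = client.get_pipeline_id(name)
            if pipeline_id:
                # Existing pipeline: add a new version labelled with the tag.
                client.upload_pipeline_version(
                    pipeline_package_path=str(package),
                    pipeline_version_name=version_tag,
                    pipeline_id=pipeline_id,
                )
            else:
                # Unknown name: create the pipeline.
                client.upload_pipeline(
                    pipeline_package_path=str(package),
                    pipeline_name=name,
                )
            results["uploaded"].append(name)
        except Exception as exc:
            # Keep going so one bad pipeline does not block the rest.
            print(f"upload failed for {name}: {exc}")
            results["failed"].append(name)
    return results
```

In CI the client would presumably be created with something like `kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")` (the `http` scheme is an assumption), and the lengths of the two result lists would become the uploaded/failed job outputs.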

NetworkPolicy Requirement

Gitea Actions runners run in the gitea namespace. Kubeflow's NetworkPolicies default-deny cross-namespace ingress. A dedicated policy was added:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gitea-ingress
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gitea

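This policy joins the existing policies for envoy (external access) and the ai-ml namespace (pipeline-bridge, kfp-sync-job). Because podSelector is empty and no ports are listed, it admits traffic from the gitea namespace to every pod and port in the kubeflow namespace, not only the ml-pipeline service.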
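One way to confirm that the policy took effect is a plain TCP check run from a runner pod in the gitea namespace. The helper below is a hypothetical sketch, not part of the workflow; it only tests whether a connection can be opened.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling `can_reach("ml-pipeline.kubeflow.svc.cluster.local", 8888)` from the runner should return `True` once the policy is applied, and `False` (after the timeout) if ingress is still blocked.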

Notification

The notify job sends a summary to ntfy.observability.svc.cluster.local:80/gitea-ci including:

Compile count and upload count
Version tag
Failed pipeline names (on failure)
Clickable link to the CI run in Gitea
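A notification carrying the fields listed above could be built as in this sketch. It relies on ntfy's HTTP publishing interface, where the message is the request body and `Title`, `Priority`, and `Click` are headers. The function name, the message wording, and the priority choice are assumptions, not the workflow's actual values.

```python
import urllib.request
from typing import List


def build_ntfy_request(
    topic_url: str,
    compiled: int,
    uploaded: int,
    version_tag: str,
    failed: List[str],
    run_url: str,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes a CI summary."""
    ok = not failed
    lines = [
        f"Compiled: {compiled}, uploaded: {uploaded}",
        f"Version: {version_tag}",
    ]
    if failed:
        lines.append("Failed: " + ", ".join(failed))
    headers = {
        "Title": "Kubeflow pipeline upload " + ("succeeded" if ok else "failed"),
        "Priority": "default" if ok else "high",  # assumption: raise priority on failure
        "Click": run_url,  # makes the notification link to the CI run
    }
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(topic_url, data=body, headers=headers, method="POST")
```

Sending is then a single call such as `urllib.request.urlopen(req, timeout=10)`, with `topic_url` set to `http://ntfy.observability.svc.cluster.local:80/gitea-ci`.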

Current Pipelines

| Pipeline | Purpose |
| --- | --- |
| `document_ingestion_pipeline` | RAG document processing with MLflow |
| `evaluation_pipeline` | Model evaluation |
| `dvd_transcription_pipeline` | DVD audio → transcript via Whisper |
| `qlora_pdf_pipeline` | QLoRA fine-tune on PDFs from S3 |
| `voice_cloning_pipeline` | Speaker extraction + VITS voice training |
| `vllm_tuning_pipeline` | vLLM inference parameter tuning |
