homelab-design/decisions/0054-kubeflow-pipeline-cicd.md
Billy D. 35f17d6342
docs: add ADR-0054 Kubeflow Pipeline CI/CD and ADR-0055 Internal Python Package Publishing
2026-02-13 14:44:45 -05:00

Kubeflow Pipeline CI/CD

  • Status: accepted
  • Date: 2026-02-13
  • Deciders: Billy
  • Technical Story: Automate compilation and upload of Kubeflow Pipelines on git push

Context and Problem Statement

Kubeflow Pipelines are defined as Python scripts (*_pipeline.py). Each script must be compiled with kfp into a YAML IR document, and that document must then be uploaded to the Kubeflow Pipelines API. Doing this manually is error-prone and easy to forget; a push to main should automatically make pipelines available in the Kubeflow UI.

How do we automate the compile-and-upload lifecycle for Kubeflow Pipelines using the existing Gitea Actions CI infrastructure?

Decision Drivers

  • Pipeline definitions change frequently as new ML workflows are added
  • Manual kfp pipeline upload is tedious and easy to forget
  • Kubeflow Pipelines API is accessible within the cluster
  • Gitea Actions runners already exist (ADR-0031)
  • Notifications via ntfy are established (ADR-0015)

Considered Options

  1. Gitea Actions workflow with in-cluster KFP API access
  2. Argo Events watching git repo, triggering Argo Workflow to upload
  3. CronJob polling for changes
  4. Manual upload via CLI

Decision Outcome

Chosen option: Option 1 — Gitea Actions workflow, because the runners are already in-cluster, the pattern is consistent with other CI workflows (ADR-0031), and it provides immediate feedback via ntfy.

Positive Consequences

  • Zero-touch pipeline deployment — push to main and pipelines appear in Kubeflow
  • Consistent CI pattern across all repositories
  • Version tracking with timestamped tags (v20260213-143022)
  • Existing pipelines get new versions; new pipelines are auto-created
  • ntfy notifications on success/failure

Negative Consequences

  • Requires NetworkPolicy to allow cross-namespace traffic (gitea → kubeflow)
  • Pipeline compilation happens in CI, not locally — compilation errors only surface in CI
  • KFP SDK version must be pinned in CI to match the cluster
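
One way to make the pinning requirement fail fast is a small check at the start of the CI job. This is only a sketch: the pinned version string below is a placeholder, since this ADR does not record which KFP SDK version the cluster runs, and the function names are illustrative.

```python
from importlib import metadata

# Placeholder pin; the real value depends on the cluster's KFP version.
PINNED_KFP_VERSION = "2.0.0"


def versions_match(installed: str, pinned: str) -> bool:
    """Return True when the installed version string equals the pin exactly."""
    return installed.strip() == pinned.strip()


def check_kfp_pin(pinned: str = PINNED_KFP_VERSION) -> None:
    """Raise if the kfp package in this environment differs from the pin."""
    # metadata.version raises PackageNotFoundError when kfp is not installed.
    installed = metadata.version("kfp")
    if not versions_match(installed, pinned):
        raise RuntimeError(f"kfp {installed} installed, but CI is pinned to {pinned}")
```

Calling check_kfp_pin() before compilation would turn a silent SDK drift into an explicit job failure.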

Implementation

Workflow Structure

The workflow (.gitea/workflows/compile-upload.yaml) has two jobs:

| Job | Purpose |
|-----|---------|
| compile-and-upload | Find *_pipeline.py, compile each with KFP, upload YAML to Kubeflow |
| notify | Send ntfy notification with compile/upload summary |
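
The compile half of the first job could look roughly like the sketch below. It assumes a convention this ADR does not confirm: that running foo_pipeline.py as a script writes foo_pipeline.yaml into the working directory. The function name and return shape are illustrative only.

```python
import subprocess
import sys
from pathlib import Path


def compile_pipelines(scripts: list[Path], workdir: Path) -> tuple[list[Path], list[str]]:
    """Run each script; return (compiled YAML paths, names of scripts that failed)."""
    compiled, failed = [], []
    for script in scripts:
        result = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        # Assumed convention: foo_pipeline.py writes foo_pipeline.yaml into workdir.
        output = workdir / f"{script.stem}.yaml"
        if result.returncode == 0 and output.is_file():
            compiled.append(output)
        else:
            failed.append(script.name)
    return compiled, failed
```

Collecting failures instead of stopping at the first one keeps a single broken script from hiding the state of the others, which is what the notify job needs for its summary.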

Pipeline Discovery

on:
  push:
    branches: [main]
    paths:
      - "**/*_pipeline.py"
      - "**/*pipeline*.py"
  workflow_dispatch:

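Pipelines are discovered at runtime with find . -maxdepth 1 -name '*_pipeline.py', which avoids shell problems with glob expansion in CI variables. Because of -maxdepth 1, only scripts in the repository root are compiled, even though the paths filter above is written with ** patterns that can match deeper directories. The workflow_dispatch trigger allows manual re-runs.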
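
For reference, the same root-only discovery can be expressed in Python. This is a sketch of equivalent behaviour, not the workflow's actual step; discover_pipelines is an illustrative name.

```python
from pathlib import Path


def discover_pipelines(root: Path) -> list[Path]:
    """Return *_pipeline.py files directly under root, sorted by name."""
    # A pattern without "**" does not recurse, which mirrors find's -maxdepth 1.
    # Unlike the find command, directories with a matching name are skipped.
    return sorted(p for p in root.glob("*_pipeline.py") if p.is_file())
```

A script placed in a subdirectory is ignored by this function, just as it is by the find invocation.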

Upload Strategy

The upload step uses an inline Python script with the KFP client:

  1. Connect to ml-pipeline.kubeflow.svc.cluster.local:8888
  2. For each compiled YAML:
    • Check if a pipeline with that name already exists
    • Exists → upload as a new version with timestamp tag
    • New → create the pipeline
  3. Report uploaded/failed counts as job outputs
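
The steps above can be sketched as follows. The sketch makes several assumptions not stated in this ADR: the pipeline name is taken from the YAML file stem, the host is reached over plain http, and the tag is built from UTC time. The client methods used (get_pipeline_id, upload_pipeline, upload_pipeline_version) are from the KFP SDK's kfp.Client; the helper names are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path

# Host from the ADR; the http scheme is an assumption.
KFP_HOST = "http://ml-pipeline.kubeflow.svc.cluster.local:8888"


def version_tag(now: datetime) -> str:
    """Format a timestamp as a tag such as v20260213-143022."""
    return now.strftime("v%Y%m%d-%H%M%S")


def upload_all(client, yaml_paths: list[Path], tag: str) -> tuple[list[str], list[str]]:
    """Upload each YAML; return (uploaded names, failed names)."""
    uploaded, failed = [], []
    for path in yaml_paths:
        name = path.stem  # assumption: pipeline name equals the file stem
        try:
            pipeline_id = client.get_pipeline_id(name)
            if pipeline_id:
                # Pipeline exists: add a new version under the timestamp tag.
                client.upload_pipeline_version(
                    pipeline_package_path=str(path),
                    pipeline_version_name=tag,
                    pipeline_id=pipeline_id,
                )
            else:
                # Pipeline is new: create it.
                client.upload_pipeline(
                    pipeline_package_path=str(path),
                    pipeline_name=name,
                )
            uploaded.append(name)
        except Exception:
            # Keep going so one bad pipeline does not block the rest.
            failed.append(name)
    return uploaded, failed


def main(yaml_dir: Path) -> None:
    import kfp  # imported lazily so the helpers above work without the SDK

    client = kfp.Client(host=KFP_HOST)
    tag = version_tag(datetime.now(timezone.utc))
    uploaded, failed = upload_all(client, sorted(yaml_dir.glob("*.yaml")), tag)
    print(f"uploaded={len(uploaded)} failed={len(failed)}")
```

Passing the client in as a parameter keeps the branching logic testable with a stub, without a reachable Kubeflow API.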

NetworkPolicy Requirement

Gitea Actions runners run in the gitea namespace. Kubeflow's NetworkPolicies default-deny cross-namespace ingress. A dedicated policy was added:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gitea-ingress
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gitea

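This policy joins the existing policies for envoy (external access) and the ai-ml namespace (pipeline-bridge, kfp-sync-job). Because podSelector is empty, it admits traffic from the gitea namespace to every pod in kubeflow, not only to the ml-pipeline service.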
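
A quick way to confirm the policy from a runner pod is a plain TCP probe against the KFP API endpoint. The helper below is a minimal sketch using only the standard library; can_connect is an illustrative name, not part of the workflow.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

From the gitea namespace, can_connect("ml-pipeline.kubeflow.svc.cluster.local", 8888) should return True once the policy is applied. A blocked connection typically shows up as a timeout rather than an immediate refusal.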

Notification

The notify job sends a summary to ntfy.observability.svc.cluster.local:80/gitea-ci including:

  • Compile count and upload count
  • Version tag
  • Failed pipeline names (on failure)
  • Clickable link to the CI run in Gitea
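
A notification carrying these fields could be built as in the sketch below. It relies on ntfy's HTTP publish interface (message in the request body, with Title, Tags, and Click headers). The titles, emoji tags, http scheme, and function names are assumptions for illustration, not the workflow's actual values.

```python
import urllib.request

# Host and topic from the ADR; the http scheme is an assumption.
NTFY_URL = "http://ntfy.observability.svc.cluster.local:80/gitea-ci"


def build_notification(
    compiled: int, uploaded: int, tag: str, failed: list[str], run_url: str
) -> urllib.request.Request:
    """Build the POST request for the summary without sending it."""
    ok = not failed
    lines = [f"Compiled: {compiled}", f"Uploaded: {uploaded}", f"Version: {tag}"]
    if failed:
        lines.append("Failed: " + ", ".join(failed))
    return urllib.request.Request(
        NTFY_URL,
        data="\n".join(lines).encode("utf-8"),
        method="POST",
        headers={
            "Title": "Kubeflow pipelines uploaded" if ok else "Kubeflow pipeline upload failed",
            "Tags": "white_check_mark" if ok else "x",
            "Click": run_url,  # makes the notification open the CI run
        },
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Separating construction from sending lets the message format be checked without a reachable ntfy server.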

Current Pipelines

| Pipeline | Purpose |
|----------|---------|
| document_ingestion_pipeline | RAG document processing with MLflow |
| evaluation_pipeline | Model evaluation |
| dvd_transcription_pipeline | DVD audio → transcript via Whisper |
| qlora_pdf_pipeline | QLoRA fine-tune on PDFs from S3 |
| voice_cloning_pipeline | Speaker extraction + VITS voice training |
| vllm_tuning_pipeline | vLLM inference parameter tuning |