docs: add ADR-0054 Kubeflow Pipeline CI/CD and ADR-0055 Internal Python Package Publishing
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
decisions/0054-kubeflow-pipeline-cicd.md
@@ -0,0 +1,131 @@
# Kubeflow Pipeline CI/CD

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy
* Technical Story: Automate compilation and upload of Kubeflow Pipelines on git push

## Context and Problem Statement

Kubeflow Pipelines are defined as Python scripts (`*_pipeline.py`) that compile to YAML IR documents. Each script must be compiled with `kfp`, and the resulting YAML must then be uploaded to the Kubeflow Pipelines API. Doing this manually is error-prone and easy to forget; a push to `main` should automatically make pipelines available in the Kubeflow UI.

How do we automate the compile-and-upload lifecycle for Kubeflow Pipelines using the existing Gitea Actions CI infrastructure?

## Decision Drivers

* Pipeline definitions change frequently as new ML workflows are added
* Manual `kfp pipeline upload` is tedious and easy to forget
* Kubeflow Pipelines API is accessible within the cluster
* Gitea Actions runners already exist (ADR-0031)
* Notifications via ntfy are established (ADR-0015)

## Considered Options

1. **Gitea Actions workflow with in-cluster KFP API access**
2. **Argo Events watching git repo, triggering Argo Workflow to upload**
3. **CronJob polling for changes**
4. **Manual upload via CLI**

## Decision Outcome

Chosen option: **Option 1: Gitea Actions workflow**, because the runners are already in-cluster, the pattern is consistent with other CI workflows (ADR-0031), and it provides immediate feedback via ntfy.

### Positive Consequences

* Zero-touch pipeline deployment: push to `main` and pipelines appear in Kubeflow
* Consistent CI pattern across all repositories
* Version tracking with timestamped tags (`v20260213-143022`)
* Existing pipelines get new versions; new pipelines are auto-created
* ntfy notifications on success/failure

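The timestamped tag format shown above (`v20260213-143022`) could be produced along these lines. This is only a sketch: the function name `version_tag` and the choice of UTC are assumptions, since the ADR does not say which clock or time zone the workflow uses.

```python
from datetime import datetime, timezone
from typing import Optional


def version_tag(now: Optional[datetime] = None) -> str:
    """Return a tag shaped like v20260213-143022 (vYYYYMMDD-HHMMSS).

    Assumption: UTC is used when no timestamp is supplied.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime("v%Y%m%d-%H%M%S")
```

Passing the timestamp in explicitly keeps the helper deterministic, which is convenient if the same tag has to be reused across the upload and notification steps.
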
### Negative Consequences

* Requires a NetworkPolicy to allow cross-namespace traffic (gitea → kubeflow)
* Pipeline compilation happens in CI rather than locally, so compilation errors only surface after a push
* KFP SDK version must be pinned in CI to match the cluster

## Implementation

### Workflow Structure

The workflow (`.gitea/workflows/compile-upload.yaml`) has two jobs:

| Job | Purpose |
|-----|---------|
| `compile-and-upload` | Find `*_pipeline.py`, compile each with KFP, upload YAML to Kubeflow |
| `notify` | Send ntfy notification with compile/upload summary |

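The compile half of `compile-and-upload` might look roughly like the sketch below. It assumes each `*_pipeline.py` script compiles itself to YAML when executed directly, which the ADR does not state; the helper name `compile_pipelines` is likewise hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def compile_pipelines(scripts: list[Path]) -> tuple[list[str], list[str]]:
    """Run each pipeline script and return (compiled, failed) script names.

    Assumption: running a script writes its compiled YAML next to it.
    """
    compiled: list[str] = []
    failed: list[str] = []
    for script in scripts:
        # Run from the script's directory so relative output paths resolve there.
        result = subprocess.run(
            [sys.executable, script.name],
            cwd=script.parent,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            compiled.append(script.name)
        else:
            failed.append(script.name)
            print(f"{script.name} failed to compile:\n{result.stderr}", file=sys.stderr)
    return compiled, failed
```

Collecting failures instead of stopping at the first one lets the later notification report every broken pipeline in a single run.
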
### Pipeline Discovery

```yaml
on:
  push:
    branches: [main]
    paths:
      - "**/*_pipeline.py"
      - "**/*pipeline*.py"
  workflow_dispatch:
```

Pipelines are discovered at runtime with `find . -maxdepth 1 -name '*_pipeline.py'`, avoiding shell issues with glob expansion in CI variables. Note that the trigger paths match files at any depth, whereas discovery with `-maxdepth 1` only picks up scripts in the repository root. The `workflow_dispatch` trigger allows manual re-runs.

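For readers who prefer Python to `find`, the same root-only discovery can be sketched with `pathlib`. The function name `discover_pipelines` is hypothetical, and the workflow itself uses the shell command quoted above.

```python
from pathlib import Path


def discover_pipelines(root: Path = Path(".")) -> list[Path]:
    """Return *_pipeline.py files directly under root, sorted by name.

    A non-recursive glob only matches direct children, which mirrors
    `find . -maxdepth 1 -name '*_pipeline.py'`.
    """
    return sorted(p for p in root.glob("*_pipeline.py") if p.is_file())
```

Because the pattern contains no `**`, scripts in subdirectories are ignored, just as they are with `-maxdepth 1`.
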
### Upload Strategy

The upload step uses an inline Python script with the KFP client:

1. Connect to `ml-pipeline.kubeflow.svc.cluster.local:8888`
2. For each compiled YAML:
   - Check if a pipeline with that name already exists
   - **Exists** → upload as a new version with timestamp tag
   - **New** → create the pipeline
3. Report uploaded/failed counts as job outputs

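The steps above can be sketched as follows. This is not the workflow's actual inline script: `upload_pipelines` and `connect` are hypothetical names, the pipeline name is assumed to be the YAML file stem, and the version name is assumed to be the bare timestamp tag. The client calls (`get_pipeline_id`, `upload_pipeline`, `upload_pipeline_version`) are methods of `kfp.Client`.

```python
import sys
from pathlib import Path


def connect(host: str = "http://ml-pipeline.kubeflow.svc.cluster.local:8888"):
    """Create a KFP client; kfp is imported lazily so the rest runs without it."""
    import kfp

    return kfp.Client(host=host)


def upload_pipelines(client, yaml_files, version_tag):
    """Upload each compiled YAML and return (uploaded, failed) pipeline names."""
    uploaded, failed = [], []
    for path in yaml_files:
        name = Path(path).stem  # assumption: pipeline name equals the file stem
        try:
            pipeline_id = client.get_pipeline_id(name)
            if pipeline_id:
                # Pipeline exists: add a new version under the timestamp tag.
                client.upload_pipeline_version(
                    pipeline_package_path=str(path),
                    pipeline_version_name=version_tag,
                    pipeline_id=pipeline_id,
                )
            else:
                # Pipeline is new: create it.
                client.upload_pipeline(
                    pipeline_package_path=str(path),
                    pipeline_name=name,
                )
            uploaded.append(name)
        except Exception as exc:  # keep going so every failure gets reported
            failed.append(name)
            print(f"upload failed for {name}: {exc}", file=sys.stderr)
    return uploaded, failed
```

A caller would typically run `upload_pipelines(connect(), sorted(Path(".").glob("*_pipeline.yaml")), tag)` and expose the lengths of the two returned lists as job outputs.
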
### NetworkPolicy Requirement

Gitea Actions runners run in the `gitea` namespace. Kubeflow's NetworkPolicies deny cross-namespace ingress by default. A dedicated policy was added:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gitea-ingress
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gitea
```

This joins existing policies for envoy (external access) and ai-ml namespace (pipeline-bridge, kfp-sync-job).

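One way to confirm the policy is in effect is a plain TCP reachability check run from a pod in the `gitea` namespace. The helper below is a hypothetical sketch rather than part of the workflow; it only tests that a connection can be opened and says nothing about the health of the KFP API itself.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

From a runner pod, `can_reach("ml-pipeline.kubeflow.svc.cluster.local", 8888)` returning `False` would suggest the ingress rule or the namespace label is not applied as expected.
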
### Notification

The `notify` job sends a summary to `ntfy.observability.svc.cluster.local:80/gitea-ci` including:

- Compile count and upload count
- Version tag
- Failed pipeline names (on failure)
- Clickable link to the CI run in Gitea

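A notification carrying the fields listed above could be assembled and posted as in this sketch. The function names, message wording, and tag choices are assumptions; `Title`, `Tags`, and `Click` are standard ntfy publish headers, and the URL is the topic named in this ADR.

```python
import urllib.request

NTFY_URL = "http://ntfy.observability.svc.cluster.local:80/gitea-ci"


def build_notification(compiled, uploaded, version_tag, failed, run_url):
    """Return (body, headers) for an ntfy message summarising the run."""
    ok = not failed
    lines = [
        f"Compiled: {compiled}, uploaded: {uploaded}",
        f"Version: {version_tag}",
    ]
    if failed:
        lines.append("Failed: " + ", ".join(failed))
    headers = {
        "Title": "Kubeflow pipelines " + ("uploaded" if ok else "upload failed"),
        "Tags": "white_check_mark" if ok else "x",
        "Click": run_url,  # makes the notification open the CI run when tapped
    }
    return "\n".join(lines), headers


def send_notification(body, headers, url=NTFY_URL):
    """POST the message to ntfy and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=body.encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping message construction separate from the HTTP call means the summary text can be checked without a reachable ntfy server.
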
## Current Pipelines

| Pipeline | Purpose |
|----------|---------|
| `document_ingestion_pipeline` | RAG document processing with MLflow |
| `evaluation_pipeline` | Model evaluation |
| `dvd_transcription_pipeline` | DVD audio → transcript via Whisper |
| `qlora_pdf_pipeline` | QLoRA fine-tune on PDFs from S3 |
| `voice_cloning_pipeline` | Speaker extraction + VITS voice training |
| `vllm_tuning_pipeline` | vLLM inference parameter tuning |

## Links

* Related to [ADR-0009](0009-dual-workflow-engines.md) (Kubeflow Pipelines)
* Related to [ADR-0013](0013-gitea-actions-for-ci.md) (Gitea Actions)
* Related to [ADR-0015](0015-ci-notifications-and-semantic-versioning.md) (ntfy notifications)
* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (Gitea CI/CD patterns)
* Related to [ADR-0043](0043-cilium-cni-network-fabric.md) (NetworkPolicy)