docs: add ADR-0054 Kubeflow Pipeline CI/CD and ADR-0055 Internal Python Package Publishing
decisions/0054-kubeflow-pipeline-cicd.md
# Kubeflow Pipeline CI/CD

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy
* Technical Story: Automate compilation and upload of Kubeflow Pipelines on git push

## Context and Problem Statement

Kubeflow Pipelines are defined as Python scripts (`*_pipeline.py`) that compile to YAML IR documents. These must be compiled with `kfp` and then uploaded to the Kubeflow Pipelines API. Doing this manually is error-prone and easy to forget — a push to `main` should automatically make pipelines available in the Kubeflow UI.

How do we automate the compile-and-upload lifecycle for Kubeflow Pipelines using the existing Gitea Actions CI infrastructure?
## Decision Drivers

* Pipeline definitions change frequently as new ML workflows are added
* Manual `kfp pipeline upload` is tedious and easy to forget
* The Kubeflow Pipelines API is accessible within the cluster
* Gitea Actions runners already exist (ADR-0031)
* Notifications via ntfy are established (ADR-0015)

## Considered Options

1. **Gitea Actions workflow with in-cluster KFP API access**
2. **Argo Events watching the git repo, triggering an Argo Workflow to upload**
3. **CronJob polling for changes**
4. **Manual upload via CLI**

## Decision Outcome

Chosen option: **Option 1 — Gitea Actions workflow**, because the runners are already in-cluster, the pattern is consistent with other CI workflows (ADR-0031), and it provides immediate feedback via ntfy.
### Positive Consequences

* Zero-touch pipeline deployment — push to main and pipelines appear in Kubeflow
* Consistent CI pattern across all repositories
* Version tracking with timestamped tags (`v20260213-143022`)
* Existing pipelines get new versions; new pipelines are auto-created
* ntfy notifications on success/failure

### Negative Consequences

* Requires a NetworkPolicy to allow cross-namespace traffic (gitea → kubeflow)
* Pipeline compilation happens in CI, not locally — compilation errors only surface in CI
* KFP SDK version must be pinned in CI to match the cluster
## Implementation

### Workflow Structure

The workflow (`.gitea/workflows/compile-upload.yaml`) has two jobs:

| Job | Purpose |
|-----|---------|
| `compile-and-upload` | Find `*_pipeline.py`, compile each with KFP, upload YAML to Kubeflow |
| `notify` | Send ntfy notification with compile/upload summary |

### Pipeline Discovery

```yaml
on:
  push:
    branches: [main]
    paths:
      - "**/*_pipeline.py"
      - "**/*pipeline*.py"
  workflow_dispatch:
```
Pipelines are discovered at runtime with `find . -maxdepth 1 -name '*_pipeline.py'`, avoiding shell issues with glob expansion in CI variables. The `workflow_dispatch` trigger allows manual re-runs.
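The discover-and-compile step can be sketched in Python. This is an illustration, not the repository's actual workflow script; it assumes the KFP v2 CLI (`kfp dsl compile`) is on PATH, and the function names are hypothetical:

```python
import subprocess
from pathlib import Path


def discover_pipelines(root: str = ".") -> list[Path]:
    """Mirror `find . -maxdepth 1 -name '*_pipeline.py'`: repo root only, no recursion."""
    return sorted(Path(root).glob("*_pipeline.py"))


def compile_pipeline(source: Path) -> Path:
    """Compile one pipeline script to its YAML IR using the kfp CLI."""
    output = source.with_suffix(".yaml")
    subprocess.run(
        ["kfp", "dsl", "compile", "--py", str(source), "--output", str(output)],
        check=True,  # a non-zero exit fails the CI job immediately
    )
    return output
```

In CI the loop would simply be `for p in discover_pipelines(): compile_pipeline(p)`, collecting failures for the notify job.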
### Upload Strategy

The upload step uses an inline Python script with the KFP client:

1. Connect to `ml-pipeline.kubeflow.svc.cluster.local:8888`
2. For each compiled YAML:
   - Check if a pipeline with that name already exists
   - **Exists** → upload as a new version with a timestamp tag
   - **New** → create the pipeline
3. Report uploaded/failed counts as job outputs
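The exists-or-create logic above can be sketched with the KFP SDK client methods (`get_pipeline_id`, `upload_pipeline`, `upload_pipeline_version`). The function name and structure are a sketch, not the actual inline script; the client is passed in so the logic stays testable:

```python
def upload_or_version(client, yaml_path: str, name: str, version_tag: str) -> str:
    """Upload a compiled pipeline: new version if the name exists, else create it."""
    pipeline_id = client.get_pipeline_id(name)  # returns None when the pipeline is new
    if pipeline_id:
        client.upload_pipeline_version(
            pipeline_package_path=yaml_path,
            pipeline_version_name=version_tag,  # e.g. v20260213-143022
            pipeline_id=pipeline_id,
        )
        return "versioned"
    client.upload_pipeline(pipeline_package_path=yaml_path, pipeline_name=name)
    return "created"
```

In CI the real client would be `kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")`, and the caller would tally "versioned"/"created"/failed counts into job outputs.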
### NetworkPolicy Requirement

Gitea Actions runners run in the `gitea` namespace. Kubeflow's NetworkPolicies default-deny cross-namespace ingress. A dedicated policy was added:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gitea-ingress
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gitea
```
This joins existing policies for envoy (external access) and the ai-ml namespace (pipeline-bridge, kfp-sync-job).
### Notification

The `notify` job sends a summary to `ntfy.observability.svc.cluster.local:80/gitea-ci` including:

- Compile count and upload count
- Version tag
- Failed pipeline names (on failure)
- A clickable link to the CI run in Gitea
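ntfy accepts a plain POST body with `Title`, `Tags`, and `Click` headers, which covers everything in the list above. A sketch of assembling that notification (function names and exact field wording are illustrative, not the workflow's actual step):

```python
import urllib.request

NTFY_URL = "http://ntfy.observability.svc.cluster.local:80/gitea-ci"


def build_notification(compiled: int, uploaded: int, version: str,
                       failed: list[str], run_url: str) -> tuple[dict, str]:
    """Return (headers, body) for the ntfy POST; any failure flips title and tags."""
    ok = not failed
    headers = {
        "Title": "KFP pipelines uploaded" if ok else "KFP pipeline upload FAILED",
        "Tags": "white_check_mark" if ok else "x",
        "Click": run_url,  # ntfy renders this as a clickable link to the CI run
    }
    body = f"compiled={compiled} uploaded={uploaded} version={version}"
    if failed:
        body += "\nfailed: " + ", ".join(failed)
    return headers, body


def send(headers: dict, body: str) -> None:
    """POST the summary to the in-cluster ntfy service."""
    req = urllib.request.Request(NTFY_URL, data=body.encode(), headers=headers)
    urllib.request.urlopen(req)
```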
## Current Pipelines

| Pipeline | Purpose |
|----------|---------|
| `document_ingestion_pipeline` | RAG document processing with MLflow |
| `evaluation_pipeline` | Model evaluation |
| `dvd_transcription_pipeline` | DVD audio → transcript via Whisper |
| `qlora_pdf_pipeline` | QLoRA fine-tune on PDFs from S3 |
| `voice_cloning_pipeline` | Speaker extraction + VITS voice training |
| `vllm_tuning_pipeline` | vLLM inference parameter tuning |

## Links

* Related to [ADR-0009](0009-dual-workflow-engines.md) (Kubeflow Pipelines)
* Related to [ADR-0013](0013-gitea-actions-for-ci.md) (Gitea Actions)
* Related to [ADR-0015](0015-ci-notifications-and-semantic-versioning.md) (ntfy notifications)
* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (Gitea CI/CD patterns)
* Related to [ADR-0043](0043-cilium-cni-network-fabric.md) (NetworkPolicy)
decisions/0055-internal-python-package-publishing.md
# Internal Python Package Publishing

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy
* Technical Story: Publish reusable Python packages to Gitea's built-in PyPI registry with automated CI

## Context and Problem Statement

Shared Python libraries like `mlflow_utils` are used across multiple projects (handler-base, Kubeflow pipelines, Argo workflows). Currently these are consumed via git dependencies or copy-paste. This is fragile — there's no versioning, no quality gate, and no single source of truth for installed versions.

How do we publish internal Python packages so they can be installed with `pip install` / `uv add` from a private registry, with automated quality checks and versioning?
## Decision Drivers

* Shared libraries are consumed by multiple services and pipelines
* Need version pinning for reproducible builds
* Quality gates (lint, format, test) should run before publishing
* Must work with `uv`, `pip`, and KFP container images
* Self-hosted — no PyPI.org or external registries
* Consistent with existing CI patterns (ADR-0031, ADR-0015)

## Considered Options

1. **Gitea's built-in PyPI registry** with CI-driven publish
2. **Private PyPI server** (pypiserver or devpi)
3. **Git-based dependencies** (`pip install git+https://...`)
4. **Vendored copies** in each consuming repository

## Decision Outcome

Chosen option: **Option 1 — Gitea's built-in PyPI registry**, because Gitea already provides a packages API with PyPI compatibility, eliminating the need for another service. Combined with `uv build` and `twine upload`, the publish workflow is minimal.
### Positive Consequences

* Standard `pip install mlflow-utils --index-url ...` works everywhere
* Semantic versioning with git tags provides clear release history
* Lint + format + test gates prevent broken packages from publishing
* No additional infrastructure — Gitea handles package storage
* Consuming projects can pin exact versions

### Negative Consequences

* Registry credentials must be configured as CI secrets per repo
* Gitea's PyPI registry is basic (no yanking, no project pages)
* Version conflicts are possible if consumers don't pin
## Implementation

### Package Structure

```
mlflow/
├── pyproject.toml            # hatchling build, ruff+pytest dev deps
├── uv.lock                   # Locked dependencies
├── mlflow_utils/
│   ├── __init__.py
│   ├── client.py
│   ├── tracker.py
│   ├── inference_tracker.py
│   ├── model_registry.py
│   ├── kfp_components.py
│   ├── experiment_comparison.py
│   └── cli.py                # CLI entrypoint: mlflow-utils
└── tests/
    └── test_smoke.py         # Import validation for all modules
```
### CI Workflow

Four jobs in `.gitea/workflows/ci.yaml`:

| Job | Purpose | Gate |
|-----|---------|------|
| `lint` | `ruff check` + `ruff format --check` | Must pass |
| `test` | `pytest -v` | Must pass |
| `publish` | Build + upload to Gitea PyPI + tag | After lint+test, main only |
| `notify` | ntfy success/failure notification | Always |
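A skeleton of how those four jobs might be wired. The job names and gates come from the table; the step details, context expressions, and `$REGISTRY_URL` variable are assumptions for illustration, not the repository's actual file:

```yaml
# Sketch only — not the repo's .gitea/workflows/ci.yaml.
name: ci
on:
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -LsSf https://astral.sh/uv/install.sh | sh   # setup-uv action unavailable in act
      - run: |
          export PATH="$HOME/.local/bin:$PATH"
          uv sync --frozen --extra dev
          uv run ruff check . && uv run ruff format --check .

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - run: |
          export PATH="$HOME/.local/bin:$PATH"
          uv sync --frozen --extra dev
          uv run pytest -v

  publish:
    needs: [lint, test]
    if: github.ref == 'refs/heads/main'   # Gitea's act runner accepts the github.* context
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - run: |
          export PATH="$HOME/.local/bin:$PATH"
          uv build
          uvx twine upload --repository-url "$REGISTRY_URL" dist/*

  notify:
    needs: [publish]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - run: curl -s -d "mlflow-utils CI finished" http://ntfy.observability.svc.cluster.local:80/gitea-ci
```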
### Key Design Decisions

**uv over pip for CI**: All jobs use `uv` installed via `curl -LsSf https://astral.sh/uv/install.sh | sh` rather than the `astral-sh/setup-uv` GitHub Action, which is unavailable in Gitea's act runner. `uv sync --frozen --extra dev` ensures reproducible installs from the lockfile.
**uvx twine for publishing**: Rather than `uv pip install twine --system` (blocked by Debian's externally-managed environment), `uvx twine upload` runs twine in an ephemeral virtual environment.

**Semantic versioning from commit messages**: Same pattern as ADR-0015 — commit prefixes (`major:`, `feat:`, `fix:`) determine version bumps. The publish step patches `pyproject.toml` at build time via `sed`, builds with `uv build`, uploads with `twine`, then tags.
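The prefix-to-bump mapping can be made concrete. A sketch of the logic only (the workflow implements it in shell; treating anything other than `major:`/`feat:` as a patch bump is an assumption here):

```python
def bump_version(current: str, commit_message: str) -> str:
    """Map commit prefixes to semver bumps: major: -> X+1.0.0, feat: -> minor, else patch."""
    major, minor, patch = (int(part) for part in current.split("."))
    if commit_message.startswith("major:"):
        return f"{major + 1}.0.0"
    if commit_message.startswith("feat:"):
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # assumption: fix: and everything else -> patch
```

The resulting version is then patched into `pyproject.toml` (e.g. with a `sed` substitution on the `version = ...` line) before `uv build` runs.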
### Registry Configuration

| Setting | Value |
|---------|-------|
| **Registry URL** | `http://gitea-http.gitea.svc.cluster.local:3000/api/packages/daviestechlabs/pypi` |
| **Auth** | `REGISTRY_USER` + `REGISTRY_TOKEN` repo secrets (Gitea admin credentials) |
| **External URL** | `https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/` |
### Consuming Packages

From any project or Dockerfile:

```bash
# uv
uv add mlflow-utils --index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/

# pip
pip install mlflow-utils --index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/
```
### Quality Gates

| Tool | Check | Config |
|------|-------|--------|
| ruff check | Lint rules (F, E, W, I) | `line-length = 120` in pyproject.toml |
| ruff format | Code formatting | Consistent with check config |
| pytest | Import smoke tests, unit tests | `kfp` auto-skipped if not installed |
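The ruff rows of the table translate to a small `pyproject.toml` fragment. Only `line-length = 120` and the F/E/W/I rule set are stated above; the section layout follows ruff's standard config keys:

```toml
[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["F", "E", "W", "I"]  # pyflakes, pycodestyle errors/warnings, isort
```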
## Future Packages

This pattern applies to any shared Python library:

| Candidate | Repository | Status |
|-----------|-----------|--------|
| `mlflow-utils` | `mlflow` | Published |
| `handler-base` | `handler-base` | Candidate |
| `ray-serve-apps` | `ray-serve` | Candidate |

## Links

* Related to [ADR-0012](0012-use-uv-for-python-development.md) (uv for Python)
* Related to [ADR-0015](0015-ci-notifications-and-semantic-versioning.md) (semantic versioning)
* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (Gitea CI/CD patterns)
* Related to [ADR-0047](0047-mlflow-experiment-tracking.md) (mlflow_utils library)
* Updates [ADR-0020](0020-internal-registry-for-cicd.md) (internal registry — now includes PyPI)