diff --git a/decisions/0054-kubeflow-pipeline-cicd.md b/decisions/0054-kubeflow-pipeline-cicd.md new file mode 100644 index 0000000..380c54f --- /dev/null +++ b/decisions/0054-kubeflow-pipeline-cicd.md @@ -0,0 +1,131 @@ +# Kubeflow Pipeline CI/CD + +* Status: accepted +* Date: 2026-02-13 +* Deciders: Billy +* Technical Story: Automate compilation and upload of Kubeflow Pipelines on git push + +## Context and Problem Statement + +Kubeflow Pipelines are defined as Python scripts (`*_pipeline.py`) that compile to YAML IR documents. These must be compiled with `kfp` and then uploaded to the Kubeflow Pipelines API. Doing this manually is error-prone and easy to forget — a push to `main` should automatically make pipelines available in the Kubeflow UI. + +How do we automate the compile-and-upload lifecycle for Kubeflow Pipelines using the existing Gitea Actions CI infrastructure? + +## Decision Drivers + +* Pipeline definitions change frequently as new ML workflows are added +* Manual `kfp pipeline upload` is tedious and easy to forget +* Kubeflow Pipelines API is accessible within the cluster +* Gitea Actions runners already exist (ADR-0031) +* Notifications via ntfy are established (ADR-0015) + +## Considered Options + +1. **Gitea Actions workflow with in-cluster KFP API access** +2. **Argo Events watching git repo, triggering Argo Workflow to upload** +3. **CronJob polling for changes** +4. **Manual upload via CLI** + +## Decision Outcome + +Chosen option: **Option 1 — Gitea Actions workflow**, because the runners are already in-cluster, the pattern is consistent with other CI workflows (ADR-0031), and it provides immediate feedback via ntfy. 
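The heart of the chosen option is the upload step's exists-then-version logic (detailed under Implementation below). A minimal sketch of that logic, with the `kfp.Client` passed in so it can be exercised without a cluster — the helper names are illustrative, not the workflow's actual code:

```python
"""Sketch of the CI upload step's decision logic.

Assumption: the workflow uses the standard kfp.Client API
(get_pipeline_id / upload_pipeline / upload_pipeline_version);
the client is injected so this logic is testable offline.
"""
from datetime import datetime, timezone

# In-cluster KFP API endpoint used by the real workflow.
KFP_HOST = "http://ml-pipeline.kubeflow.svc.cluster.local:8888"


def version_tag(now=None) -> str:
    """Timestamped version tag, e.g. v20260213-143022."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("v%Y%m%d-%H%M%S")


def upload_or_version(client, yaml_path: str, name: str, tag: str) -> str:
    """Upload a compiled pipeline YAML.

    Existing pipeline -> new version with the timestamp tag.
    Unknown pipeline  -> created fresh.
    """
    pipeline_id = client.get_pipeline_id(name)  # None if not found
    if pipeline_id is not None:
        client.upload_pipeline_version(
            pipeline_package_path=yaml_path,
            pipeline_version_name=tag,
            pipeline_id=pipeline_id,
        )
        return "versioned"
    client.upload_pipeline(pipeline_package_path=yaml_path, pipeline_name=name)
    return "created"
```

The real workflow points the client at `KFP_HOST`, runs this over each compiled YAML, and reports the versioned/created/failed counts as job outputs.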
+ +### Positive Consequences + +* Zero-touch pipeline deployment — push to main and pipelines appear in Kubeflow +* Consistent CI pattern across all repositories +* Version tracking with timestamped tags (`v20260213-143022`) +* Existing pipelines get new versions; new pipelines are auto-created +* ntfy notifications on success/failure + +### Negative Consequences + +* Requires NetworkPolicy to allow cross-namespace traffic (gitea → kubeflow) +* Pipeline compilation happens in CI, not locally — compilation errors only surface in CI +* KFP SDK version must be pinned in CI to match the cluster + +## Implementation + +### Workflow Structure + +The workflow (`.gitea/workflows/compile-upload.yaml`) has two jobs: + +| Job | Purpose | +|-----|---------| +| `compile-and-upload` | Find `*_pipeline.py`, compile each with KFP, upload YAML to Kubeflow | +| `notify` | Send ntfy notification with compile/upload summary | + +### Pipeline Discovery + +```yaml +on: + push: + branches: [main] + paths: + - "**/*_pipeline.py" + - "**/*pipeline*.py" + workflow_dispatch: +``` + +Pipelines are discovered at runtime with `find . -maxdepth 1 -name '*_pipeline.py'`, avoiding shell issues with glob expansion in CI variables. The `workflow_dispatch` trigger allows manual re-runs. + +### Upload Strategy + +The upload step uses an inline Python script with the KFP client: + +1. Connect to `ml-pipeline.kubeflow.svc.cluster.local:8888` +2. For each compiled YAML: + - Check if a pipeline with that name already exists + - **Exists** → upload as a new version with timestamp tag + - **New** → create the pipeline +3. Report uploaded/failed counts as job outputs + +### NetworkPolicy Requirement + +Gitea Actions runners run in the `gitea` namespace. Kubeflow's NetworkPolicies default-deny cross-namespace ingress. 
A dedicated policy was added: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: allow-gitea-ingress + namespace: kubeflow +spec: + podSelector: {} + policyTypes: + - Ingress + ingress: + - from: + - namespaceSelector: + matchLabels: + kubernetes.io/metadata.name: gitea +``` + +This joins existing policies for envoy (external access) and ai-ml namespace (pipeline-bridge, kfp-sync-job). + +### Notification + +The `notify` job sends a summary to `ntfy.observability.svc.cluster.local:80/gitea-ci` including: +- Compile count and upload count +- Version tag +- Failed pipeline names (on failure) +- Clickable link to the CI run in Gitea + +## Current Pipelines + +| Pipeline | Purpose | +|----------|---------| +| `document_ingestion_pipeline` | RAG document processing with MLflow | +| `evaluation_pipeline` | Model evaluation | +| `dvd_transcription_pipeline` | DVD audio → transcript via Whisper | +| `qlora_pdf_pipeline` | QLoRA fine-tune on PDFs from S3 | +| `voice_cloning_pipeline` | Speaker extraction + VITS voice training | +| `vllm_tuning_pipeline` | vLLM inference parameter tuning | + +## Links + +* Related to [ADR-0009](0009-dual-workflow-engines.md) (Kubeflow Pipelines) +* Related to [ADR-0013](0013-gitea-actions-for-ci.md) (Gitea Actions) +* Related to [ADR-0015](0015-ci-notifications-and-semantic-versioning.md) (ntfy notifications) +* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (Gitea CI/CD patterns) +* Related to [ADR-0043](0043-cilium-cni-network-fabric.md) (NetworkPolicy) diff --git a/decisions/0055-internal-python-package-publishing.md b/decisions/0055-internal-python-package-publishing.md new file mode 100644 index 0000000..c0bcdbf --- /dev/null +++ b/decisions/0055-internal-python-package-publishing.md @@ -0,0 +1,132 @@ +# Internal Python Package Publishing + +* Status: accepted +* Date: 2026-02-13 +* Deciders: Billy +* Technical Story: Publish reusable Python packages to Gitea's built-in PyPI registry with automated 
CI + +## Context and Problem Statement + +Shared Python libraries like `mlflow_utils` are used across multiple projects (handler-base, Kubeflow pipelines, Argo workflows). Currently these are consumed via git dependencies or copy-paste. This is fragile — there's no versioning, no quality gate, and no single source of truth for installed versions. + +How do we publish internal Python packages so they can be installed with `pip install` / `uv add` from a private registry, with automated quality checks and versioning? + +## Decision Drivers + +* Shared libraries are consumed by multiple services and pipelines +* Need version pinning for reproducible builds +* Quality gates (lint, format, test) should run before publishing +* Must work with `uv`, `pip`, and KFP container images +* Self-hosted — no PyPI.org or external registries +* Consistent with existing CI patterns (ADR-0031, ADR-0015) + +## Considered Options + +1. **Gitea's built-in PyPI registry** with CI-driven publish +2. **Private PyPI server** (pypiserver or devpi) +3. **Git-based dependencies** (`pip install git+https://...`) +4. **Vendored copies** in each consuming repository + +## Decision Outcome + +Chosen option: **Option 1 — Gitea's built-in PyPI registry**, because Gitea already provides a packages API with PyPI compatibility, eliminating the need for another service. Combined with `uv build` and `twine upload`, the publish workflow is minimal. 
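To make "minimal" concrete: the publish amounts to a build plus an upload. A hypothetical sketch of that workflow step — the step name and wiring are illustrative; the registry URL and secret names come from the Registry Configuration section below:

```yaml
# Hypothetical publish step (illustrative, not the repo's actual workflow).
# --repository-url, -u, -p are standard twine flags.
- name: build-and-publish
  run: |
    uv build
    uvx twine upload \
      --repository-url http://gitea-http.gitea.svc.cluster.local:3000/api/packages/daviestechlabs/pypi \
      -u "${{ secrets.REGISTRY_USER }}" \
      -p "${{ secrets.REGISTRY_TOKEN }}" \
      dist/*
```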
+ +### Positive Consequences + +* Standard `pip install mlflow-utils --index-url ...` works everywhere +* Semantic versioning with git tags provides clear release history +* Lint + format + test gates prevent broken packages from publishing +* No additional infrastructure — Gitea handles package storage +* Consuming projects can pin exact versions + +### Negative Consequences + +* Registry credentials must be configured as CI secrets per repo +* Gitea's PyPI registry is basic (no yanking, no project pages) +* Version conflicts possible if consumers don't pin + +## Implementation + +### Package Structure + +``` +mlflow/ +├── pyproject.toml # hatchling build, ruff+pytest dev deps +├── uv.lock # Locked dependencies +├── mlflow_utils/ +│ ├── __init__.py +│ ├── client.py +│ ├── tracker.py +│ ├── inference_tracker.py +│ ├── model_registry.py +│ ├── kfp_components.py +│ ├── experiment_comparison.py +│ └── cli.py # CLI entrypoint: mlflow-utils +└── tests/ + └── test_smoke.py # Import validation for all modules +``` + +### CI Workflow + +Four jobs in `.gitea/workflows/ci.yaml`: + +| Job | Purpose | Gate | +|-----|---------|------| +| `lint` | `ruff check` + `ruff format --check` | Must pass | +| `test` | `pytest -v` | Must pass | +| `publish` | Build + upload to Gitea PyPI + tag | After lint+test, main only | +| `notify` | ntfy success/failure notification | Always | + +### Key Design Decisions + +**uv over pip for CI**: All jobs use `uv` installed via `curl -LsSf https://astral.sh/uv/install.sh | sh` rather than the `astral-sh/setup-uv` GitHub Action, which is unavailable in Gitea's act runner. `uv sync --frozen --extra dev` ensures reproducible installs from the lockfile. + +**uvx twine for publishing**: Rather than `uv pip install twine --system` (blocked by Debian's externally-managed environment), `uvx twine upload` runs twine in an ephemeral virtual environment. 
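**Version bump sketch**: the commit-prefix rule from ADR-0015, applied by the publish step, reduces to a small pure function. Treating `fix:` and unprefixed commits alike as patch bumps is an assumption here:

```python
# Sketch of the ADR-0015 commit-prefix bump rule.
# Assumption: major: -> major, feat: -> minor, anything else -> patch.
def bump_version(version: str, commit_message: str) -> str:
    major, minor, patch = (int(p) for p in version.split("."))
    msg = commit_message.lower()
    if msg.startswith("major:"):
        return f"{major + 1}.0.0"
    if msg.startswith("feat:"):
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump_version("1.2.3", "feat: add inference tracker")` yields `"1.3.0"`; the result is what gets patched into `pyproject.toml` before `uv build`.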
+ +**Semantic versioning from commit messages**: Same pattern as ADR-0015 — commit prefixes (`major:`, `feat:`, `fix:`) determine version bumps. The publish step patches `pyproject.toml` at build time via `sed`, builds with `uv build`, uploads with `twine`, then tags. + +### Registry Configuration + +| Setting | Value | +|---------|-------| +| **Registry URL** | `http://gitea-http.gitea.svc.cluster.local:3000/api/packages/daviestechlabs/pypi` | +| **Auth** | `REGISTRY_USER` + `REGISTRY_TOKEN` repo secrets (Gitea admin credentials) | +| **External URL** | `https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/` | + +### Consuming Packages + +From any project or Dockerfile: + +```bash +# uv +uv add mlflow-utils --index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/ + +# pip +pip install mlflow-utils --index-url https://git.daviestechlabs.io/api/packages/daviestechlabs/pypi/simple/ +``` + +### Quality Gates + +| Tool | Check | Config | +|------|-------|--------| +| ruff check | Lint rules (F, E, W, I) | `line-length = 120` in pyproject.toml | +| ruff format | Code formatting | Consistent with check config | +| pytest | Import smoke tests, unit tests | `kfp` auto-skipped if not installed | + +## Future Packages + +This pattern applies to any shared Python library: + +| Candidate | Repository | Status | +|-----------|-----------|--------| +| `mlflow-utils` | `mlflow` | Published | +| `handler-base` | `handler-base` | Candidate | +| `ray-serve-apps` | `ray-serve` | Candidate | + +## Links + +* Related to [ADR-0012](0012-use-uv-for-python-development.md) (uv for Python) +* Related to [ADR-0015](0015-ci-notifications-and-semantic-versioning.md) (semantic versioning) +* Related to [ADR-0031](0031-gitea-cicd-strategy.md) (Gitea CI/CD patterns) +* Related to [ADR-0047](0047-mlflow-experiment-tracking.md) (mlflow_utils library) +* Updates [ADR-0020](0020-internal-registry-for-cicd.md) (internal registry — now includes PyPI)