MLflow Experiment Tracking and Model Registry

  • Status: accepted
  • Date: 2026-02-09
  • Deciders: Billy
  • Technical Story: Provide centralized experiment tracking, model versioning, and artifact storage for ML workflows

Context and Problem Statement

ML training pipelines (Kubeflow, Argo) produce metrics, parameters, and model artifacts that must be tracked across experiments. Without centralized tracking, comparing model performance, reproducing results, and managing model versions becomes ad hoc and error-prone.

How do we provide experiment tracking and model registry that integrates with both Kubeflow Pipelines and Argo Workflows?

Decision Drivers

  • Track metrics, parameters, and artifacts across all training runs
  • Compare experiments to select best models
  • Version models with metadata for deployment decisions
  • Integrate with both Kubeflow and Argo workflow engines
  • Python-native API (all ML code is Python)
  • Self-hosted with no external dependencies

Considered Options

  1. MLflow — Open-source experiment tracking and model registry
  2. Weights & Biases (W&B) — SaaS experiment tracking
  3. Neptune.ai — SaaS ML metadata store
  4. Kubeflow Metadata — Built-in Kubeflow tracking

Decision Outcome

Chosen option: MLflow, because it's open-source, self-hostable, has a mature Python SDK, and provides both experiment tracking and model registry in a single tool.
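In practice, a training script talks to the server with a handful of MLflow SDK calls. A minimal sketch (the experiment name and metric values are illustrative, not from the actual pipelines; the `mlflow` import is deferred so the pure helper below works without the SDK installed):

```python
def log_training_run(params, metrics, tracking_uri):
    """Record one training run's parameters and metrics in MLflow."""
    import mlflow  # deferred: only needed when actually logging

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("companions-classifier")  # illustrative name
    with mlflow.start_run():
        mlflow.log_params(params)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)


def best_run(runs, metric, maximize=True):
    """Pick the best of several finished runs by a metric -- the
    comparison the MLflow UI performs when selecting a model.
    Each run is a dict with a "metrics" sub-dict."""
    key = lambda r: r["metrics"][metric]
    return max(runs, key=key) if maximize else min(runs, key=key)
```

The same fluent API works unchanged whether the script runs in a Kubeflow pipeline step or an Argo Workflows container, which is what makes a single tracking backend viable for both engines.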

Positive Consequences

  • Self-hosted — no SaaS costs or external dependencies
  • Python SDK integrates naturally with training code
  • Model registry provides versioning with stage transitions
  • REST API enables integration from any workflow engine
  • Artifact storage on NFS provides shared access across pods

Negative Consequences

  • Another service to maintain (server + database + artifact storage)
  • Concurrent access to SQLite/file artifacts can be tricky (mitigated by PostgreSQL backend)
  • UI is functional but not as polished as commercial alternatives

Deployment Configuration

Chart: mlflow from https://community-charts.github.io/helm-charts
Namespace: mlflow
Server: uvicorn (gunicorn disabled)
Resources: 200m CPU / 512Mi memory requests, 1 CPU / 2Gi memory limits
Deployment strategy: Recreate

Backend Store

PostgreSQL via CloudNativePG:

  • 1 instance, amd64 node affinity
  • 10Gi Longhorn storage, max_connections: 200
  • Credentials from Vault via ExternalSecret
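MLflow's --backend-store-uri for this setup is a standard PostgreSQL DSN assembled from the Vault-sourced credentials. A sketch, assuming the ExternalSecret exposes the credentials as environment variables and the usual CloudNativePG read-write service name (the variable names and hostname are illustrative):

```python
import os
from urllib.parse import quote


def backend_store_uri(host="mlflow-postgres-rw.mlflow.svc", db="mlflow"):
    """Build the PostgreSQL DSN MLflow uses as its backend store.

    MLFLOW_DB_USER / MLFLOW_DB_PASSWORD are illustrative names for the
    credentials the ExternalSecret injects from Vault. The password is
    URL-encoded so special characters survive the DSN syntax.
    """
    user = os.environ["MLFLOW_DB_USER"]
    password = quote(os.environ["MLFLOW_DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"
```

Encoding the password matters in practice: Vault-generated secrets routinely contain `@` or `/`, which would otherwise break DSN parsing.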

Artifact Store

  • 50Gi NFS PVC (nfs-slow StorageClass, ReadWriteMany)
  • Mounted at /mlflow/artifacts
  • Proxied artifact storage (clients access via MLflow server, not directly)

NFS provides ReadWriteMany access so multiple training pods can write artifacts concurrently.
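With proxied artifact storage, clients never mount the NFS volume themselves: uploads and downloads go over HTTP to the tracking server, which reads and writes /mlflow/artifacts on their behalf, and proxied run artifact roots use MLflow's mlflow-artifacts:/ URI scheme. A sketch (run IDs and file names are illustrative; the `mlflow` import is deferred so the helper is testable without the SDK):

```python
def is_proxied(artifact_uri):
    """True when a run's artifact root is served through the tracking
    server rather than accessed directly by the client."""
    return artifact_uri.startswith("mlflow-artifacts:/")


def log_and_fetch_artifact(tracking_uri, run_id, local_path):
    """Upload a file through the MLflow server, then download it back.

    Both calls are plain HTTP to the server; the client needs no NFS
    mount or storage credentials.
    """
    import mlflow  # deferred: only needed when talking to the server

    mlflow.set_tracking_uri(tracking_uri)
    client = mlflow.MlflowClient()
    client.log_artifact(run_id, local_path)
    filename = local_path.rsplit("/", 1)[-1]
    # Returns a local path to the fetched copy
    return mlflow.artifacts.download_artifacts(run_id=run_id, artifact_path=filename)
```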

MLflow Utils Library

The mlflow/ repository contains mlflow_utils, a Python package that wraps the MLflow API for homelab-specific patterns:

  • client.py: MLflow client wrapper with homelab defaults
  • tracker.py: Experiment tracking with auto-logging
  • inference_tracker.py: Async, batched inference metrics logging
  • model_registry.py: Model versioning with KServe metadata
  • kfp_components.py: Kubeflow Pipeline components for MLflow
  • experiment_comparison.py: Compare runs across experiments
  • cli.py: CLI for common operations

This library is used by handler-base, Kubeflow pipelines, and Argo training workflows to provide consistent MLflow integration across the platform.
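The mlflow_utils API itself is not documented here, so the following is purely a hypothetical sketch of the kind of wrapper tracker.py might provide: a context manager that pins the in-cluster tracking URI and experiment conventions so every pipeline logs the same way. The class name, default URI, and parameters are all invented for illustration:

```python
class ExperimentTracker:
    """Hypothetical tracker.py-style wrapper. Centralizing the tracking
    URI and experiment setup here is what keeps Kubeflow, Argo, and
    handler-base integrations consistent."""

    DEFAULT_URI = "http://mlflow.mlflow.svc:5000"  # illustrative in-cluster address

    def __init__(self, experiment, tracking_uri=None):
        self.experiment = experiment
        self.tracking_uri = tracking_uri or self.DEFAULT_URI

    def __enter__(self):
        import mlflow  # deferred so the class imports without the SDK

        mlflow.set_tracking_uri(self.tracking_uri)
        mlflow.set_experiment(self.experiment)
        self._run = mlflow.start_run()
        return self._run

    def __exit__(self, *exc):
        import mlflow

        mlflow.end_run()
        return False
```

A caller would then write `with ExperimentTracker("my-experiment") as run: ...` and never repeat connection details in pipeline code.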

Integration Points

Kubeflow Pipelines ──→ mlflow_utils.kfp_components ──→ MLflow Server
                                                           │
Argo Workflows ──→ mlflow_utils.tracker ──→────────────────┤
                                                           │
handler-base ──→ mlflow_utils.inference_tracker ──→────────┤
                                                           ▼
                                                    ┌──────────────┐
                                                    │  PostgreSQL  │
                                                    │  (metadata)  │
                                                    ├──────────────┤
                                                    │  NFS         │
                                                    │  (artifacts) │
                                                    └──────────────┘

Access: mlflow.lab.daviestechlabs.io via envoy-internal gateway.
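Because the tracking server exposes a plain REST API, steps that cannot use the Python SDK (for example a minimal Argo Workflows container) can create runs directly over HTTP. A sketch that builds MLflow's documented runs/create request against the gateway hostname; the request is only constructed here, not sent, and the experiment ID is illustrative:

```python
import json
import time
from urllib.request import Request

MLFLOW_BASE = "https://mlflow.lab.daviestechlabs.io"


def create_run_request(experiment_id):
    """Build a POST for MLflow's run-creation endpoint
    (POST /api/2.0/mlflow/runs/create, per the MLflow REST API)."""
    payload = {
        "experiment_id": experiment_id,
        "start_time": int(time.time() * 1000),  # epoch milliseconds
    }
    return Request(
        f"{MLFLOW_BASE}/api/2.0/mlflow/runs/create",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it is one call away: urllib.request.urlopen(create_run_request("1"))
```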