docs: add ADRs 0043-0053 covering remaining architecture gaps
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
New ADRs:

- 0043: Cilium CNI and Network Fabric
- 0044: DNS and External Access Architecture
- 0045: TLS Certificate Strategy (cert-manager)
- 0046: Companions Frontend Architecture
- 0047: MLflow Experiment Tracking and Model Registry
- 0048: Entertainment and Media Stack
- 0049: Self-Hosted Productivity Suite
- 0050: Argo Rollouts Progressive Delivery
- 0051: KEDA Event-Driven Autoscaling
- 0052: Cluster Utilities (Spegel, Descheduler, Reloader, CSI-NFS)
- 0053: Vaultwarden Password Management

README updated with table entries and badge count (53 total).
This commit is contained in:

decisions/0047-mlflow-experiment-tracking.md (new file, 114 lines)

@@ -0,0 +1,114 @@
# MLflow Experiment Tracking and Model Registry

* Status: accepted
* Date: 2026-02-09
* Deciders: Billy
* Technical Story: Provide centralized experiment tracking, model versioning, and artifact storage for ML workflows

## Context and Problem Statement

ML training pipelines (Kubeflow, Argo) produce metrics, parameters, and model artifacts that must be tracked across experiments. Without centralized tracking, comparing model performance, reproducing results, and managing model versions become ad hoc and error-prone.

How do we provide experiment tracking and a model registry that integrate with both Kubeflow Pipelines and Argo Workflows?

## Decision Drivers

* Track metrics, parameters, and artifacts across all training runs
* Compare experiments to select best models
* Version models with metadata for deployment decisions
* Integrate with both Kubeflow and Argo workflow engines
* Python-native API (all ML code is Python)
* Self-hosted with no external dependencies

## Considered Options

1. **MLflow** — Open-source experiment tracking and model registry
2. **Weights & Biases (W&B)** — SaaS experiment tracking
3. **Neptune.ai** — SaaS ML metadata store
4. **Kubeflow Metadata** — Built-in Kubeflow tracking

## Decision Outcome

Chosen option: **MLflow**, because it's open-source, self-hostable, has a mature Python SDK, and provides both experiment tracking and a model registry in a single tool.

### Positive Consequences

* Self-hosted — no SaaS costs or external dependencies
* Python SDK integrates naturally with training code
* Model registry provides versioning with stage transitions
* REST API enables integration from any workflow engine
* Artifact storage on NFS provides shared access across pods

### Negative Consequences

* Another service to maintain (server + database + artifact storage)
* Concurrent access to SQLite/file artifacts can be tricky (mitigated by the PostgreSQL backend)
* UI is functional but not as polished as commercial alternatives

## Deployment Configuration

| Setting | Value |
|---|---|
| **Chart** | `mlflow` from `https://community-charts.github.io/helm-charts` |
| **Namespace** | `mlflow` |
| **Server** | uvicorn (gunicorn disabled) |
| **Resources** | 200m CPU / 512Mi memory request → 1 CPU / 2Gi limit |
| **Strategy** | Recreate |

||||
### Backend Store

PostgreSQL via **CloudNativePG**:

- 1 instance, `amd64` node affinity
- 10Gi Longhorn storage, `max_connections: 200`
- Credentials from Vault via ExternalSecret

### Artifact Store

- 50Gi NFS PVC (`nfs-slow` StorageClass, ReadWriteMany)
- Mounted at `/mlflow/artifacts`
- Proxied artifact storage (clients access artifacts via the MLflow server, not directly)

NFS provides ReadWriteMany access, so multiple training pods can write artifacts concurrently.

## MLflow Utils Library

The `mlflow/` repository contains `mlflow_utils`, a Python package that wraps the MLflow API for homelab-specific patterns:

| Module | Purpose |
|--------|---------|
| `client.py` | MLflow client wrapper with homelab defaults |
| `tracker.py` | Experiment tracking with auto-logging |
| `inference_tracker.py` | Async, batched inference metrics logging |
| `model_registry.py` | Model versioning with KServe metadata |
| `kfp_components.py` | Kubeflow Pipeline components for MLflow |
| `experiment_comparison.py` | Compare runs across experiments |
| `cli.py` | CLI for common operations |

This library is used by `handler-base`, Kubeflow pipelines, and Argo training workflows to provide consistent MLflow integration across the platform.

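This ADR does not show `mlflow_utils` internals, so the following is a hypothetical pure-Python sketch of the batching pattern that an async inference-metrics logger like `inference_tracker.py` implies: buffer per-request metrics and flush them in batches instead of issuing one tracking call per inference. All names are invented; a real implementation would flush to MLflow, e.g. via `MlflowClient.log_batch`.

```python
import threading
from collections.abc import Callable

class BatchedMetricLogger:
    """Buffer inference metrics and flush them in batches.

    `flush_fn` stands in for a real sink such as MlflowClient.log_batch;
    it receives a list of (key, value, step) tuples.
    """

    def __init__(self, flush_fn: Callable[[list], None], batch_size: int = 100):
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer: list[tuple[str, float, int]] = []
        self._lock = threading.Lock()

    def log(self, key: str, value: float, step: int) -> None:
        # Cheap per-request path: append under a lock, flush only when full.
        with self._lock:
            self._buffer.append((key, value, step))
            if len(self._buffer) >= self._batch_size:
                self._flush_locked()

    def flush(self) -> None:
        # Drain any remaining metrics, e.g. on shutdown.
        with self._lock:
            self._flush_locked()

    def _flush_locked(self) -> None:
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []
```

Batching matters for inference because per-request tracking calls would add a network round-trip to every prediction served by `handler-base`.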
## Integration Points

```
Kubeflow Pipelines ──→ mlflow_utils.kfp_components ──→ MLflow Server
                                                            │
Argo Workflows ──→ mlflow_utils.tracker ──→─────────────────┤
                                                            │
handler-base ──→ mlflow_utils.inference_tracker ──→─────────┤
                                                            ▼
                                                    ┌──────────────┐
                                                    │  PostgreSQL  │
                                                    │  (metadata)  │
                                                    ├──────────────┤
                                                    │     NFS      │
                                                    │  (artifacts) │
                                                    └──────────────┘
```

**Access:** `mlflow.lab.daviestechlabs.io` via envoy-internal gateway.

## Links

* Related to [ADR-0009](0009-dual-workflow-engines.md) (Argo + Kubeflow workflows)
* Related to [ADR-0027](0027-database-strategy.md) (CNPG PostgreSQL)
* Related to [ADR-0026](0026-storage-strategy.md) (NFS artifact storage)
* [MLflow Documentation](https://mlflow.org/docs/latest/)