From 200cc5704b2fb8ecc5f7103b68b911de75c8d6be Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Sun, 15 Feb 2026 11:19:09 -0500 Subject: [PATCH] docs: update node inventory and 70B QLoRA feasibility analysis --- ARCHITECTURE.md | 47 ++- .../0058-training-strategy-cpu-dgx-spark.md | 383 ++++++++++++++++++ 2 files changed, 420 insertions(+), 10 deletions(-) create mode 100644 decisions/0058-training-strategy-cpu-dgx-spark.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 727a9eb..af67679 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -115,10 +115,10 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw ┌─────────────────────────────────────────────────────────────────────────────┐ │ PLATFORM LAYER │ ├─────────────────────────────────────────────────────────────────────────────┤ -│ Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI │ +│ Talos Linux v1.12.x │ Kubernetes v1.35.0 │ Cilium CNI │ │ │ -│ Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt, │ -│ │ danilo (workers) │ +│ 14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers │ +│ │ 5 Raspberry Pi (arm64) workers │ └─────────────────────────────────────────────────────────────────────────────┘ ``` @@ -134,14 +134,41 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw **VIP**: 192.168.100.20 (shared across control plane) -### Worker Nodes +### Worker Nodes — GPU -| Node | IP | CPU | GPU | GPU Memory | Workload | -|------|-------|-----|-----|------------|----------| -| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS | -| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) | -| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings | -| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker | +| Node | IP | CPU | RAM | GPU | GPU Memory | Workload | +|------|-------|-----|-----|-----|------------|----------| +| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS | +| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) | +| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings | +| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker | + +### Worker Nodes — CPU-only (x86_64) + +| Node | IP | CPU | RAM | Workload | +|------|-------|-----|-----|----------| +| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads | +| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads | + +### Worker Nodes — Raspberry Pi (arm64) + +| Node | IP | CPU | RAM | Workload | +|------|-------|-----|-----|----------| +| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services | + +### Cluster Totals + +| Resource | Total | +|----------|-------| +| Nodes | 14 (3 control + 11 worker) | +| CPU cores | ~126 | +| System RAM | ~378 GB | +| Architectures | amd64, arm64 | +| GPUs | 4 (NVIDIA, AMD, Intel) | ## Networking diff --git a/decisions/0058-training-strategy-cpu-dgx-spark.md 
b/decisions/0058-training-strategy-cpu-dgx-spark.md new file mode 100644 index 0000000..a5ed8ea --- /dev/null +++ b/decisions/0058-training-strategy-cpu-dgx-spark.md @@ -0,0 +1,383 @@ +# Training Strategy – Distributed CPU Now, DGX Spark Later + +* Status: accepted +* Date: 2026-02-14 +* Deciders: Billy +* Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware + +## Context and Problem Statement + +All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay: + +| Node | GPU | Accelerator | Serving | +|---|---|---|---| +| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) | +| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) | +| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) | +| danilo | Intel Arc | Intel | BGE reranker (0.8) | + +Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has **14 nodes with spare CPU and RAM** that can be pooled for distributed training: + +| Node | CPU | RAM | Architecture | Available for Training | +|------|-----|-----|-------------|----------------------| +| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) | +| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) | +| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) | +| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) | +| regis | 4c | 16 GB | amd64 | Fully available | +| wulfgar | 4c | 31 GB | amd64 | Fully available | +| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) | + +**Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.** + +Rather than training on a single node, we can **distribute training across all 14 nodes** using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly. + +The NVIDIA **DGX Spark** (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first *dedicated* training accelerator in the cluster. + +We need a training strategy that: + +1. Works **today** with existing hardware (distributed CPU) +2. **Scales horizontally** — add more nodes or cores to reduce training time +3. Extends cleanly to a **DGX Spark** when available +4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack +5. 
Does not impact inference serving + +## Decision Drivers + +* Zero GPU budget available for training today +* Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes) +* Training cadence is low (weekly/monthly, not continuous) +* Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training +* Distributed training recovers time lost from being CPU-only +* Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images +* DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training +* Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos) + +## Considered Options + +1. **Distributed CPU LoRA training via Ray Train + KubeRay RayJob** +2. **Single-node CPU training in a Kubeflow pipeline step** +3. **Reserve a GPU fraction for training (time-share with inference)** +4. **Offload training to cloud (Lambda Labs, RunPod, etc.)** +5. **Wait for DGX Spark before doing any training** + +## Decision Outcome + +Chosen option: **Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark** + +A new `cpu_training_pipeline.py` Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a **KubeRay RayJob** that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's `TorchTrainer` handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend). + +The pipeline follows an 8-step pattern: + +1. Fetch PDFs from Quobjects S3 +2. Prepare instruction-tuning dataset +3. Upload prepared data to S3 (shared storage for Ray workers) +4. Submit KubeRay RayJob (N distributed CPU workers) +5. Download trained adapter from S3 +6. Sanity evaluation on CPU +7. Push adapter to Gitea +8. Log metrics to MLflow + +### Positive Consequences + +* Training starts immediately with zero additional hardware cost +* **Scales horizontally**: up to 14 nodes; Raspberry Pis can participate for small models +* GPUs remain 100% dedicated to inference — no latency impact +* Ephemeral Ray cluster: resources released immediately after training +* Adapters are small (10–100 MB) even from CPU training +* MLflow tracks all experiments regardless of compute backend +* DGX Spark upgrade is additive, not a rewrite + +### Negative Consequences + +* CPU training is slower than GPU even when distributed +* Each worker loads a full copy of the model (data parallelism, not model parallelism) +* Limited to small models (≤8B) per worker due to memory constraints +* float32 training uses ~2× the memory of bf16/fp16 +* Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources + +## Distributed Ray CPU Training Design + +### Architecture + +``` +Kubeflow Pipeline (KFP component) + │ + ├── Creates ConfigMap with training script + ├── Creates KubeRay RayJob CR + │ │ + │ ▼ + │ ┌──────────┐ + │ │ Ray Head │ (coordinator, 0 CPUs for training) + │ └────┬─────┘ + │ │ + │ ┌────┴──────────────────────────────────────────────────────┐ + │ │ │ │ │ │ │ + │ ▼ ▼ ▼ ▼ ▼ ▼ + │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐ + │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │ + │ │khelben ││elmins. 
││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
  │   │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi │
  │   └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
  │        │         │        │          │           │           │
  │        └─────────┴─── AllReduce (Gloo) ──────────┴───────────┘
  │                              │
  │                        Adapter → S3
  │
  ├── Polls RayJob status until SUCCEEDED
  ├── Downloads adapter from S3
  └── Evaluate → Gitea → MLflow
```

### How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

**Note:** Raspberry Pi workers (4 GB RAM) can only hold what fits in ~3 GB after OS/system overhead. Even a 3B model's float32 weights are ~12 GB (see the Model Selection table below), so Pi workers can only join a run if the frozen base weights are loaded quantised (int8/4-bit) or the model is well under 1B parameters; for 8B+ models, exclude Pi nodes from the RayJob spec entirely.

Effective batch size = `per_device_batch_size × num_workers × gradient_accumulation_steps`. With 6 workers, the effective batch of `1 × 6 × 16 = 96` is competitive with GPU training batches.

### RayJob Lifecycle

1. **Create**: KFP component creates a `RayJob` CR + `ConfigMap` (training script)
2. **Schedule**: KubeRay operator allocates head pod + N worker pods across nodes
3. **Install**: Workers install pip deps via `runtimeEnvYAML` (torch, peft, trl, etc.)
4. **Train**: `TorchTrainer` runs `train_func` on each worker with DDP and `prepare_trainer` integration for HuggingFace
5. **Save**: Rank-0 worker saves adapter + metadata to S3
6. **Teardown**: `shutdownAfterJobFinishes: true` destroys all pods on completion
7. **TTL**: RayJob CR auto-deleted after 300 seconds

### Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

```yaml
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi"   # smallest models only (see the note under "How It Scales")

# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
```

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.

### Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

```
KFP: prepare_data → /tmp/train.json ───upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                        │
Ray Worker 1: ←──download──────────────────────────────┤
Ray Worker 2: ←──download──────────────────────────────┤
Ray Worker N: ←──download──────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                          │
KFP: download_adapter ←──────────────────────────────────┘
```
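As a concrete reference for the **Train** step above, here is a minimal sketch of the `train_func` each Ray worker could run with the CPU-optimised settings from this ADR. It is illustrative rather than the pipeline's actual script: the base model name, the `"text"` dataset field, and the local file paths are assumptions, and the real script additionally downloads the prepared dataset from S3 and, on rank 0, uploads the finished adapter.

```python
# Sketch of a data-parallel CPU LoRA run with Ray Train + HuggingFace (illustrative values).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.huggingface.transformers import RayTrainReportCallback, prepare_trainer


def train_func(config):
    base_model = config["base_model"]
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Every data-parallel worker holds a full float32 copy of the base model.
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float32)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
    model.enable_input_require_grads()    # required when combining PEFT with gradient checkpointing

    # Assumes a prepared JSON dataset with a "text" field, already present locally.
    dataset = load_dataset("json", data_files=config["train_file"])["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True, remove_columns=dataset.column_names)

    args = TrainingArguments(
        output_dir="/tmp/adapter",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        optim="adamw_torch",
        num_train_epochs=config["epochs"],
        no_cuda=True,                     # use_cpu=True on newer transformers releases
        ddp_backend="gloo",               # CPU AllReduce backend
        report_to=[],
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.add_callback(RayTrainReportCallback())   # surfaces per-step metrics to Ray Train
    trainer = prepare_trainer(trainer)               # wires the HF Trainer into Ray's DDP setup
    trainer.train()


trainer = TorchTrainer(
    train_func,
    train_loop_config={"base_model": "Qwen/Qwen2.5-3B-Instruct",   # illustrative choice
                       "train_file": "/tmp/train.json", "epochs": 3},
    scaling_config=ScalingConfig(num_workers=6, use_gpu=False,
                                 resources_per_worker={"CPU": 8}),
)
trainer.fit()
```

In the real pipeline this function is shipped to the workers via the ConfigMap created in step 1 of the lifecycle above, so the worker count and per-worker resources can change between runs without touching the script.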
### Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |

### Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

**Short answer: not with data parallelism alone; it requires sharding the model's parameters across workers (DeepSpeed ZeRO-3 or FSDP).**

With the current data-parallel design (Ray Train `TorchTrainer`), **every worker loads a full copy of the model**. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No — need ≥48 GB per worker | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | **Possible** — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | **Possible** — PyTorch native | Same node constraints |

**Feasible pool for 70B parameter sharding (x86 nodes with ≥27 GB RAM):**

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|---------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| **Total** | **~276 GB** | **Sufficient for 70B NF4 + optimizer** |

A 70B model at 4-bit with LoRA optimizer states needs roughly 60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but introduces significant complexity:

1. **DeepSpeed ZeRO-3** or **FSDP** replaces simple DDP data parallelism
2. **Network bandwidth** becomes critical — Gloo collectives over 1GbE are very slow when parameter shards move between workers every step
3. **Training time** would be measured in days, not hours
4. **Raspberry Pis are excluded** — too little RAM for any shard of a 70B model
5. **10GbE or InfiniBand** would be strongly recommended for tolerable training speed

**Recommendation:** 70B training is technically possible with ZeRO-3/FSDP sharding across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for **≤8B models with data parallelism**, which works well across all node types.

### Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| `dtype` | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| `optim` | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| `batch_size` | 1 | 2 | Memory constrained per worker |
| `gradient_accumulation` | 16 | 8 | Compensate for small batch |
| `max_seq_length` | 1024 | 2048 | Halved to fit in RAM |
| `lora_r` | 16 | 64 | Fewer params, faster |
| `gradient_checkpointing` | true | false | Trade compute for memory |
| `no_cuda` | true | false | Explicit CPU-only |
| `backend` | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
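In the pipeline listed below, step 4 (`submit_ray_training_job`) is the only component that talks to the Kubernetes API directly. As a rough illustration of what that component might do, here is a hedged sketch of creating the RayJob CR with the Kubernetes Python client. The namespace, object names, entrypoint path, replica count, and resource sizes are placeholders; the real component also mounts the ConfigMap holding the training script, defines one worker group per node class rather than the single group shown here, and then polls the CR until it reports `SUCCEEDED`.

```python
# Illustrative sketch: create an ephemeral RayJob from inside a KFP component.
# All names, sizes, and the namespace are assumptions, not the pipeline's actual values.
from kubernetes import client, config

config.load_incluster_config()          # the component runs inside the cluster


def ray_pod(cpu: str, memory: str) -> dict:
    """Minimal pod template for a Ray head or worker group."""
    return {"spec": {"containers": [{
        "name": "ray",
        "image": "rayproject/ray:2.44.1-py311-cpu",
        "resources": {"requests": {"cpu": cpu, "memory": memory},
                      "limits": {"cpu": cpu, "memory": memory}},
    }]}}


rayjob = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "cpu-lora-train", "namespace": "kubeflow"},
    "spec": {
        "entrypoint": "python /home/ray/scripts/train.py",
        "shutdownAfterJobFinishes": True,   # tear everything down when training ends
        "ttlSecondsAfterFinished": 300,     # then delete the RayJob CR itself
        "runtimeEnvYAML": "pip: [torch, transformers, datasets, accelerate, peft, trl]",
        "rayClusterSpec": {
            "headGroupSpec": {"rayStartParams": {}, "template": ray_pod("2", "4Gi")},
            "workerGroupSpecs": [{
                "groupName": "cpu-workers",
                "replicas": 6, "minReplicas": 6, "maxReplicas": 6,
                "rayStartParams": {},
                "template": ray_pod("8", "32Gi"),
            }],
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="kubeflow", plural="rayjobs", body=rayjob)
```

### Pipeline Integration

```
Kubeflow UI / Argo cron trigger
        │
        ▼
  cpu_training_pipeline.yaml
        │
        ├── 1. fetch_pdfs_from_s3        (python:3.13-slim, boto3)
        ├── 2. prepare_training_data     (python:3.13-slim, PyMuPDF)
        ├── 3. upload_data_to_s3         (python:3.13-slim, boto3)
        ├── 4. 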
submit_ray_training_job (python:3.13-slim, kubernetes + boto3) + │ └── RayJob: N workers × (4 CPU, 16Gi) + │ Image: rayproject/ray:2.44.1-py311-cpu + │ Deps: torch, peft, trl, transformers, accelerate, datasets + ├── 5. download_adapter_from_s3 (python:3.13-slim, boto3) + ├── 6. evaluate_adapter_cpu (python:3.13-slim, torch+peft) + ├── 7. push_adapter_to_gitea (python:3.13-slim, requests) + └── 8. log_training_metrics (python:3.13-slim, mlflow) +``` + +### RBAC Requirements + +The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: kubeflow-rayjob-creator +rules: + - apiGroups: ["ray.io"] + resources: ["rayjobs"] + verbs: ["create", "get", "list", "watch", "delete"] + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["create", "delete"] +``` + +## DGX Spark Upgrade Path + +### Device Specifications + +| Spec | Value | +|---|---| +| SoC | GB10 Grace Blackwell Superchip | +| CPU | NVIDIA Grace (Arm), 20 cores | +| GPU | Blackwell, CUDA compute capability 12.0 | +| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) | +| AI Performance | 1 PFLOPS FP4 | +| Connectivity | USB-C, 10GbE, Wi-Fi 7 | +| OS | Ubuntu-based DGX OS (Linux) | +| Power | <150 W | + +### Integration Plan + +1. **Network**: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network +2. **Kubernetes**: Join the DGX Spark as a Kubernetes worker node with taints: + ```yaml + taints: + - key: nvidia.com/training + value: "true" + effect: NoSchedule + labels: + node.kubernetes.io/instance-type: dgx-spark + accelerator: blackwell + ``` +3. **KubeRay**: Add a dedicated `training` RayCluster (separate from the inference `RayService`) or submit `RayJob` resources that request the DGX Spark's GPU. The same `TorchTrainer` pattern applies — just change `use_gpu=True` in `ScalingConfig`. +4. **Pipeline**: Create `dgx_spark_training_pipeline.py` mirroring the CPU pipeline but with: + - `set_accelerator_type("nvidia.com/gpu")` + - `set_gpu_limit(1)` + - `node_selector={"accelerator": "blackwell"}` + - BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline) + - bf16 compute dtype + - Larger models: 8B–70B +5. **MLflow**: Same experiment tracking; tag runs with `training_device=dgx-spark` +6. **Scheduling**: Training jobs tolerate the `nvidia.com/training` taint; inference deployments do not. This guarantees workload isolation. + +### Model Capacity on DGX Spark + +| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? 
| +|---|---|---|---|---| +| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) | +| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only | +| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only | +| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA | +| Llama 3.1 405B | 405B | ~230 GB | N/A | No | + +## Migration Strategy + +``` +Phase 1 (now): Distributed CPU via RayJob → small models (≤8B), + ~1–5 h per run, up to 11 workers across 14 nodes + Pis participate for ≤3B models, free + +Phase 2 (DGX): GPU pipelines on DGX Spark → large models (≤70B), + minutes per run, dedicated hardware + +Phase 3 (hybrid): CPU for lightweight experiments + DGX for + production fine-tunes; both report to MLflow + Training cluster can also include DGX as a + Ray worker alongside CPU nodes for mixed runs +``` + +All phases share: +- Same Kubeflow Pipelines UI +- Same S3 data source +- Same Gitea adapter repositories +- Same MLflow experiment tracking +- Same evaluation pipeline +- Same Ray Train `TorchTrainer` API + +## Links + +* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD +* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay Unified GPU Backend +* Related: `kubeflow/qlora_pdf_pipeline.py` — Existing GPU QLoRA pipeline +* Related: `kubeflow/cpu_training_pipeline.py` — New distributed CPU training pipeline (this ADR) +* [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/dgx/spark/) — Product page +* [Ray Train TorchTrainer](https://docs.ray.io/en/latest/train/getting-started-pytorch.html) — Distributed training docs +* [PEFT LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) — LoRA documentation