# Training Strategy – Distributed CPU Now, DGX Spark Later

* Status: accepted
* Date: 2026-02-14
* Deciders: Billy
* Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware
## Context and Problem Statement

All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:

| Node | GPU | Accelerator | Serving |
|---|---|---|---|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |

Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has **14 nodes with spare CPU and RAM** that can be pooled for distributed training:

| Node | CPU | RAM | Architecture | Available for Training |
|------|-----|-----|--------------|------------------------|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |

**Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.**

Rather than training on a single node, we can **distribute training across all 14 nodes** using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly, until gradient synchronisation over the network becomes the bottleneck.

The NVIDIA **DGX Spark** (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first *dedicated* training accelerator in the cluster.

We need a training strategy that:

1. Works **today** with existing hardware (distributed CPU)
2. **Scales horizontally** — add more nodes or cores to reduce training time
3. Extends cleanly to a **DGX Spark** when available
4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
5. Does not impact inference serving

## Decision Drivers

* Zero GPU budget available for training today
* Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes)
* Training cadence is low (weekly/monthly, not continuous)
* Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
* Distributed training recovers time lost from being CPU-only
* Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
* DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
* Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)

## Considered Options

1. **Distributed CPU LoRA training via Ray Train + KubeRay RayJob**
2. **Single-node CPU training in a Kubeflow pipeline step**
3. **Reserve a GPU fraction for training (time-share with inference)**
4. **Offload training to cloud (Lambda Labs, RunPod, etc.)**
5. **Wait for DGX Spark before doing any training**

## Decision Outcome

Chosen option: **Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark**

A new `cpu_training_pipeline.py` Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a **KubeRay RayJob** that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's `TorchTrainer` handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).
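
The core of that training step can be sketched as follows — a minimal, illustrative configuration of Ray Train's `TorchTrainer` for CPU-only data parallelism (assumes `ray[train]` and the HF stack are installed on the workers; the function and script names are not the pipeline's actual code):

```python
# Sketch of the RayJob's training entry point (illustrative; assumes
# ray[train] plus torch/transformers/peft are installed via runtimeEnvYAML).

def build_trainer(num_workers: int = 6, cpus_per_worker: int = 4):
    """Configure a CPU-only, data-parallel TorchTrainer with the Gloo backend."""
    from ray.train import ScalingConfig
    from ray.train.torch import TorchConfig, TorchTrainer

    def train_func(config: dict):
        # Runs once per worker: load the base model in float32, attach the
        # LoRA adapter, and train with a Hugging Face Trainer wrapped by
        # ray.train.huggingface.transformers.prepare_trainer.
        # Rank 0 uploads the finished adapter to S3.
        ...

    return TorchTrainer(
        train_func,
        torch_config=TorchConfig(backend="gloo"),  # CPU collective backend
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            use_gpu=False,                          # GPUs stay on inference
            resources_per_worker={"CPU": cpus_per_worker},
        ),
    )
```

The same `build_trainer` shape carries over to the DGX Spark later: only the `ScalingConfig` changes.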

The pipeline follows an 8-step pattern:

1. Fetch PDFs from Quobjects S3
2. Prepare instruction-tuning dataset
3. Upload prepared data to S3 (shared storage for Ray workers)
4. Submit KubeRay RayJob (N distributed CPU workers)
5. Download trained adapter from S3
6. Sanity evaluation on CPU
7. Push adapter to Gitea
8. Log metrics to MLflow

### Positive Consequences

* Training starts immediately with zero additional hardware cost
* **Scales horizontally**: up to 14 nodes; Raspberry Pis can participate for small models
* GPUs remain 100% dedicated to inference — no latency impact
* Ephemeral Ray cluster: resources released immediately after training
* Adapters are small (10–100 MB) even from CPU training
* MLflow tracks all experiments regardless of compute backend
* DGX Spark upgrade is additive, not a rewrite

### Negative Consequences

* CPU training is slower than GPU even when distributed
* Each worker loads a full copy of the model (data parallelism, not model parallelism)
* Limited to small models (≤8B) per worker due to memory constraints
* float32 training uses ~2× the memory of bf16/fp16
* Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources

## Distributed Ray CPU Training Design

### Architecture

```
Kubeflow Pipeline (KFP component)
        │
        ├── Creates ConfigMap with training script
        ├── Creates KubeRay RayJob CR
        │        │
        │        ▼
        │   ┌──────────┐
        │   │ Ray Head │   (coordinator, 0 CPUs for training)
        │   └────┬─────┘
        │        │
        │   ┌────┴────┬─────────┬──────────┬───────────┬───────────┐
        │   ▼         ▼         ▼          ▼           ▼           ▼
        │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
        │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
        │ │khelben ││elmins. ││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
        │ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi │
        │ └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
        │     │         │         │          │           │           │
        │     └─────────┴─── AllReduce (Gloo) ───────────┴───────────┘
        │                          │
        │                    Adapter → S3
        │
        ├── Polls RayJob status until SUCCEEDED
        ├── Downloads adapter from S3
        └── Evaluate → Gitea → MLflow
```

### How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

**Note:** Raspberry Pi workers (4 GB RAM) only have ~3 GB free after OS/system overhead, and data parallelism loads a full model copy per worker. At float32 even a 3B model needs ~12 GB (see the model table below), so Pis can realistically only join runs for sub-1B models — or for ≤3B models if the base weights are loaded quantised (e.g. 4-bit). For 8B+ models, exclude Pi nodes from the RayJob spec.
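
The Pi constraint above is plain arithmetic; a throwaway check (hypothetical helper, 4 bytes per parameter for float32):

```python
def full_copy_fits(params_b: float, worker_ram_gb: float,
                   bytes_per_param: float = 4.0, overhead_gb: float = 1.0) -> bool:
    """Data parallelism loads a full model copy per worker — does it fit?"""
    weights_gb = params_b * bytes_per_param  # 1e9 params × bytes / 1e9 bytes/GB
    return weights_gb + overhead_gb <= worker_ram_gb

print(full_copy_fits(3.0, 4.0))                       # 3B float32 ≈ 12 GB → False on a 4 GB Pi
print(full_copy_fits(3.0, 4.0, bytes_per_param=0.5))  # 4-bit ≈ 1.5 GB → True
```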
Effective batch size = `per_device_batch_size × num_workers × gradient_accumulation_steps`. With 6 workers, the effective batch of `1 × 6 × 16 = 96` is competitive with GPU training batches.
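
That formula as a one-liner, with the 6-worker example from above:

```python
def effective_batch_size(per_device: int, num_workers: int, grad_accum: int) -> int:
    # Gradients are averaged across all workers at each optimizer step,
    # so the three factors simply multiply.
    return per_device * num_workers * grad_accum

print(effective_batch_size(1, 6, 16))  # → 96
```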

### RayJob Lifecycle

1. **Create**: KFP component creates a `RayJob` CR + `ConfigMap` (training script)
2. **Schedule**: KubeRay operator allocates head pod + N worker pods across nodes
3. **Install**: Workers install pip deps via `runtimeEnvYAML` (torch, peft, trl, etc.)
4. **Train**: `TorchTrainer` runs `train_func` on each worker under DDP, using Ray's `prepare_trainer` wrapper for the Hugging Face `Trainer`
5. **Save**: Rank-0 worker saves adapter + metadata to S3
6. **Teardown**: `shutdownAfterJobFinishes: true` destroys all pods on completion
7. **TTL**: RayJob CR auto-deleted after 300 seconds
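
Steps 3, 6, and 7 correspond directly to fields on the RayJob CR. A trimmed sketch (the field names are real KubeRay spec fields; the name, script path, and pip list are illustrative, and pod templates are elided):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: cpu-lora-train
spec:
  entrypoint: python /home/ray/train.py   # illustrative script path
  shutdownAfterJobFinishes: true          # step 6: teardown
  ttlSecondsAfterFinished: 300            # step 7: TTL
  runtimeEnvYAML: |                       # step 3: per-worker pip deps
    pip: ["torch", "transformers", "peft", "trl", "accelerate", "datasets"]
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"                     # head coordinates, never trains
      template: {}                        # pod template elided
    workerGroupSpecs:
      - groupName: cpu-workers
        replicas: 6
        template: {}                      # pod template elided
```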

### Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

```yaml
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi"  # Only for ≤3B models

# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
```

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
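
One way the pipeline could emit those tiers as heterogeneous `workerGroupSpecs` on the RayJob — a hypothetical helper, with tier names and replica counts chosen for illustration:

```python
# The resource tiers from the YAML above, as data (illustrative; not the
# pipeline's actual code).
WORKER_TIERS = {
    "large-x86":  {"replicas": 3, "cpu": "8", "memory": "32Gi"},
    "medium-x86": {"replicas": 2, "cpu": "4", "memory": "16Gi"},
    "small-x86":  {"replicas": 4, "cpu": "2", "memory": "8Gi"},
    "pi-arm64":   {"replicas": 5, "cpu": "2", "memory": "3Gi"},
}

def worker_group_specs(include_pis: bool = True) -> list[dict]:
    """Build one RayJob workerGroupSpec per tier, optionally excluding Pis."""
    specs = []
    for name, tier in WORKER_TIERS.items():
        if name == "pi-arm64" and not include_pis:
            continue  # model too large to fit a full copy in 3 Gi
        specs.append({
            "groupName": name,
            "replicas": tier["replicas"],
            "template": {"spec": {"containers": [{
                "name": "ray-worker",
                "resources": {"limits": {
                    "cpu": tier["cpu"], "memory": tier["memory"],
                }},
            }]}},
        })
    return specs
```

Dropping the Pi tier for 8B+ runs is then a single flag: `worker_group_specs(include_pis=False)`.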

### Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

```
KFP: prepare_data → /tmp/train.json ──upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                        │
Ray Worker 1: ←──download───────────────────────────────┤
Ray Worker 2: ←──download───────────────────────────────┤
Ray Worker N: ←──download───────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                        │
KFP: download_adapter ←─────────────────────────────────┘
```
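
The S3 layout above expressed as helpers (bucket and prefixes match the diagram; the helper names are hypothetical, and the boto3 call follows the library's real `upload_file(filename, bucket, key)` signature):

```python
# Key layout shared by the KFP steps and the Ray workers.
BUCKET = "training-data"

def data_key(run_id: str, filename: str = "train.json") -> str:
    return f"ray-training-runs/{run_id}/data/{filename}"

def adapter_prefix(run_id: str) -> str:
    return f"ray-training-runs/{run_id}/adapter/"

def upload_training_data(local_path: str, run_id: str) -> None:
    import boto3  # available in the python:3.13-slim pipeline steps
    boto3.client("s3").upload_file(local_path, BUCKET, data_key(run_id))

print(data_key("abc123"))  # → ray-training-runs/abc123/data/train.json
```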

### Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |

### Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

**Short answer: Not with data parallelism alone; it requires model parallelism (DeepSpeed ZeRO-3).**

With the current data-parallel design (Ray Train `TorchTrainer`), **every worker loads a full copy of the model**. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No — each worker needs ~60 GB+ once optimizer states and activations are included | Only khelben (94 GB) has the headroom |
| DeepSpeed ZeRO-3 | Model sharded across workers | **Possible** — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | **Possible** — PyTorch native | Same node constraints |

**Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):**

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|----------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| **Total** | **~276 GB** | **Sufficient for 70B NF4 + optimizer** |

A 70B model at 4-bit with LoRA optimizer states needs roughly ~60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but it introduces significant complexity:
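
A back-of-envelope check of that claim (node RAM from the pool table above; the 0.5 headroom factor is an assumption covering OS, activations, and uneven sharding):

```python
# RAM per node, from the feasible-pool table.
NODE_RAM_GB = {"khelben": 94, "elminster": 62, "danilo": 62,
               "wulfgar": 31, "drizzt": 27}

def zero3_feasible(model_plus_optimizer_gb: float, headroom: float = 0.5) -> bool:
    """ZeRO-3 shards model/optimizer state across workers, so only the
    pool's *total* usable RAM matters, not any single node's."""
    usable_gb = sum(NODE_RAM_GB.values()) * headroom
    return usable_gb >= model_plus_optimizer_gb

print(sum(NODE_RAM_GB.values()))  # → 276
print(zero3_feasible(80))         # 70B NF4 + LoRA optimizer states → True
```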

1. **DeepSpeed ZeRO-3** or **FSDP** replaces simple DDP data parallelism
2. **Network bandwidth** becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
3. **Training time** would be measured in days, not hours
4. **Raspberry Pis are excluded** — too little RAM for any shard of a 70B model
5. **10GbE or InfiniBand** would be strongly recommended for tolerable training speed

**Recommendation:** 70B training is technically possible with model parallelism across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for **≤8B models with data parallelism**, which works well across all node types.

### Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| `dtype` | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| `optim` | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| `batch_size` | 1 | 2 | Memory constrained per worker |
| `gradient_accumulation` | 16 | 8 | Compensate for small batch |
| `max_seq_length` | 1024 | 2048 | Halved to fit in RAM |
| `lora_r` | 16 | 64 | Fewer params, faster |
| `gradient_checkpointing` | true | false | Trade compute for memory |
| `no_cuda` | true | false | Explicit CPU-only |
| `backend` | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
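
The CPU column maps onto `transformers.TrainingArguments` / `trl` fields roughly as below — a sketch, not the pipeline's actual config. Note that recent transformers releases deprecate `no_cuda` in favour of `use_cpu`, and the Gloo backend is configured on Ray's `TorchConfig` rather than here:

```python
# CPU column of the table as keyword arguments (sketch).
CPU_TRAINING_ARGS = dict(
    fp16=False, bf16=False,              # i.e. train in float32
    optim="adamw_torch",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    use_cpu=True,                        # modern spelling of no_cuda=True
)
MAX_SEQ_LENGTH = 1024                    # passed to the SFT trainer
LORA_R = 16                              # passed to peft.LoraConfig
```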

### Pipeline Integration

```
Kubeflow UI / Argo cron trigger
        │
        ▼
cpu_training_pipeline.yaml
        │
        ├── 1. fetch_pdfs_from_s3        (python:3.13-slim, boto3)
        ├── 2. prepare_training_data     (python:3.13-slim, PyMuPDF)
        ├── 3. upload_data_to_s3         (python:3.13-slim, boto3)
        ├── 4. submit_ray_training_job   (python:3.13-slim, kubernetes + boto3)
        │       └── RayJob: N workers × (4 CPU, 16Gi)
        │           Image: rayproject/ray:2.44.1-py311-cpu
        │           Deps: torch, peft, trl, transformers, accelerate, datasets
        ├── 5. download_adapter_from_s3  (python:3.13-slim, boto3)
        ├── 6. evaluate_adapter_cpu      (python:3.13-slim, torch+peft)
        ├── 7. push_adapter_to_gitea     (python:3.13-slim, requests)
        └── 8. log_training_metrics      (python:3.13-slim, mlflow)
```

### RBAC Requirements

The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-rayjob-creator
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "delete"]
```
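
The role must also be bound to the service account the pipeline steps run as. A sketch — the SA name and namespace below are assumptions; substitute the identity your KFP executor actually uses:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeflow-rayjob-creator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-rayjob-creator
subjects:
  - kind: ServiceAccount
    name: pipeline-runner   # assumed SA name
    namespace: kubeflow     # assumed namespace
```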

## DGX Spark Upgrade Path

### Device Specifications

| Spec | Value |
|---|---|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |

### Integration Plan

1. **Network**: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
2. **Kubernetes**: Join the DGX Spark as a Kubernetes worker node with taints:

   ```yaml
   taints:
     - key: nvidia.com/training
       value: "true"
       effect: NoSchedule
   labels:
     node.kubernetes.io/instance-type: dgx-spark
     accelerator: blackwell
   ```

3. **KubeRay**: Add a dedicated `training` RayCluster (separate from the inference `RayService`) or submit `RayJob` resources that request the DGX Spark's GPU. The same `TorchTrainer` pattern applies — just change `use_gpu=True` in `ScalingConfig`.
4. **Pipeline**: Create `dgx_spark_training_pipeline.py` mirroring the CPU pipeline but with:
   - `set_accelerator_type("nvidia.com/gpu")`
   - `set_gpu_limit(1)`
   - `node_selector={"accelerator": "blackwell"}`
   - BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline)
   - bf16 compute dtype
   - Larger models: 8B–70B
5. **MLflow**: Same experiment tracking; tag runs with `training_device=dgx-spark`
6. **Scheduling**: Training jobs tolerate the `nvidia.com/training` taint; inference deployments do not. This guarantees workload isolation.
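
Put together, a training pod's spec would carry the matching toleration and selector — a fragment, assuming the taint and labels from step 2:

```yaml
tolerations:
  - key: nvidia.com/training
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  accelerator: blackwell
resources:
  limits:
    nvidia.com/gpu: 1
```

Inference deployments omit the toleration, so the scheduler can never place them on the DGX Spark.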

### Model Capacity on DGX Spark

| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |
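
The rule of thumb behind the table: NF4 stores roughly 0.5 bytes per parameter, and the quantisation block constants plus optimizer/activation overhead push the listed figures a few GB higher than the raw weight size:

```python
def nf4_weights_gb(params_b: float) -> float:
    # 4-bit NF4 ≈ 0.5 bytes per parameter; overhead not included.
    return params_b * 0.5

print(nf4_weights_gb(8))    # → 4.0  (table lists ~6 GB with overhead)
print(nf4_weights_gb(70))   # → 35.0 (table lists ~40 GB with overhead)
print(nf4_weights_gb(405))  # → 202.5 — exceeds 128 GB even before overhead
```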

## Migration Strategy

```
Phase 1 (now):    Distributed CPU via RayJob → small models (≤8B),
                  ~1–5 h per run, up to 14 workers across 14 nodes
                  Pis participate for small models, at no cost

Phase 2 (DGX):    GPU pipelines on DGX Spark → large models (≤70B),
                  minutes per run, dedicated hardware

Phase 3 (hybrid): CPU for lightweight experiments + DGX for
                  production fine-tunes; both report to MLflow
                  The training cluster can also include the DGX as a
                  Ray worker alongside CPU nodes for mixed runs
```

All phases share:

- Same Kubeflow Pipelines UI
- Same S3 data source
- Same Gitea adapter repositories
- Same MLflow experiment tracking
- Same evaluation pipeline
- Same Ray Train `TorchTrainer` API

## Links

* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay Unified GPU Backend
* Related: `kubeflow/qlora_pdf_pipeline.py` — Existing GPU QLoRA pipeline
* Related: `kubeflow/cpu_training_pipeline.py` — New distributed CPU training pipeline (this ADR)
* [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/dgx/spark/) — Product page
* [Ray Train TorchTrainer](https://docs.ray.io/en/latest/train/getting-started-pytorch.html) — Distributed training docs
* [PEFT LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) — LoRA documentation
|