# Training Strategy – Distributed CPU Now, DGX Spark Later

* Status: accepted
* Date: 2026-02-14
* Deciders: Billy
* Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware
## Context and Problem Statement

All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:

| Node | GPU | Accelerator | Serving |
|---|---|---|---|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |

Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has **14 nodes with spare CPU and RAM** that can be pooled for distributed training:

| Node | CPU | RAM | Architecture | Available for Training |
|------|-----|-----|--------------|------------------------|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |

**Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.**

Rather than training on a single node, we can **distribute training across all 14 nodes** using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly, until gradient synchronisation over the network becomes the bottleneck.

The NVIDIA **DGX Spark** (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first *dedicated* training accelerator in the cluster.

We need a training strategy that:

1. Works **today** with existing hardware (distributed CPU)
2. **Scales horizontally** — add more nodes or cores to reduce training time
3. Extends cleanly to a **DGX Spark** when available
4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
5. Does not impact inference serving

## Decision Drivers

* Zero GPU budget available for training today
* Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes)
* Training cadence is low (weekly/monthly, not continuous)
* Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
* Distributed training recovers time lost from being CPU-only
* Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
* DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
* Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)

## Considered Options

1. **Distributed CPU LoRA training via Ray Train + KubeRay RayJob**
2. **Single-node CPU training in a Kubeflow pipeline step**
3. **Reserve a GPU fraction for training (time-share with inference)**
4. **Offload training to cloud (Lambda Labs, RunPod, etc.)**
5. **Wait for DGX Spark before doing any training**

## Decision Outcome

Chosen option: **Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark**

A new `cpu_training_pipeline.py` Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a **KubeRay RayJob** that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's `TorchTrainer` handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).
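
The core of that training step can be sketched as follows — a minimal, illustrative configuration of Ray Train's `TorchTrainer` for CPU-only data parallelism (assumes `ray[train]` and the HF stack are installed on the workers; the function and script names are not the pipeline's actual code):

```python
# Sketch of the RayJob's training entry point (illustrative; assumes
# ray[train] plus torch/transformers/peft are installed via runtimeEnvYAML).

def build_trainer(num_workers: int = 6, cpus_per_worker: int = 4):
    """Configure a CPU-only, data-parallel TorchTrainer with the Gloo backend."""
    from ray.train import ScalingConfig
    from ray.train.torch import TorchConfig, TorchTrainer

    def train_func(config: dict):
        # Runs once per worker: load the base model in float32, attach the
        # LoRA adapter, and train with a Hugging Face Trainer wrapped by
        # ray.train.huggingface.transformers.prepare_trainer.
        # Rank 0 uploads the finished adapter to S3.
        ...

    return TorchTrainer(
        train_func,
        torch_config=TorchConfig(backend="gloo"),  # CPU collective backend
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            use_gpu=False,                          # GPUs stay on inference
            resources_per_worker={"CPU": cpus_per_worker},
        ),
    )
```

The same `build_trainer` shape carries over to the DGX Spark later: only the `ScalingConfig` changes.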

The pipeline follows an 8-step pattern:

1. Fetch PDFs from Quobjects S3
2. Prepare instruction-tuning dataset
3. Upload prepared data to S3 (shared storage for Ray workers)
4. Submit KubeRay RayJob (N distributed CPU workers)
5. Download trained adapter from S3
6. Sanity evaluation on CPU
7. Push adapter to Gitea
8. Log metrics to MLflow

### Positive Consequences

* Training starts immediately with zero additional hardware cost
* **Scales horizontally**: up to 14 nodes; Raspberry Pis can participate for small models
* GPUs remain 100% dedicated to inference — no latency impact
* Ephemeral Ray cluster: resources released immediately after training
* Adapters are small (10–100 MB) even from CPU training
* MLflow tracks all experiments regardless of compute backend
* DGX Spark upgrade is additive, not a rewrite

### Negative Consequences

* CPU training is slower than GPU even when distributed
* Each worker loads a full copy of the model (data parallelism, not model parallelism)
* Limited to small models (≤8B) per worker due to memory constraints
* float32 training uses ~2× the memory of bf16/fp16
* Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources

## Distributed Ray CPU Training Design

### Architecture

```
Kubeflow Pipeline (KFP component)
        │
        ├── Creates ConfigMap with training script
        ├── Creates KubeRay RayJob CR
        │        │
        │        ▼
        │   ┌──────────┐
        │   │ Ray Head │   (coordinator, 0 CPUs for training)
        │   └────┬─────┘
        │        │
        │   ┌────┴────┬─────────┬──────────┬───────────┬───────────┐
        │   ▼         ▼         ▼          ▼           ▼           ▼
        │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
        │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
        │ │khelben ││elmins. ││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
        │ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi │
        │ └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
        │     │         │         │          │           │           │
        │     └─────────┴─── AllReduce (Gloo) ───────────┴───────────┘
        │                          │
        │                    Adapter → S3
        │
        ├── Polls RayJob status until SUCCEEDED
        ├── Downloads adapter from S3
        └── Evaluate → Gitea → MLflow
```

### How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

**Note:** Raspberry Pi workers (4 GB RAM) only have ~3 GB free after OS/system overhead, and data parallelism loads a full model copy per worker. At float32 even a 3B model needs ~12 GB (see the model table below), so Pis can realistically only join runs for sub-1B models — or for ≤3B models if the base weights are loaded quantised (e.g. 4-bit). For 8B+ models, exclude Pi nodes from the RayJob spec.
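
The Pi constraint above is plain arithmetic; a throwaway check (hypothetical helper, 4 bytes per parameter for float32):

```python
def full_copy_fits(params_b: float, worker_ram_gb: float,
                   bytes_per_param: float = 4.0, overhead_gb: float = 1.0) -> bool:
    """Data parallelism loads a full model copy per worker — does it fit?"""
    weights_gb = params_b * bytes_per_param  # 1e9 params × bytes / 1e9 bytes/GB
    return weights_gb + overhead_gb <= worker_ram_gb

print(full_copy_fits(3.0, 4.0))                       # 3B float32 ≈ 12 GB → False on a 4 GB Pi
print(full_copy_fits(3.0, 4.0, bytes_per_param=0.5))  # 4-bit ≈ 1.5 GB → True
```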
Effective batch size = `per_device_batch_size × num_workers × gradient_accumulation_steps`. With 6 workers, the effective batch of `1 × 6 × 16 = 96` is competitive with GPU training batches.
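
That formula as a one-liner, with the 6-worker example from above:

```python
def effective_batch_size(per_device: int, num_workers: int, grad_accum: int) -> int:
    # Gradients are averaged across all workers at each optimizer step,
    # so the three factors simply multiply.
    return per_device * num_workers * grad_accum

print(effective_batch_size(1, 6, 16))  # → 96
```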

### RayJob Lifecycle

1. **Create**: KFP component creates a `RayJob` CR + `ConfigMap` (training script)
2. **Schedule**: KubeRay operator allocates head pod + N worker pods across nodes
3. **Install**: Workers install pip deps via `runtimeEnvYAML` (torch, peft, trl, etc.)
4. **Train**: `TorchTrainer` runs `train_func` on each worker under DDP, using Ray's `prepare_trainer` wrapper for the Hugging Face `Trainer`
5. **Save**: Rank-0 worker saves adapter + metadata to S3
6. **Teardown**: `shutdownAfterJobFinishes: true` destroys all pods on completion
7. **TTL**: RayJob CR auto-deleted after 300 seconds
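
Steps 3, 6, and 7 correspond directly to fields on the RayJob CR. A trimmed sketch (the field names are real KubeRay spec fields; the name, script path, and pip list are illustrative, and pod templates are elided):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: cpu-lora-train
spec:
  entrypoint: python /home/ray/train.py   # illustrative script path
  shutdownAfterJobFinishes: true          # step 6: teardown
  ttlSecondsAfterFinished: 300            # step 7: TTL
  runtimeEnvYAML: |                       # step 3: per-worker pip deps
    pip: ["torch", "transformers", "peft", "trl", "accelerate", "datasets"]
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"                     # head coordinates, never trains
      template: {}                        # pod template elided
    workerGroupSpecs:
      - groupName: cpu-workers
        replicas: 6
        template: {}                      # pod template elided
```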

### Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

```yaml
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi"  # Only for ≤3B models

# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
```

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
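
One way the pipeline could emit those tiers as heterogeneous `workerGroupSpecs` on the RayJob — a hypothetical helper, with tier names and replica counts chosen for illustration:

```python
# The resource tiers from the YAML above, as data (illustrative; not the
# pipeline's actual code).
WORKER_TIERS = {
    "large-x86":  {"replicas": 3, "cpu": "8", "memory": "32Gi"},
    "medium-x86": {"replicas": 2, "cpu": "4", "memory": "16Gi"},
    "small-x86":  {"replicas": 4, "cpu": "2", "memory": "8Gi"},
    "pi-arm64":   {"replicas": 5, "cpu": "2", "memory": "3Gi"},
}

def worker_group_specs(include_pis: bool = True) -> list[dict]:
    """Build one RayJob workerGroupSpec per tier, optionally excluding Pis."""
    specs = []
    for name, tier in WORKER_TIERS.items():
        if name == "pi-arm64" and not include_pis:
            continue  # model too large to fit a full copy in 3 Gi
        specs.append({
            "groupName": name,
            "replicas": tier["replicas"],
            "template": {"spec": {"containers": [{
                "name": "ray-worker",
                "resources": {"limits": {
                    "cpu": tier["cpu"], "memory": tier["memory"],
                }},
            }]}},
        })
    return specs
```

Dropping the Pi tier for 8B+ runs is then a single flag: `worker_group_specs(include_pis=False)`.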

### Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

```
KFP: prepare_data → /tmp/train.json ──upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                        │
Ray Worker 1: ←──download───────────────────────────────┤
Ray Worker 2: ←──download───────────────────────────────┤
Ray Worker N: ←──download───────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                        │
KFP: download_adapter ←─────────────────────────────────┘
```
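
The S3 layout above expressed as helpers (bucket and prefixes match the diagram; the helper names are hypothetical, and the boto3 call follows the library's real `upload_file(filename, bucket, key)` signature):

```python
# Key layout shared by the KFP steps and the Ray workers.
BUCKET = "training-data"

def data_key(run_id: str, filename: str = "train.json") -> str:
    return f"ray-training-runs/{run_id}/data/{filename}"

def adapter_prefix(run_id: str) -> str:
    return f"ray-training-runs/{run_id}/adapter/"

def upload_training_data(local_path: str, run_id: str) -> None:
    import boto3  # available in the python:3.13-slim pipeline steps
    boto3.client("s3").upload_file(local_path, BUCKET, data_key(run_id))

print(data_key("abc123"))  # → ray-training-runs/abc123/data/train.json
```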

### Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |

### Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

**Short answer: Not with data parallelism alone; it requires model parallelism (DeepSpeed ZeRO-3).**

With the current data-parallel design (Ray Train `TorchTrainer`), **every worker loads a full copy of the model**. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No — each worker needs ~60 GB+ once optimizer states and activations are included | Only khelben (94 GB) has the headroom |
| DeepSpeed ZeRO-3 | Model sharded across workers | **Possible** — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | **Possible** — PyTorch native | Same node constraints |

**Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):**

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|----------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| **Total** | **~276 GB** | **Sufficient for 70B NF4 + optimizer** |

A 70B model at 4-bit with LoRA optimizer states needs roughly ~60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but it introduces significant complexity:
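
A back-of-envelope check of that claim (node RAM from the pool table above; the 0.5 headroom factor is an assumption covering OS, activations, and uneven sharding):

```python
# RAM per node, from the feasible-pool table.
NODE_RAM_GB = {"khelben": 94, "elminster": 62, "danilo": 62,
               "wulfgar": 31, "drizzt": 27}

def zero3_feasible(model_plus_optimizer_gb: float, headroom: float = 0.5) -> bool:
    """ZeRO-3 shards model/optimizer state across workers, so only the
    pool's *total* usable RAM matters, not any single node's."""
    usable_gb = sum(NODE_RAM_GB.values()) * headroom
    return usable_gb >= model_plus_optimizer_gb

print(sum(NODE_RAM_GB.values()))  # → 276
print(zero3_feasible(80))         # 70B NF4 + LoRA optimizer states → True
```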

1. **DeepSpeed ZeRO-3** or **FSDP** replaces simple DDP data parallelism
2. **Network bandwidth** becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
3. **Training time** would be measured in days, not hours
4. **Raspberry Pis are excluded** — too little RAM for any shard of a 70B model
5. **10GbE or InfiniBand** would be strongly recommended for tolerable training speed

**Recommendation:** 70B training is technically possible with model parallelism across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for **≤8B models with data parallelism**, which works well across all node types.

### Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| `dtype` | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| `optim` | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| `batch_size` | 1 | 2 | Memory constrained per worker |
| `gradient_accumulation` | 16 | 8 | Compensate for small batch |
| `max_seq_length` | 1024 | 2048 | Halved to fit in RAM |
| `lora_r` | 16 | 64 | Fewer params, faster |
| `gradient_checkpointing` | true | false | Trade compute for memory |
| `no_cuda` | true | false | Explicit CPU-only |
| `backend` | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
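
The CPU column maps onto `transformers.TrainingArguments` / `trl` fields roughly as below — a sketch, not the pipeline's actual config. Note that recent transformers releases deprecate `no_cuda` in favour of `use_cpu`, and the Gloo backend is configured on Ray's `TorchConfig` rather than here:

```python
# CPU column of the table as keyword arguments (sketch).
CPU_TRAINING_ARGS = dict(
    fp16=False, bf16=False,              # i.e. train in float32
    optim="adamw_torch",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    use_cpu=True,                        # modern spelling of no_cuda=True
)
MAX_SEQ_LENGTH = 1024                    # passed to the SFT trainer
LORA_R = 16                              # passed to peft.LoraConfig
```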

### Pipeline Integration

```
Kubeflow UI / Argo cron trigger
        │
        ▼
cpu_training_pipeline.yaml
        │
        ├── 1. fetch_pdfs_from_s3        (python:3.13-slim, boto3)
        ├── 2. prepare_training_data     (python:3.13-slim, PyMuPDF)
        ├── 3. upload_data_to_s3         (python:3.13-slim, boto3)
        ├── 4. submit_ray_training_job   (python:3.13-slim, kubernetes + boto3)
        │       └── RayJob: N workers × (4 CPU, 16Gi)
        │           Image: rayproject/ray:2.44.1-py311-cpu
        │           Deps: torch, peft, trl, transformers, accelerate, datasets
        ├── 5. download_adapter_from_s3  (python:3.13-slim, boto3)
        ├── 6. evaluate_adapter_cpu      (python:3.13-slim, torch+peft)
        ├── 7. push_adapter_to_gitea     (python:3.13-slim, requests)
        └── 8. log_training_metrics      (python:3.13-slim, mlflow)
```

### RBAC Requirements

The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-rayjob-creator
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "delete"]
```
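
The role must also be bound to the service account the pipeline steps run as. A sketch — the SA name and namespace below are assumptions; substitute the identity your KFP executor actually uses:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeflow-rayjob-creator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-rayjob-creator
subjects:
  - kind: ServiceAccount
    name: pipeline-runner   # assumed SA name
    namespace: kubeflow     # assumed namespace
```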

## DGX Spark Upgrade Path

### Device Specifications

| Spec | Value |
|---|---|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |

### Integration Plan

1. **Network**: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
2. **Kubernetes**: Join the DGX Spark as a Kubernetes worker node with taints:

   ```yaml
   taints:
     - key: nvidia.com/training
       value: "true"
       effect: NoSchedule
   labels:
     node.kubernetes.io/instance-type: dgx-spark
     accelerator: blackwell
   ```

3. **KubeRay**: Add a dedicated `training` RayCluster (separate from the inference `RayService`) or submit `RayJob` resources that request the DGX Spark's GPU. The same `TorchTrainer` pattern applies — just change `use_gpu=True` in `ScalingConfig`.
4. **Pipeline**: Create `dgx_spark_training_pipeline.py` mirroring the CPU pipeline but with:
   - `set_accelerator_type("nvidia.com/gpu")`
   - `set_gpu_limit(1)`
   - `node_selector={"accelerator": "blackwell"}`
   - BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline)
   - bf16 compute dtype
   - Larger models: 8B–70B
5. **MLflow**: Same experiment tracking; tag runs with `training_device=dgx-spark`
6. **Scheduling**: Training jobs tolerate the `nvidia.com/training` taint; inference deployments do not. This guarantees workload isolation.
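
Put together, a training pod's spec would carry the matching toleration and selector — a fragment, assuming the taint and labels from step 2:

```yaml
tolerations:
  - key: nvidia.com/training
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  accelerator: blackwell
resources:
  limits:
    nvidia.com/gpu: 1
```

Inference deployments omit the toleration, so the scheduler can never place them on the DGX Spark.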

### Model Capacity on DGX Spark

| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |
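
The rule of thumb behind the table: NF4 stores roughly 0.5 bytes per parameter, and the quantisation block constants plus optimizer/activation overhead push the listed figures a few GB higher than the raw weight size:

```python
def nf4_weights_gb(params_b: float) -> float:
    # 4-bit NF4 ≈ 0.5 bytes per parameter; overhead not included.
    return params_b * 0.5

print(nf4_weights_gb(8))    # → 4.0  (table lists ~6 GB with overhead)
print(nf4_weights_gb(70))   # → 35.0 (table lists ~40 GB with overhead)
print(nf4_weights_gb(405))  # → 202.5 — exceeds 128 GB even before overhead
```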

## Migration Strategy

```
Phase 1 (now):    Distributed CPU via RayJob → small models (≤8B),
                  ~1–5 h per run, up to 14 workers across 14 nodes
                  Pis participate for small models, at no cost

Phase 2 (DGX):    GPU pipelines on DGX Spark → large models (≤70B),
                  minutes per run, dedicated hardware

Phase 3 (hybrid): CPU for lightweight experiments + DGX for
                  production fine-tunes; both report to MLflow
                  The training cluster can also include the DGX as a
                  Ray worker alongside CPU nodes for mixed runs
```

All phases share:

- Same Kubeflow Pipelines UI
- Same S3 data source
- Same Gitea adapter repositories
- Same MLflow experiment tracking
- Same evaluation pipeline
- Same Ray Train `TorchTrainer` API

## Links

* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay Unified GPU Backend
* Related: `kubeflow/qlora_pdf_pipeline.py` — Existing GPU QLoRA pipeline
* Related: `kubeflow/cpu_training_pipeline.py` — New distributed CPU training pipeline (this ADR)
* [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/dgx/spark/) — Product page
* [Ray Train TorchTrainer](https://docs.ray.io/en/latest/train/getting-started-pytorch.html) — Distributed training docs
* [PEFT LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) — LoRA documentation
|