Training Strategy – Distributed CPU Now, DGX Spark Later
- Status: accepted
- Date: 2026-02-14
- Deciders: Billy
- Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware
Context and Problem Statement
All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:
| Node | GPU | Accelerator | Serving |
|---|---|---|---|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |
Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has 14 nodes with spare CPU and RAM that can be pooled for distributed training:
| Node | CPU | RAM | Architecture | Available for Training |
|---|---|---|---|---|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |
Cluster totals (summing the table above): ~126 CPU cores, ~364 GB RAM across 14 nodes.
Rather than training on a single node, we can distribute training across all 14 nodes using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly, until gradient-synchronisation overhead begins to dominate.
The NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first dedicated training accelerator in the cluster.
We need a training strategy that:
- Works today with existing hardware (distributed CPU)
- Scales horizontally — add more nodes or cores to reduce training time
- Extends cleanly to a DGX Spark when available
- Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
- Does not impact inference serving
Decision Drivers
- Zero GPU budget available for training today
- Spare CPU/RAM on every node (~126 cores, ~364 GB cluster-wide across 14 nodes)
- Training cadence is low (weekly/monthly, not continuous)
- Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
- Distributed training recovers time lost from being CPU-only
- Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
- DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
- Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)
Considered Options
- Distributed CPU LoRA training via Ray Train + KubeRay RayJob
- Single-node CPU training in a Kubeflow pipeline step
- Reserve a GPU fraction for training (time-share with inference)
- Offload training to cloud (Lambda Labs, RunPod, etc.)
- Wait for DGX Spark before doing any training
Decision Outcome
Chosen option: Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark
A new cpu_training_pipeline.py Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a KubeRay RayJob that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's TorchTrainer handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).
The pipeline follows an 8-step pattern:
- Fetch PDFs from Quobjects S3
- Prepare instruction-tuning dataset
- Upload prepared data to S3 (shared storage for Ray workers)
- Submit KubeRay RayJob (N distributed CPU workers)
- Download trained adapter from S3
- Sanity evaluation on CPU
- Push adapter to Gitea
- Log metrics to MLflow
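The RayJob training script in step 4 can be sketched roughly as follows, assuming Ray Train 2.x with its HuggingFace Transformers integration plus peft in the worker runtime env. Heavy imports are deferred so the module stays importable anywhere; the model name and hyperparameter values are illustrative:

```python
# Sketch of the distributed CPU LoRA training entrypoint (illustrative names).
BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"  # example small model from this ADR

def train_func(config: dict):
    """Runs on every Ray worker; Gloo AllReduce synchronises gradients."""
    import ray.train.huggingface.transformers as ray_hf
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)  # full float32 copy
    model = get_peft_model(model, LoraConfig(r=16, task_type="CAUSAL_LM"))
    args = TrainingArguments(
        output_dir="/tmp/adapter",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        use_cpu=True,                       # no CUDA devices on these workers
    )
    trainer = Trainer(model=model, args=args, train_dataset=config["dataset"])
    trainer.add_callback(ray_hf.RayTrainReportCallback())  # report metrics to Ray
    trainer = ray_hf.prepare_trainer(trainer)              # wires up DDP
    trainer.train()

def main(num_workers: int = 6, cpus_per_worker: int = 4):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            use_gpu=False,  # CPU-only: PyTorch falls back to the Gloo backend
            resources_per_worker={"CPU": cpus_per_worker},
        ),
    ).fit()
```

On the DGX Spark later, only the ScalingConfig changes (use_gpu=True); train_func stays the same.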
Positive Consequences
- Training starts immediately with zero additional hardware cost
- Scales horizontally: up to 14 nodes; Raspberry Pis can participate for small models
- GPUs remain 100% dedicated to inference — no latency impact
- Ephemeral Ray cluster: resources released immediately after training
- Adapters are small (10–100 MB) even from CPU training
- MLflow tracks all experiments regardless of compute backend
- DGX Spark upgrade is additive, not a rewrite
Negative Consequences
- CPU training is slower than GPU even when distributed
- Each worker loads a full copy of the model (data parallelism, not model parallelism)
- Limited to small models (≤8B) per worker due to memory constraints
- float32 training uses ~2× the memory of bf16/fp16
- Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources
Distributed Ray CPU Training Design
Architecture
Kubeflow Pipeline (KFP component)
│
├── Creates ConfigMap with training script
├── Creates KubeRay RayJob CR
│ │
│ ▼
│ ┌──────────┐
│ │ Ray Head │ (coordinator, 0 CPUs for training)
│ └────┬─────┘
│ │
│ ┌────┴──────────────────────────────────────────────────────┐
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼
│ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
│ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
│ │khelben ││elmins. ││danilo ││ wulfgar ││ regis ││ Pi nodes │
│ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi ││ 4c/16Gi ││ 4c/4-8Gi│
│ └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
│ │ │ │ │ │ │
│ └─────────┴─── AllReduce (Gloo) ──────────┴──────────┘
│ │
│ Adapter → S3
│
├── Polls RayJob status until SUCCEEDED
├── Downloads adapter from S3
└── Evaluate → Gitea → MLflow
How It Scales
| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |
Note: Raspberry Pi workers (4 GB RAM) can only participate in training for models that fit in ~3 GB after OS/system overhead. For ≤3B models with LoRA, this is feasible; for 8B+ models, exclude Pi nodes from the RayJob spec.
Effective batch size = per_device_batch_size × num_workers × gradient_accumulation_steps. With 6 workers, the effective batch of 1 × 6 × 16 = 96 is competitive with GPU training batches.
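As a quick check of that arithmetic:

```python
# Effective batch size under data parallelism, using the 6-worker example above.
per_device_batch_size = 1
num_workers = 6
gradient_accumulation_steps = 16

effective_batch = per_device_batch_size * num_workers * gradient_accumulation_steps
print(effective_batch)  # 96
```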
RayJob Lifecycle
- Create: KFP component creates a RayJob CR + ConfigMap (training script)
- Schedule: KubeRay operator allocates head pod + N worker pods across nodes
- Install: Workers install pip deps via runtimeEnvYAML (torch, peft, trl, etc.)
- Train: TorchTrainer runs train_func on each worker with DDP and prepare_trainer integration for HuggingFace
- Save: Rank-0 worker saves adapter + metadata to S3
- Teardown: shutdownAfterJobFinishes: true destroys all pods on completion
- TTL: RayJob CR auto-deleted after 300 seconds
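This lifecycle maps onto a RayJob spec roughly like the following trimmed sketch (the pip package list matches the pipeline's dependencies; pod templates are elided and names are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: cpu-lora-training
spec:
  entrypoint: python /home/ray/train/train.py   # mounted from the ConfigMap
  runtimeEnvYAML: |
    pip: ["torch", "peft", "trl", "transformers", "accelerate", "datasets"]
  shutdownAfterJobFinishes: true   # teardown: all pods destroyed on completion
  ttlSecondsAfterFinished: 300     # RayJob CR auto-deleted after 5 minutes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"              # head coordinates only, never trains
      template: {}                 # pod template elided
    workerGroupSpecs:
      - groupName: cpu-workers
        replicas: 6
        rayStartParams: {}
        template: {}               # pod template elided
```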
Resource Allocation
Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"
# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"
# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"
# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi" # Only for ≤3B models
# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
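In the RayJob spec these tiers become separate workerGroupSpecs; a sketch of two of them, using the standard kubernetes.io/arch node labels (group names and replica counts illustrative):

```yaml
workerGroupSpecs:
  - groupName: large-x86
    replicas: 3                        # e.g. khelben, elminster, danilo
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        containers:
          - name: ray-worker
            resources:
              limits: { cpu: "8", memory: 32Gi }
  - groupName: pi-workers              # include only for <=3B models
    replicas: 5
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64    # requires the multi-arch Ray image
        containers:
          - name: ray-worker
            resources:
              limits: { cpu: "2", memory: 3Gi }
```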
Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
Data Flow via S3
Training data is shared between KFP and Ray workers through S3:
KFP: prepare_data → /tmp/train.json ───upload──→ s3://training-data/ray-training-runs/{run_id}/data/
│
Ray Worker 1: ←──download──────────────────────────────┤
Ray Worker 2: ←──download──────────────────────────────┤
Ray Worker N: ←──download──────────────────────────────┘
Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
│
KFP: download_adapter ←──────────────────────────────────┘
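The S3 handoff can be sketched as a pair of key-builders plus thin boto3 wrappers, assuming the bucket layout shown above; endpoint and credentials come from the cluster's usual S3 configuration (not shown), and boto3 is imported lazily so the helpers stay importable without it:

```python
# Sketch of the KFP <-> Ray worker S3 handoff (bucket layout from this ADR).
BUCKET = "training-data"

def data_key(run_id: str) -> str:
    """Where KFP uploads the prepared dataset and every worker downloads it."""
    return f"ray-training-runs/{run_id}/data/train.json"

def adapter_prefix(run_id: str) -> str:
    """Where the rank-0 worker uploads the trained adapter."""
    return f"ray-training-runs/{run_id}/adapter/"

def upload(local_path: str, key: str) -> None:
    import boto3  # present in both the KFP step images and Ray workers
    boto3.client("s3").upload_file(local_path, BUCKET, key)

def download(key: str, local_path: str) -> None:
    import boto3
    boto3.client("s3").download_file(BUCKET, key, local_path)
```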
Model Selection for CPU Training
| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |
Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?
Short answer: not with data parallelism alone; it requires model parallelism (e.g. DeepSpeed ZeRO-3 or FSDP).
With the current data-parallel design (Ray Train TorchTrainer), every worker loads a full copy of the model. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.
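The weight-memory arithmetic behind that claim:

```python
# Why data parallelism fails for 70B: each worker must hold the full weights.
params = 70e9
bytes_per_param_nf4 = 0.5                    # 4-bit NF4 quantisation
weights_gb = params * bytes_per_param_nf4 / 1e9
print(weights_gb)  # 35.0 -- quantisation constants, activations and optimizer
                   # state push the per-worker total toward the ~40 GB+ cited here
```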
| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|---|---|---|---|
| Data parallelism (current) | Each worker loads full model | No — need ≥48 GB per worker | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | Possible — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | Possible — PyTorch native | Same node constraints |
Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):
| Node | RAM | Role in ZeRO-3 Shard |
|---|---|---|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| Total | ~276 GB | Sufficient for 70B NF4 + optimizer |
A 70B model at 4-bit with LoRA optimizer states needs roughly ~60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but introduces significant complexity:
- DeepSpeed ZeRO-3 or FSDP replaces simple DDP data parallelism
- Network bandwidth becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
- Training time would be measured in days, not hours
- Raspberry Pis are excluded — too little RAM for any shard of a 70B model
- 10GbE or InfiniBand would be strongly recommended for tolerable training speed
Recommendation: 70B training is technically possible with model parallelism across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for ≤8B models with data parallelism, which works well across all node types.
Training Configuration (CPU-optimised)
| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| dtype | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| optim | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| batch_size | 1 | 2 | Memory constrained per worker |
| gradient_accumulation | 16 | 8 | Compensate for small batch |
| max_seq_length | 1024 | 2048 | Halved to fit in RAM |
| lora_r | 16 | 64 | Fewer params, faster |
| gradient_checkpointing | true | false | Trade compute for memory |
| no_cuda | true | false | Explicit CPU-only |
| backend | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
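The CPU column translates into roughly these HuggingFace trainer kwargs (a sketch; exact argument names vary across transformers/trl versions, e.g. newer transformers releases spell no_cuda as use_cpu, and max_seq_length belongs to the trl SFT config rather than TrainingArguments):

```python
# CPU-optimised hyperparameters from the table above, as plain kwargs.
CPU_TRAINING_ARGS = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_seq_length=1024,          # trl SFT-style argument
    optim="adamw_torch",
    gradient_checkpointing=True,
    no_cuda=True,                 # use_cpu=True in newer transformers
    fp16=False,
    bf16=False,                   # i.e. train in full float32
)
LORA_R = 16                       # vs r=64 on GPU
```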
Pipeline Integration
Kubeflow UI / Argo cron trigger
│
▼
cpu_training_pipeline.yaml
│
├── 1. fetch_pdfs_from_s3 (python:3.13-slim, boto3)
├── 2. prepare_training_data (python:3.13-slim, PyMuPDF)
├── 3. upload_data_to_s3 (python:3.13-slim, boto3)
├── 4. submit_ray_training_job (python:3.13-slim, kubernetes + boto3)
│ └── RayJob: N workers × (4 CPU, 16Gi)
│ Image: rayproject/ray:2.44.1-py311-cpu
│ Deps: torch, peft, trl, transformers, accelerate, datasets
├── 5. download_adapter_from_s3 (python:3.13-slim, boto3)
├── 6. evaluate_adapter_cpu (python:3.13-slim, torch+peft)
├── 7. push_adapter_to_gitea (python:3.13-slim, requests)
└── 8. log_training_metrics (python:3.13-slim, mlflow)
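Step 4's "poll RayJob status until SUCCEEDED" behaviour can be sketched with the Kubernetes Python client; the terminal states follow the RayJob status.jobStatus field, and the kubernetes import is deferred so the helper logic stays testable outside the cluster (namespace and timings are illustrative):

```python
# Sketch of the KFP component's RayJob polling loop.
import time

TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED"}

def is_terminal(job_status: str) -> bool:
    return job_status in TERMINAL

def wait_for_rayjob(name: str, namespace: str = "kubeflow", poll_s: int = 30):
    from kubernetes import client, config  # lazy: only needed in-cluster
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    while True:
        job = api.get_namespaced_custom_object(
            group="ray.io", version="v1", namespace=namespace,
            plural="rayjobs", name=name)
        status = job.get("status", {}).get("jobStatus", "")
        if is_terminal(status):
            if status != "SUCCEEDED":
                raise RuntimeError(f"RayJob {name} ended with status {status}")
            return
        time.sleep(poll_s)
```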
RBAC Requirements
The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-rayjob-creator
rules:
- apiGroups: ["ray.io"]
resources: ["rayjobs"]
verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create", "delete"]
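The ClusterRole above also needs a binding to the pipeline's service account (the SA name below is illustrative; KFP installations commonly use pipeline-runner):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeflow-rayjob-creator
subjects:
  - kind: ServiceAccount
    name: pipeline-runner      # adjust to the SA the pipeline actually runs as
    namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-rayjob-creator
```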
DGX Spark Upgrade Path
Device Specifications
| Spec | Value |
|---|---|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |
Integration Plan
- Network: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
- Kubernetes: Join the DGX Spark as a Kubernetes worker node with taints:
  taints:
    - key: nvidia.com/training
      value: "true"
      effect: NoSchedule
  labels:
    node.kubernetes.io/instance-type: dgx-spark
    accelerator: blackwell
- KubeRay: Add a dedicated training RayCluster (separate from the inference RayService) or submit RayJob resources that request the DGX Spark's GPU. The same TorchTrainer pattern applies — just change use_gpu=True in ScalingConfig.
- Pipeline: Create dgx_spark_training_pipeline.py mirroring the CPU pipeline but with:
  - set_accelerator_type("nvidia.com/gpu")
  - set_gpu_limit(1)
  - node_selector={"accelerator": "blackwell"}
  - BitsAndBytesConfig 4-bit NF4 quantisation (like the existing QLoRA pipeline)
  - bf16 compute dtype
  - Larger models: 8B–70B
- MLflow: Same experiment tracking; tag runs with training_device=dgx-spark
- Scheduling: Training jobs tolerate the nvidia.com/training taint; inference deployments do not, guaranteeing workload isolation.
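On the pod side, the scheduling split amounts to the following fragment on training pod specs only; inference deployments simply omit the toleration and are repelled by the taint:

```yaml
# Training pods only: tolerate the DGX taint and pin to the node.
spec:
  nodeSelector:
    accelerator: blackwell
  tolerations:
    - key: nvidia.com/training
      operator: Equal
      value: "true"
      effect: NoSchedule
```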
Model Capacity on DGX Spark
| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |
Migration Strategy
Phase 1 (now): Distributed CPU via RayJob → small models (≤8B), ~1–5 h per run, up to 14 workers across 14 nodes; Pis participate for ≤3B models; zero hardware cost
Phase 2 (DGX): GPU pipelines on DGX Spark → large models (≤70B), minutes per run, dedicated hardware
Phase 3 (hybrid): CPU for lightweight experiments + DGX for production fine-tunes; both report to MLflow. The training cluster can also include the DGX as a Ray worker alongside CPU nodes for mixed runs.
All phases share:
- Same Kubeflow Pipelines UI
- Same S3 data source
- Same Gitea adapter repositories
- Same MLflow experiment tracking
- Same evaluation pipeline
- Same Ray Train TorchTrainer API
Links
- Related: ADR-0054 — Kubeflow Pipeline CI/CD
- Related: ADR-0011 — KubeRay Unified GPU Backend
- Related: kubeflow/qlora_pdf_pipeline.py — Existing GPU QLoRA pipeline
- Related: kubeflow/cpu_training_pipeline.py — New distributed CPU training pipeline (this ADR)
- NVIDIA DGX Spark — Product page
- Ray Train TorchTrainer — Distributed training docs
- PEFT LoRA — LoRA documentation