
Training Strategy: Distributed CPU Now, DGX Spark Later

  • Status: accepted
  • Date: 2026-02-14
  • Deciders: Billy
  • Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware

Context and Problem Statement

All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:

| Node | GPU | Accelerator | Serving |
|------|-----|-------------|---------|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |

Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has 14 nodes with spare CPU and RAM that can be pooled for distributed training:

| Node | CPU | RAM | Architecture | Available for Training |
|------|-----|-----|--------------|------------------------|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |

Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.

Rather than training on a single node, we can distribute training across all 14 nodes using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly.

The NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first dedicated training accelerator in the cluster.

We need a training strategy that:

  1. Works today with existing hardware (distributed CPU)
  2. Scales horizontally — add more nodes or cores to reduce training time
  3. Extends cleanly to a DGX Spark when available
  4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
  5. Does not impact inference serving

Decision Drivers

  • Zero GPU budget available for training today
  • Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes)
  • Training cadence is low (weekly/monthly, not continuous)
  • Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
  • Distributed training recovers time lost from being CPU-only
  • Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
  • DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
  • Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)

Considered Options

  1. Distributed CPU LoRA training via Ray Train + KubeRay RayJob
  2. Single-node CPU training in a Kubeflow pipeline step
  3. Reserve a GPU fraction for training (time-share with inference)
  4. Offload training to cloud (Lambda Labs, RunPod, etc.)
  5. Wait for DGX Spark before doing any training

Decision Outcome

Chosen option: Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark

A new cpu_training_pipeline.py Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a KubeRay RayJob that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's TorchTrainer handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).

The pipeline follows an 8-step pattern:

  1. Fetch PDFs from Quobjects S3
  2. Prepare instruction-tuning dataset
  3. Upload prepared data to S3 (shared storage for Ray workers)
  4. Submit KubeRay RayJob (N distributed CPU workers)
  5. Download trained adapter from S3
  6. Sanity evaluation on CPU
  7. Push adapter to Gitea
  8. Log metrics to MLflow

Positive Consequences

  • Training starts immediately with zero additional hardware cost
  • Scales horizontally: up to 14 nodes; Raspberry Pis can participate for small models
  • GPUs remain 100% dedicated to inference — no latency impact
  • Ephemeral Ray cluster: resources released immediately after training
  • Adapters are small (10–100 MB) even from CPU training
  • MLflow tracks all experiments regardless of compute backend
  • DGX Spark upgrade is additive, not a rewrite

Negative Consequences

  • CPU training is slower than GPU even when distributed
  • Each worker loads a full copy of the model (data parallelism, not model parallelism)
  • Limited to small models (≤8B) per worker due to memory constraints
  • float32 training uses ~2× the memory of bf16/fp16
  • Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources

Distributed Ray CPU Training Design

Architecture

Kubeflow Pipeline (KFP component)
       │
       ├── Creates ConfigMap with training script
       ├── Creates KubeRay RayJob CR
       │         │
       │         ▼
       │    ┌──────────┐
       │    │ Ray Head  │  (coordinator, 0 CPUs for training)
       │    └────┬─────┘
       │         │
       │    ┌────┴──────────────────────────────────────────────────────┐
       │    │         │           │           │            │            │
       │    ▼         ▼           ▼           ▼            ▼            ▼
       │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
       │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
       │ │khelben ││elmins. ││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
       │ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi│
       │ └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
       │         │         │         │          │           │          │
       │         └─────────┴─── AllReduce (Gloo) ──────────┴──────────┘
       │                              │
       │                       Adapter → S3
       │
       ├── Polls RayJob status until SUCCEEDED
       ├── Downloads adapter from S3
       └── Evaluate → Gitea → MLflow
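The "polls RayJob status until SUCCEEDED" step above is, at its core, a status-polling loop. A minimal sketch, with `wait_for_rayjob` and the injected `get_status` callback as hypothetical names (the real KFP component would read `.status.jobStatus` from the RayJob CR via the Kubernetes API):

```python
import time

def wait_for_rayjob(get_status, timeout_s=3600, poll_s=5):
    """Poll a RayJob's status until SUCCEEDED, raising on FAILED or timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()  # e.g. reads .status.jobStatus from the RayJob CR
        if status == "SUCCEEDED":
            return status
        if status == "FAILED":
            raise RuntimeError("RayJob failed")
        time.sleep(poll_s)
    raise TimeoutError("RayJob did not finish in time")
```

Injecting `get_status` keeps the loop testable without a live cluster; the timeout guards against a RayJob that never reaches a terminal state.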

How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---------|------------|------------|---------------------|-------|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 11 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

Note: Raspberry Pi workers (4 GB RAM) can only participate in training for models that fit in ~3 GB after OS/system overhead. For ≤3B models with LoRA, this is feasible; for 8B+ models, exclude Pi nodes from the RayJob spec.

Effective batch size = per_device_batch_size × num_workers × gradient_accumulation_steps. With 6 workers, the effective batch of 1 × 6 × 16 = 96 is competitive with GPU training batches.
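The formula is easy to sanity-check in code (`effective_batch_size` is a hypothetical helper, not part of the pipeline):

```python
def effective_batch_size(per_device_batch: int, num_workers: int,
                         grad_accum_steps: int) -> int:
    """Global batch seen by the optimiser per update in data-parallel training."""
    return per_device_batch * num_workers * grad_accum_steps

# The 6-worker configuration from the text: 1 x 6 x 16 = 96
assert effective_batch_size(1, 6, 16) == 96
```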

RayJob Lifecycle

  1. Create: KFP component creates a RayJob CR + ConfigMap (training script)
  2. Schedule: KubeRay operator allocates head pod + N worker pods across nodes
  3. Install: Workers install pip deps via runtimeEnvYAML (torch, peft, trl, etc.)
  4. Train: TorchTrainer runs train_func on each worker under DDP, with prepare_trainer wiring the HuggingFace Trainer into Ray Train
  5. Save: Rank-0 worker saves adapter + metadata to S3
  6. Teardown: shutdownAfterJobFinishes: true destroys all pods on completion
  7. TTL: RayJob CR auto-deleted after 300 seconds
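The lifecycle steps map onto fields of the RayJob custom resource. A hedged sketch of what the KFP component might assemble before submitting it via the Kubernetes API (field names follow the KubeRay RayJob schema; `build_rayjob` and the resource values are illustrative, not the pipeline's exact spec):

```python
def build_rayjob(run_id: str, num_workers: int,
                 image: str = "rayproject/ray:2.44.1-py311-cpu") -> dict:
    """Assemble a RayJob CR body mirroring the lifecycle above."""
    worker_pod = {
        "spec": {"containers": [{"name": "ray-worker", "image": image,
                                 "resources": {"limits": {"cpu": "4",
                                                          "memory": "16Gi"}}}]}
    }
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": f"cpu-train-{run_id}"},
        "spec": {
            "entrypoint": "python /home/ray/train.py",       # mounted from ConfigMap
            "shutdownAfterJobFinishes": True,                # step 6: teardown
            "ttlSecondsAfterFinished": 300,                  # step 7: CR auto-delete
            "runtimeEnvYAML": "pip:\n  - torch\n  - peft\n  - trl",  # step 3
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {"num-cpus": "0"},     # head coordinates only
                    "template": {"spec": {"containers": [
                        {"name": "ray-head", "image": image}]}},
                },
                "workerGroupSpecs": [{
                    "groupName": "cpu-workers",
                    "replicas": num_workers,                 # step 2: N worker pods
                    "rayStartParams": {},
                    "template": worker_pod,
                }],
            },
        },
    }
```

In the real component this dict would be passed to the Kubernetes custom-objects API; building it as plain data keeps it easy to inspect and test.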

Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

# Large x86 workers (khelben, elminster, danilo)
cpu_limit:      "8"
memory_limit:   "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit:      "4"
memory_limit:   "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit:      "2"
memory_limit:   "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit:      "2"
memory_limit:   "3Gi"   # Only for ≤3B models

# Ray head (coordinator only, no training)
cpu_limit:      "2"
memory_limit:   "4Gi"

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
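The per-class limits above can be selected programmatically. A minimal sketch, assuming the node classes and the "Pis only for ≤3B models" rule from this ADR (`worker_limits` and the class names are hypothetical):

```python
# Per-class resource limits mirroring the block above.
WORKER_SPECS = {
    "large_x86":  {"cpu": "8", "memory": "32Gi"},  # khelben, elminster, danilo
    "medium_x86": {"cpu": "4", "memory": "16Gi"},  # drizzt, wulfgar
    "small_x86":  {"cpu": "2", "memory": "8Gi"},   # regis, control plane
    "pi":         {"cpu": "2", "memory": "3Gi"},   # arm64, <=3B models only
}

def worker_limits(node_class: str, model_params_b: float) -> dict:
    """Return resource limits for a worker, refusing Pis for models over 3B."""
    if node_class == "pi" and model_params_b > 3:
        raise ValueError("Raspberry Pi workers are limited to <=3B models")
    return WORKER_SPECS[node_class]
```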

Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

KFP: prepare_data → /tmp/train.json ───upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                      │
Ray Worker 1: ←──download──────────────────────────────┤
Ray Worker 2: ←──download──────────────────────────────┤
Ray Worker N: ←──download──────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                      │
KFP: download_adapter ←──────────────────────────────────┘
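The run-scoped layout can be captured in two small helpers (function names are hypothetical; the bucket and prefixes are taken from the diagram above):

```python
BUCKET = "training-data"

def data_prefix(run_id: str) -> str:
    """Where KFP uploads prepared data and every Ray worker downloads it."""
    return f"s3://{BUCKET}/ray-training-runs/{run_id}/data/"

def adapter_prefix(run_id: str) -> str:
    """Where the rank-0 worker uploads the trained adapter for KFP to fetch."""
    return f"s3://{BUCKET}/ray-training-runs/{run_id}/adapter/"
```

Keying everything by run_id means concurrent training runs never collide in S3.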

Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|-------|-----------|---------------|-----------------------|-------------------------------|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |

Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

Short answer: Not with data parallelism alone; requires model parallelism (DeepSpeed ZeRO-3).

With the current data-parallel design (Ray Train TorchTrainer), every worker loads a full copy of the model. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.
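The ~40 GB figure follows from simple arithmetic: NF4 stores roughly 0.5 bytes per parameter, plus quantisation constants and buffers. A sketch, where the 10% overhead factor is a rough assumption rather than a measured value:

```python
def nf4_weight_gb(params_billion: float, overhead: float = 1.1) -> float:
    """Approximate weight memory for 4-bit NF4: 0.5 bytes per parameter,
    scaled by an assumed overhead factor for quantisation constants."""
    return params_billion * 0.5 * overhead

# 70B at 4-bit: 35 GB of raw weights, approaching ~40 GB with overhead,
# before optimizer states and activations are counted.
```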

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No (needs ≥48 GB per worker) | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | Possible (~40 GB split across N workers) | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | Possible (PyTorch native) | Same node constraints |

Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|----------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| Total | ~276 GB | Sufficient for 70B NF4 + optimizer |

A 70B model at 4-bit with LoRA optimizer states needs roughly ~60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but introduces significant complexity:

  1. DeepSpeed ZeRO-3 or FSDP replaces simple DDP data parallelism
  2. Network bandwidth becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
  3. Training time would be measured in days, not hours
  4. Raspberry Pis are excluded — too little RAM for any shard of a 70B model
  5. 10GbE or InfiniBand would be strongly recommended for tolerable training speed
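The pooled-RAM arithmetic can be sanity-checked in a few lines. A sketch using the shard-host table above; `zero3_feasible` is hypothetical, and the 20% per-node reserve for OS and runtime overhead is an assumption:

```python
# RAM (GB) of the x86 nodes eligible to host ZeRO-3 shards, per the table.
SHARD_HOSTS_GB = {"khelben": 94, "elminster": 62, "danilo": 62,
                  "wulfgar": 31, "drizzt": 27}

def zero3_feasible(required_gb: float, hosts: dict = SHARD_HOSTS_GB,
                   reserve_frac: float = 0.2) -> bool:
    """Check whether pooled RAM covers the sharded model, holding back a
    fraction of each node for OS and runtime (reserve_frac is assumed)."""
    usable = sum(gb * (1 - reserve_frac) for gb in hosts.values())
    return usable >= required_gb

# ~276 GB raw, ~220 GB usable: comfortably above the ~60-80 GB estimate.
```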

Recommendation: 70B training is technically possible with model parallelism across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for ≤8B models with data parallelism, which works well across all node types.

Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|-----------|-----------|-------------------|-----------|
| dtype | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| optim | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| batch_size | 1 | 2 | Memory constrained per worker |
| gradient_accumulation | 16 | 8 | Compensate for small batch |
| max_seq_length | 1024 | 2048 | Halved to fit in RAM |
| lora_r | 16 | 64 | Fewer params, faster |
| gradient_checkpointing | true | false | Trade compute for memory |
| no_cuda | true | false | Explicit CPU-only |
| backend | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
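The CPU column translates directly into HuggingFace-style TrainingArguments keywords. A hedged sketch, not a guaranteed working recipe; the keyword names follow the transformers API, and `use_cpu` is the newer spelling of `no_cuda`:

```python
# CPU training configuration mirroring the table above.
CPU_TRAINING_ARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "optim": "adamw_torch",
    "gradient_checkpointing": True,
    "use_cpu": True,                 # explicit CPU-only (formerly no_cuda)
    "bf16": False, "fp16": False,    # stay in float32 on CPU
}
LORA_R = 16          # LoRA rank for the PEFT config
MAX_SEQ_LENGTH = 1024  # halved vs GPU to fit in RAM
```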

Pipeline Integration

Kubeflow UI / Argo cron trigger
       │
       ▼
 cpu_training_pipeline.yaml
       │
       ├── 1. fetch_pdfs_from_s3         (python:3.13-slim, boto3)
       ├── 2. prepare_training_data      (python:3.13-slim, PyMuPDF)
       ├── 3. upload_data_to_s3          (python:3.13-slim, boto3)
       ├── 4. submit_ray_training_job    (python:3.13-slim, kubernetes + boto3)
       │       └── RayJob: N workers × (4 CPU, 16Gi)
       │           Image: rayproject/ray:2.44.1-py311-cpu
       │           Deps:  torch, peft, trl, transformers, accelerate, datasets
       ├── 5. download_adapter_from_s3   (python:3.13-slim, boto3)
       ├── 6. evaluate_adapter_cpu       (python:3.13-slim, torch+peft)
       ├── 7. push_adapter_to_gitea      (python:3.13-slim, requests)
       └── 8. log_training_metrics       (python:3.13-slim, mlflow)

RBAC Requirements

The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-rayjob-creator
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "delete"]

DGX Spark Upgrade Path

Device Specifications

| Spec | Value |
|------|-------|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |

Integration Plan

  1. Network: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
  2. Kubernetes: Join the DGX Spark as a Kubernetes worker node with taints:
    taints:
      - key: nvidia.com/training
        value: "true"
        effect: NoSchedule
    labels:
      node.kubernetes.io/instance-type: dgx-spark
      accelerator: blackwell
    
  3. KubeRay: Add a dedicated training RayCluster (separate from the inference RayService) or submit RayJob resources that request the DGX Spark's GPU. The same TorchTrainer pattern applies — just change use_gpu=True in ScalingConfig.
  4. Pipeline: Create dgx_spark_training_pipeline.py mirroring the CPU pipeline but with:
    • set_accelerator_type("nvidia.com/gpu")
    • set_gpu_limit(1)
    • node_selector={"accelerator": "blackwell"}
    • BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline)
    • bf16 compute dtype
    • Larger models: 8B–70B
  5. MLflow: Same experiment tracking; tag runs with training_device=dgx-spark
  6. Scheduling: Training jobs tolerate the nvidia.com/training taint; inference deployments do not. This guarantees workload isolation.
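The isolation in step 6 is the standard taint/toleration pattern. A sketch of the pod-spec fragment a training job might carry (inference pods simply omit the toleration and are repelled by the taint defined in step 2):

```yaml
# Added to training pod specs so they may schedule onto the tainted
# DGX Spark node; inference deployments omit this block entirely.
tolerations:
  - key: nvidia.com/training
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  accelerator: blackwell
```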

Model Capacity on DGX Spark

| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|-------|-----------|--------------------|---------------|-------|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |

Migration Strategy

Phase 1 (now):     Distributed CPU via RayJob → small models (≤8B),
                   ~1–5 h per run, up to 11 workers across 14 nodes
                   Pis participate for ≤3B models, free

Phase 2 (DGX):     GPU pipelines on DGX Spark → large models (≤70B),
                   minutes per run, dedicated hardware

Phase 3 (hybrid):  CPU for lightweight experiments + DGX for
                   production fine-tunes; both report to MLflow
                   Training cluster can also include DGX as a
                   Ray worker alongside CPU nodes for mixed runs

All phases share:

  • Same Kubeflow Pipelines UI
  • Same S3 data source
  • Same Gitea adapter repositories
  • Same MLflow experiment tracking
  • Same evaluation pipeline
  • Same Ray Train TorchTrainer API
  • Related: ADR-0054 — Kubeflow Pipeline CI/CD
  • Related: ADR-0011 — KubeRay Unified GPU Backend
  • Related: kubeflow/qlora_pdf_pipeline.py — Existing GPU QLoRA pipeline
  • Related: kubeflow/cpu_training_pipeline.py — New distributed CPU training pipeline (this ADR)
  • NVIDIA DGX Spark — Product page
  • Ray Train TorchTrainer — Distributed training docs
  • PEFT LoRA — LoRA documentation