Training Strategy – Distributed CPU Now, DGX Spark Later
- Status: accepted
- Date: 2026-02-14
- Deciders: Billy
- Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware
Context and Problem Statement
All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:
| Node | GPU | Accelerator | Serving |
|---|---|---|---|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |
Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has 14 nodes with spare CPU and RAM that can be pooled for distributed training:
| Node | CPU | RAM | Architecture | Available for Training |
|---|---|---|---|---|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |
Cluster totals (summing the table above): ~126 CPU cores, ~364 GB RAM across 14 nodes.
Rather than training on a single node, we can distribute training across all 14 nodes using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly, until gradient-synchronisation overhead begins to dominate.
The NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first dedicated training accelerator in the cluster.
We need a training strategy that:
- Works today with existing hardware (distributed CPU)
- Scales horizontally — add more nodes or cores to reduce training time
- Extends cleanly to a DGX Spark when available
- Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
- Does not impact inference serving
Decision Drivers
- Zero GPU budget available for training today
- Spare CPU/RAM on every node (~126 cores, ~364 GB cluster-wide across 14 nodes)
- Training cadence is low (weekly/monthly, not continuous)
- Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
- Distributed training recovers time lost from being CPU-only
- Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
- DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
- Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)
Considered Options
- Distributed CPU LoRA training via Ray Train + KubeRay RayJob
- Single-node CPU training in a Kubeflow pipeline step
- Reserve a GPU fraction for training (time-share with inference)
- Offload training to cloud (Lambda Labs, RunPod, etc.)
- Wait for DGX Spark before doing any training
Decision Outcome
Chosen option: Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark
A new cpu_training_pipeline.py Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a KubeRay RayJob that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's TorchTrainer handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).
The pipeline follows an 8-step pattern:
- Fetch PDFs from Quobjects S3
- Prepare instruction-tuning dataset
- Upload prepared data to S3 (shared storage for Ray workers)
- Submit KubeRay RayJob (N distributed CPU workers)
- Download trained adapter from S3
- Sanity evaluation on CPU
- Push adapter to Gitea
- Log metrics to MLflow
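The RayJob training script in step 4 can be sketched roughly as follows, assuming Ray Train 2.x with its HuggingFace Transformers integration plus peft in the worker runtime env. Heavy imports are deferred so the module stays importable anywhere; the model name and hyperparameter values are illustrative:

```python
# Sketch of the distributed CPU LoRA training entrypoint (illustrative names).
BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"  # example small model from this ADR

def train_func(config: dict):
    """Runs on every Ray worker; Gloo AllReduce synchronises gradients."""
    import ray.train.huggingface.transformers as ray_hf
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)  # full float32 copy
    model = get_peft_model(model, LoraConfig(r=16, task_type="CAUSAL_LM"))
    args = TrainingArguments(
        output_dir="/tmp/adapter",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        use_cpu=True,                       # no CUDA devices on these workers
    )
    trainer = Trainer(model=model, args=args, train_dataset=config["dataset"])
    trainer.add_callback(ray_hf.RayTrainReportCallback())  # report metrics to Ray
    trainer = ray_hf.prepare_trainer(trainer)              # wires up DDP
    trainer.train()

def main(num_workers: int = 6, cpus_per_worker: int = 4):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=num_workers,
            use_gpu=False,  # CPU-only: PyTorch falls back to the Gloo backend
            resources_per_worker={"CPU": cpus_per_worker},
        ),
    ).fit()
```

On the DGX Spark later, only the ScalingConfig changes (use_gpu=True); train_func stays the same.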
Positive Consequences
- Training starts immediately with zero additional hardware cost
- Scales horizontally: up to 14 nodes; Raspberry Pis can participate for small models
- GPUs remain 100% dedicated to inference — no latency impact
- Ephemeral Ray cluster: resources released immediately after training
- Adapters are small (10–100 MB) even from CPU training
- MLflow tracks all experiments regardless of compute backend
- DGX Spark upgrade is additive, not a rewrite
Negative Consequences
- CPU training is slower than GPU even when distributed
- Each worker loads a full copy of the model (data parallelism, not model parallelism)
- Limited to small models (≤8B) per worker due to memory constraints
- float32 training uses ~2× the memory of bf16/fp16
- Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources
Distributed Ray CPU Training Design
Architecture
Kubeflow Pipeline (KFP component)
│
├── Creates ConfigMap with training script
├── Creates KubeRay RayJob CR
│ │
│ ▼
│ ┌──────────┐
│ │ Ray Head │ (coordinator, 0 CPUs for training)
│ └────┬─────┘
│ │
│ ┌────┴──────────────────────────────────────────────────────┐
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼
│ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
│ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
│ │khelben ││elmins. ││danilo ││ wulfgar ││ regis ││ Pi nodes │
│ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi ││ 4c/16Gi ││ 4c/4-8Gi│
│ └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
│ │ │ │ │ │ │
│ └─────────┴─── AllReduce (Gloo) ──────────┴──────────┘
│ │
│ Adapter → S3
│
├── Polls RayJob status until SUCCEEDED
├── Downloads adapter from S3
└── Evaluate → Gitea → MLflow
How It Scales
| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |
Note: Raspberry Pi workers (4 GB RAM) can only participate in training for models that fit in ~3 GB after OS/system overhead. For ≤3B models with LoRA, this is feasible; for 8B+ models, exclude Pi nodes from the RayJob spec.
Effective batch size = per_device_batch_size × num_workers × gradient_accumulation_steps. With 6 workers, the effective batch of 1 × 6 × 16 = 96 is competitive with GPU training batches.
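As a quick check of that arithmetic:

```python
# Effective batch size under data parallelism, using the 6-worker example above.
per_device_batch_size = 1
num_workers = 6
gradient_accumulation_steps = 16

effective_batch = per_device_batch_size * num_workers * gradient_accumulation_steps
print(effective_batch)  # 96
```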
RayJob Lifecycle
- Create: KFP component creates a RayJob CR + ConfigMap (training script)
- Schedule: KubeRay operator allocates head pod + N worker pods across nodes
- Install: Workers install pip deps via runtimeEnvYAML (torch, peft, trl, etc.)
- Train: TorchTrainer runs train_func on each worker with DDP and prepare_trainer integration for HuggingFace
- Save: Rank-0 worker saves adapter + metadata to S3
- Teardown: shutdownAfterJobFinishes: true destroys all pods on completion
- TTL: RayJob CR auto-deleted after 300 seconds
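This lifecycle maps onto a RayJob spec roughly like the following trimmed sketch (the pip package list matches the pipeline's dependencies; pod templates are elided and names are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: cpu-lora-training
spec:
  entrypoint: python /home/ray/train/train.py   # mounted from the ConfigMap
  runtimeEnvYAML: |
    pip: ["torch", "peft", "trl", "transformers", "accelerate", "datasets"]
  shutdownAfterJobFinishes: true   # teardown: all pods destroyed on completion
  ttlSecondsAfterFinished: 300     # RayJob CR auto-deleted after 5 minutes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"              # head coordinates only, never trains
      template: {}                 # pod template elided
    workerGroupSpecs:
      - groupName: cpu-workers
        replicas: 6
        rayStartParams: {}
        template: {}               # pod template elided
```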
Resource Allocation
Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"
# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"
# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"
# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi" # Only for ≤3B models
# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
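In the RayJob spec these tiers become separate workerGroupSpecs; a sketch of two of them, using the standard kubernetes.io/arch node labels (group names and replica counts illustrative):

```yaml
workerGroupSpecs:
  - groupName: large-x86
    replicas: 3                        # e.g. khelben, elminster, danilo
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: amd64
        containers:
          - name: ray-worker
            resources:
              limits: { cpu: "8", memory: 32Gi }
  - groupName: pi-workers              # include only for <=3B models
    replicas: 5
    template:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64    # requires the multi-arch Ray image
        containers:
          - name: ray-worker
            resources:
              limits: { cpu: "2", memory: 3Gi }
```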
Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
Data Flow via S3
Training data is shared between KFP and Ray workers through S3:
KFP: prepare_data → /tmp/train.json ───upload──→ s3://training-data/ray-training-runs/{run_id}/data/
│
Ray Worker 1: ←──download──────────────────────────────┤
Ray Worker 2: ←──download──────────────────────────────┤
Ray Worker N: ←──download──────────────────────────────┘
Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
│
KFP: download_adapter ←──────────────────────────────────┘
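The S3 handoff can be sketched as a pair of key-builders plus thin boto3 wrappers, assuming the bucket layout shown above; endpoint and credentials come from the cluster's usual S3 configuration (not shown), and boto3 is imported lazily so the helpers stay importable without it:

```python
# Sketch of the KFP <-> Ray worker S3 handoff (bucket layout from this ADR).
BUCKET = "training-data"

def data_key(run_id: str) -> str:
    """Where KFP uploads the prepared dataset and every worker downloads it."""
    return f"ray-training-runs/{run_id}/data/train.json"

def adapter_prefix(run_id: str) -> str:
    """Where the rank-0 worker uploads the trained adapter."""
    return f"ray-training-runs/{run_id}/adapter/"

def upload(local_path: str, key: str) -> None:
    import boto3  # present in both the KFP step images and Ray workers
    boto3.client("s3").upload_file(local_path, BUCKET, key)

def download(key: str, local_path: str) -> None:
    import boto3
    boto3.client("s3").download_file(BUCKET, key, local_path)
```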
Model Selection for CPU Training
| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |
Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?
Short answer: not with data parallelism alone; it requires model parallelism (e.g. DeepSpeed ZeRO-3 or FSDP).
With the current data-parallel design (Ray Train TorchTrainer), every worker loads a full copy of the model. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.
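The weight-memory arithmetic behind that claim:

```python
# Why data parallelism fails for 70B: each worker must hold the full weights.
params = 70e9
bytes_per_param_nf4 = 0.5                    # 4-bit NF4 quantisation
weights_gb = params * bytes_per_param_nf4 / 1e9
print(weights_gb)  # 35.0 -- quantisation constants, activations and optimizer
                   # state push the per-worker total toward the ~40 GB+ cited here
```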
| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|---|---|---|---|
| Data parallelism (current) | Each worker loads full model | No — need ≥48 GB per worker | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | Possible — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | Possible — PyTorch native | Same node constraints |
Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):
| Node | RAM | Role in ZeRO-3 Shard |
|---|---|---|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| Total | ~276 GB | Sufficient for 70B NF4 + optimizer |
A 70B model at 4-bit with LoRA optimizer states needs roughly ~60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but introduces significant complexity:
- DeepSpeed ZeRO-3 or FSDP replaces simple DDP data parallelism
- Network bandwidth becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
- Training time would be measured in days, not hours
- Raspberry Pis are excluded — too little RAM for any shard of a 70B model
- 10GbE or InfiniBand would be strongly recommended for tolerable training speed
Recommendation: 70B training is technically possible with model parallelism across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for ≤8B models with data parallelism, which works well across all node types.
Training Configuration (CPU-optimised)
| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| dtype | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| optim | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| batch_size | 1 | 2 | Memory constrained per worker |
| gradient_accumulation | 16 | 8 | Compensate for small batch |
| max_seq_length | 1024 | 2048 | Halved to fit in RAM |
| lora_r | 16 | 64 | Fewer params, faster |
| gradient_checkpointing | true | false | Trade compute for memory |
| no_cuda | true | false | Explicit CPU-only |
| backend | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
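The CPU column translates into roughly these HuggingFace trainer kwargs (a sketch; exact argument names vary across transformers/trl versions, e.g. newer transformers releases spell no_cuda as use_cpu, and max_seq_length belongs to the trl SFT config rather than TrainingArguments):

```python
# CPU-optimised hyperparameters from the table above, as plain kwargs.
CPU_TRAINING_ARGS = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_seq_length=1024,          # trl SFT-style argument
    optim="adamw_torch",
    gradient_checkpointing=True,
    no_cuda=True,                 # use_cpu=True in newer transformers
    fp16=False,
    bf16=False,                   # i.e. train in full float32
)
LORA_R = 16                       # vs r=64 on GPU
```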
Pipeline Integration
Kubeflow UI / Argo cron trigger
│
▼
cpu_training_pipeline.yaml
│
├── 1. fetch_pdfs_from_s3 (python:3.13-slim, boto3)
├── 2. prepare_training_data (python:3.13-slim, PyMuPDF)
├── 3. upload_data_to_s3 (python:3.13-slim, boto3)
├── 4. submit_ray_training_job (python:3.13-slim, kubernetes + boto3)
│ └── RayJob: N workers × (4 CPU, 16Gi)
│ Image: rayproject/ray:2.44.1-py311-cpu
│ Deps: torch, peft, trl, transformers, accelerate, datasets
├── 5. download_adapter_from_s3 (python:3.13-slim, boto3)
├── 6. evaluate_adapter_cpu (python:3.13-slim, torch+peft)
├── 7. push_adapter_to_gitea (python:3.13-slim, requests)
└── 8. log_training_metrics (python:3.13-slim, mlflow)
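Step 4's "poll RayJob status until SUCCEEDED" behaviour can be sketched with the Kubernetes Python client; the terminal states follow the RayJob status.jobStatus field, and the kubernetes import is deferred so the helper logic stays testable outside the cluster (namespace and timings are illustrative):

```python
# Sketch of the KFP component's RayJob polling loop.
import time

TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED"}

def is_terminal(job_status: str) -> bool:
    return job_status in TERMINAL

def wait_for_rayjob(name: str, namespace: str = "kubeflow", poll_s: int = 30):
    from kubernetes import client, config  # lazy: only needed in-cluster
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    while True:
        job = api.get_namespaced_custom_object(
            group="ray.io", version="v1", namespace=namespace,
            plural="rayjobs", name=name)
        status = job.get("status", {}).get("jobStatus", "")
        if is_terminal(status):
            if status != "SUCCEEDED":
                raise RuntimeError(f"RayJob {name} ended with status {status}")
            return
        time.sleep(poll_s)
```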
RBAC Requirements
The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeflow-rayjob-creator
rules:
- apiGroups: ["ray.io"]
resources: ["rayjobs"]
verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create", "delete"]
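The ClusterRole above also needs a binding to the pipeline's service account (the SA name below is illustrative; KFP installations commonly use pipeline-runner):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeflow-rayjob-creator
subjects:
  - kind: ServiceAccount
    name: pipeline-runner      # adjust to the SA the pipeline actually runs as
    namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-rayjob-creator
```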
DGX Spark Upgrade Path
Device Specifications
| Spec | Value |
|---|---|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |
Integration Plan
- Network: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
- Kubernetes: Join the DGX Spark as a Kubernetes worker node with taints:
  taints:
    - key: nvidia.com/training
      value: "true"
      effect: NoSchedule
  labels:
    node.kubernetes.io/instance-type: dgx-spark
    accelerator: blackwell
- KubeRay: Add a dedicated training RayCluster (separate from the inference RayService) or submit RayJob resources that request the DGX Spark's GPU. The same TorchTrainer pattern applies — just change use_gpu=True in ScalingConfig.
- Pipeline: Create dgx_spark_training_pipeline.py mirroring the CPU pipeline but with:
  - set_accelerator_type("nvidia.com/gpu")
  - set_gpu_limit(1)
  - node_selector={"accelerator": "blackwell"}
  - BitsAndBytesConfig 4-bit NF4 quantisation (like the existing QLoRA pipeline)
  - bf16 compute dtype
  - Larger models: 8B–70B
- MLflow: Same experiment tracking; tag runs with training_device=dgx-spark
- Scheduling: Training jobs tolerate the nvidia.com/training taint; inference deployments do not, guaranteeing workload isolation.
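On the pod side, the scheduling split amounts to the following fragment on training pod specs only; inference deployments simply omit the toleration and are repelled by the taint:

```yaml
# Training pods only: tolerate the DGX taint and pin to the node.
spec:
  nodeSelector:
    accelerator: blackwell
  tolerations:
    - key: nvidia.com/training
      operator: Equal
      value: "true"
      effect: NoSchedule
```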
Model Capacity on DGX Spark
| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |
Migration Strategy
Phase 1 (now): Distributed CPU via RayJob → small models (≤8B), ~1–5 h per run, up to 14 workers across 14 nodes; Pis participate for ≤3B models; zero hardware cost
Phase 2 (DGX): GPU pipelines on DGX Spark → large models (≤70B), minutes per run, dedicated hardware
Phase 3 (hybrid): CPU for lightweight experiments + DGX for production fine-tunes; both report to MLflow. The training cluster can also include the DGX as a Ray worker alongside CPU nodes for mixed runs.
All phases share:
- Same Kubeflow Pipelines UI
- Same S3 data source
- Same Gitea adapter repositories
- Same MLflow experiment tracking
- Same evaluation pipeline
- Same Ray Train TorchTrainer API
Links
- Related: ADR-0054 — Kubeflow Pipeline CI/CD
- Related: ADR-0011 — KubeRay Unified GPU Backend
- Related: kubeflow/qlora_pdf_pipeline.py — Existing GPU QLoRA pipeline
- Related: kubeflow/cpu_training_pipeline.py — New distributed CPU training pipeline (this ADR)
- NVIDIA DGX Spark — Product page
- Ray Train TorchTrainer — Distributed training docs
- PEFT LoRA — LoRA documentation