From 200cc5704b2fb8ecc5f7103b68b911de75c8d6be Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Sun, 15 Feb 2026 11:19:09 -0500 Subject: [PATCH] docs: update node inventory and 70B QLoRA feasibility analysis --- ARCHITECTURE.md | 47 ++- .../0058-training-strategy-cpu-dgx-spark.md | 383 ++++++++++++++++++ 2 files changed, 420 insertions(+), 10 deletions(-) create mode 100644 decisions/0058-training-strategy-cpu-dgx-spark.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 727a9eb..af67679 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -115,10 +115,10 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw ┌─────────────────────────────────────────────────────────────────────────────┐ │ PLATFORM LAYER │ ├─────────────────────────────────────────────────────────────────────────────┤ -│ Talos Linux v1.12.1 │ Kubernetes v1.35.0 │ Cilium CNI │ +│ Talos Linux v1.12.x │ Kubernetes v1.35.0 │ Cilium CNI │ │ │ -│ Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt, │ -│ │ danilo (workers) │ +│ 14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers │ +│ │ 5 Raspberry Pi (arm64) workers │ └─────────────────────────────────────────────────────────────────────────────┘ ``` @@ -134,14 +134,41 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw **VIP**: 192.168.100.20 (shared across control plane) -### Worker Nodes +### Worker Nodes — GPU -| Node | IP | CPU | GPU | GPU Memory | Workload | -|------|-------|-----|-----|------------|----------| -| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS | -| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) | -| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings | -| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker | +| Node | IP | CPU | RAM | GPU | GPU Memory | Workload | +|------|-------|-----|-----|-----|------------|----------| +| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS | +| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) | +| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings | +| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker | + +### Worker Nodes — CPU-only (x86_64) + +| Node | IP | CPU | RAM | Workload | +|------|-------|-----|-----|----------| +| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads | +| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads | + +### Worker Nodes — Raspberry Pi (arm64) + +| Node | IP | CPU | RAM | Workload | +|------|-------|-----|-----|----------| +| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services | +| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services | + +### Cluster Totals + +| Resource | Total | +|----------|-------| +| Nodes | 14 (3 control + 11 worker) | +| CPU cores | ~126 | +| System RAM | ~378 GB | +| Architectures | amd64, arm64 | +| GPUs | 4 (NVIDIA, AMD, Intel) | ## Networking diff --git a/decisions/0058-training-strategy-cpu-dgx-spark.md 
b/decisions/0058-training-strategy-cpu-dgx-spark.md new file mode 100644 index 0000000..a5ed8ea --- /dev/null +++ b/decisions/0058-training-strategy-cpu-dgx-spark.md @@ -0,0 +1,383 @@ +# Training Strategy – Distributed CPU Now, DGX Spark Later + +* Status: accepted +* Date: 2026-02-14 +* Deciders: Billy +* Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware + +## Context and Problem Statement + +All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay: + +| Node | GPU | Accelerator | Serving | +|---|---|---|---| +| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) | +| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) | +| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) | +| danilo | Intel Arc | Intel | BGE reranker (0.8) | + +Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has **14 nodes with spare CPU and RAM** that can be pooled for distributed training: + +| Node | CPU | RAM | Architecture | Available for Training | +|------|-----|-----|-------------|----------------------| +| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) | +| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) | +| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) | +| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) | +| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) | +| regis | 4c | 16 GB | amd64 | Fully available | +| wulfgar | 4c | 31 GB | amd64 | Fully available | +| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) | +| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) | + +**Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.** + +Rather than training on a single node, we can **distribute training across all 14 nodes** using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly. + +The NVIDIA **DGX Spark** (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first *dedicated* training accelerator in the cluster. + +We need a training strategy that: + +1. Works **today** with existing hardware (distributed CPU) +2. **Scales horizontally** — add more nodes or cores to reduce training time +3. Extends cleanly to a **DGX Spark** when available +4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack +5. 
Does not impact inference serving + +## Decision Drivers + +* Zero GPU budget available for training today +* Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes) +* Training cadence is low (weekly/monthly, not continuous) +* Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training +* Distributed training recovers time lost from being CPU-only +* Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images +* DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training +* Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos) + +## Considered Options + +1. **Distributed CPU LoRA training via Ray Train + KubeRay RayJob** +2. **Single-node CPU training in a Kubeflow pipeline step** +3. **Reserve a GPU fraction for training (time-share with inference)** +4. **Offload training to cloud (Lambda Labs, RunPod, etc.)** +5. **Wait for DGX Spark before doing any training** + +## Decision Outcome + +Chosen option: **Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark** + +A new `cpu_training_pipeline.py` Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a **KubeRay RayJob** that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's `TorchTrainer` handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend). + +The pipeline follows an 8-step pattern: + +1. Fetch PDFs from Quobjects S3 +2. Prepare instruction-tuning dataset +3. Upload prepared data to S3 (shared storage for Ray workers) +4. Submit KubeRay RayJob (N distributed CPU workers) +5. Download trained adapter from S3 +6. Sanity evaluation on CPU +7. Push adapter to Gitea +8. Log metrics to MLflow + +### Positive Consequences + +* Training starts immediately with zero additional hardware cost +* **Scales horizontally**: up to 14 nodes; Raspberry Pis can participate for small models +* GPUs remain 100% dedicated to inference — no latency impact +* Ephemeral Ray cluster: resources released immediately after training +* Adapters are small (10–100 MB) even from CPU training +* MLflow tracks all experiments regardless of compute backend +* DGX Spark upgrade is additive, not a rewrite + +### Negative Consequences + +* CPU training is slower than GPU even when distributed +* Each worker loads a full copy of the model (data parallelism, not model parallelism) +* Limited to small models (≤8B) per worker due to memory constraints +* float32 training uses ~2× the memory of bf16/fp16 +* Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources + +## Distributed Ray CPU Training Design + +### Architecture + +``` +Kubeflow Pipeline (KFP component) + │ + ├── Creates ConfigMap with training script + ├── Creates KubeRay RayJob CR + │ │ + │ ▼ + │ ┌──────────┐ + │ │ Ray Head │ (coordinator, 0 CPUs for training) + │ └────┬─────┘ + │ │ + │ ┌────┴──────────────────────────────────────────────────────┐ + │ │ │ │ │ │ │ + │ ▼ ▼ ▼ ▼ ▼ ▼ + │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐ + │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │ + │ │khelben ││elmins. 
││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
  │   │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi │
  │   └────────┘└────────┘└────────┘└──────────┘└──────────┘└──────────┘
  │        │         │        │          │           │           │
  │        └─────────┴─── AllReduce (Gloo) ──────────┴───────────┘
  │                              │
  │                        Adapter → S3
  │
  ├── Polls RayJob status until SUCCEEDED
  ├── Downloads adapter from S3
  └── Evaluate → Gitea → MLflow
```

### How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 14 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

**Note:** Raspberry Pi workers (4 GB RAM) can only hold what fits in ~3 GB after OS/system overhead. Even a 3B model's float32 weights are ~12 GB (see the Model Selection table below), so Pi workers can only join a run if the frozen base weights are loaded quantised (int8/4-bit) or the model is well under 1B parameters; for 8B+ models, exclude Pi nodes from the RayJob spec entirely.

Effective batch size = `per_device_batch_size × num_workers × gradient_accumulation_steps`. With 6 workers, the effective batch of `1 × 6 × 16 = 96` is competitive with GPU training batches.

### RayJob Lifecycle

1. **Create**: KFP component creates a `RayJob` CR + `ConfigMap` (training script)
2. **Schedule**: KubeRay operator allocates head pod + N worker pods across nodes
3. **Install**: Workers install pip deps via `runtimeEnvYAML` (torch, peft, trl, etc.)
4. **Train**: `TorchTrainer` runs `train_func` on each worker with DDP and `prepare_trainer` integration for HuggingFace
5. **Save**: Rank-0 worker saves adapter + metadata to S3
6. **Teardown**: `shutdownAfterJobFinishes: true` destroys all pods on completion
7. **TTL**: RayJob CR auto-deleted after 300 seconds

### Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

```yaml
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi"   # smallest models only (see the note under "How It Scales")

# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
```

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.

### Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

```
KFP: prepare_data → /tmp/train.json ───upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                        │
Ray Worker 1: ←──download──────────────────────────────┤
Ray Worker 2: ←──download──────────────────────────────┤
Ray Worker N: ←──download──────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                          │
KFP: download_adapter ←──────────────────────────────────┘
```
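As a concrete reference for the **Train** step above, here is a minimal sketch of the `train_func` each Ray worker could run with the CPU-optimised settings from this ADR. It is illustrative rather than the pipeline's actual script: the base model name, the `"text"` dataset field, and the local file paths are assumptions, and the real script additionally downloads the prepared dataset from S3 and, on rank 0, uploads the finished adapter.

```python
# Sketch of a data-parallel CPU LoRA run with Ray Train + HuggingFace (illustrative values).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.huggingface.transformers import RayTrainReportCallback, prepare_trainer


def train_func(config):
    base_model = config["base_model"]
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Every data-parallel worker holds a full float32 copy of the base model.
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float32)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
    model.enable_input_require_grads()    # required when combining PEFT with gradient checkpointing

    # Assumes a prepared JSON dataset with a "text" field, already present locally.
    dataset = load_dataset("json", data_files=config["train_file"])["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True, remove_columns=dataset.column_names)

    args = TrainingArguments(
        output_dir="/tmp/adapter",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        optim="adamw_torch",
        num_train_epochs=config["epochs"],
        no_cuda=True,                     # use_cpu=True on newer transformers releases
        ddp_backend="gloo",               # CPU AllReduce backend
        report_to=[],
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.add_callback(RayTrainReportCallback())   # surfaces per-step metrics to Ray Train
    trainer = prepare_trainer(trainer)               # wires the HF Trainer into Ray's DDP setup
    trainer.train()


trainer = TorchTrainer(
    train_func,
    train_loop_config={"base_model": "Qwen/Qwen2.5-3B-Instruct",   # illustrative choice
                       "train_file": "/tmp/train.json", "epochs": 3},
    scaling_config=ScalingConfig(num_workers=6, use_gpu=False,
                                 resources_per_worker={"CPU": 8}),
)
trainer.fit()
```

In the real pipeline this function is shipped to the workers via the ConfigMap created in step 1 of the lifecycle above, so the worker count and per-worker resources can change between runs without touching the script.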
### Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |

### Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

**Short answer: not with data parallelism alone; it requires sharding the model's parameters across workers (DeepSpeed ZeRO-3 or FSDP).**

With the current data-parallel design (Ray Train `TorchTrainer`), **every worker loads a full copy of the model**. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No — need ≥48 GB per worker | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | **Possible** — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | **Possible** — PyTorch native | Same node constraints |

**Feasible pool for 70B parameter sharding (x86 nodes with ≥27 GB RAM):**

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|---------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| **Total** | **~276 GB** | **Sufficient for 70B NF4 + optimizer** |

A 70B model at 4-bit with LoRA optimizer states needs roughly 60–80 GB total. With 276 GB available across 5 nodes using ZeRO-3 sharding, this is feasible — but introduces significant complexity:

1. **DeepSpeed ZeRO-3** or **FSDP** replaces simple DDP data parallelism
2. **Network bandwidth** becomes critical — Gloo collectives over 1GbE are very slow when parameter shards move between workers every step
3. **Training time** would be measured in days, not hours
4. **Raspberry Pis are excluded** — too little RAM for any shard of a 70B model
5. **10GbE or InfiniBand** would be strongly recommended for tolerable training speed

**Recommendation:** 70B training is technically possible with ZeRO-3/FSDP sharding across the large x86 nodes but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for **≤8B models with data parallelism**, which works well across all node types.

### Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| `dtype` | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| `optim` | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| `batch_size` | 1 | 2 | Memory constrained per worker |
| `gradient_accumulation` | 16 | 8 | Compensate for small batch |
| `max_seq_length` | 1024 | 2048 | Halved to fit in RAM |
| `lora_r` | 16 | 64 | Fewer params, faster |
| `gradient_checkpointing` | true | false | Trade compute for memory |
| `no_cuda` | true | false | Explicit CPU-only |
| `backend` | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
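In the pipeline listed below, step 4 (`submit_ray_training_job`) is the only component that talks to the Kubernetes API directly. As a rough illustration of what that component might do, here is a hedged sketch of creating the RayJob CR with the Kubernetes Python client. The namespace, object names, entrypoint path, replica count, and resource sizes are placeholders; the real component also mounts the ConfigMap holding the training script, defines one worker group per node class rather than the single group shown here, and then polls the CR until it reports `SUCCEEDED`.

```python
# Illustrative sketch: create an ephemeral RayJob from inside a KFP component.
# All names, sizes, and the namespace are assumptions, not the pipeline's actual values.
from kubernetes import client, config

config.load_incluster_config()          # the component runs inside the cluster


def ray_pod(cpu: str, memory: str) -> dict:
    """Minimal pod template for a Ray head or worker group."""
    return {"spec": {"containers": [{
        "name": "ray",
        "image": "rayproject/ray:2.44.1-py311-cpu",
        "resources": {"requests": {"cpu": cpu, "memory": memory},
                      "limits": {"cpu": cpu, "memory": memory}},
    }]}}


rayjob = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "cpu-lora-train", "namespace": "kubeflow"},
    "spec": {
        "entrypoint": "python /home/ray/scripts/train.py",
        "shutdownAfterJobFinishes": True,   # tear everything down when training ends
        "ttlSecondsAfterFinished": 300,     # then delete the RayJob CR itself
        "runtimeEnvYAML": "pip: [torch, transformers, datasets, accelerate, peft, trl]",
        "rayClusterSpec": {
            "headGroupSpec": {"rayStartParams": {}, "template": ray_pod("2", "4Gi")},
            "workerGroupSpecs": [{
                "groupName": "cpu-workers",
                "replicas": 6, "minReplicas": 6, "maxReplicas": 6,
                "rayStartParams": {},
                "template": ray_pod("8", "32Gi"),
            }],
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="kubeflow", plural="rayjobs", body=rayjob)
```

### Pipeline Integration

```
Kubeflow UI / Argo cron trigger
        │
        ▼
  cpu_training_pipeline.yaml
        │
        ├── 1. fetch_pdfs_from_s3        (python:3.13-slim, boto3)
        ├── 2. prepare_training_data     (python:3.13-slim, PyMuPDF)
        ├── 3. upload_data_to_s3         (python:3.13-slim, boto3)
        ├── 4. 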
submit_ray_training_job (python:3.13-slim, kubernetes + boto3) + │ └── RayJob: N workers × (4 CPU, 16Gi) + │ Image: rayproject/ray:2.44.1-py311-cpu + │ Deps: torch, peft, trl, transformers, accelerate, datasets + ├── 5. download_adapter_from_s3 (python:3.13-slim, boto3) + ├── 6. evaluate_adapter_cpu (python:3.13-slim, torch+peft) + ├── 7. push_adapter_to_gitea (python:3.13-slim, requests) + └── 8. log_training_metrics (python:3.13-slim, mlflow) +``` + +### RBAC Requirements + +The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: kubeflow-rayjob-creator +rules: + - apiGroups: ["ray.io"] + resources: ["rayjobs"] + verbs: ["create", "get", "list", "watch", "delete"] + - apiGroups: [""] + resources: ["configmaps"] + verbs: ["create", "delete"] +``` + +## DGX Spark Upgrade Path + +### Device Specifications + +| Spec | Value | +|---|---| +| SoC | GB10 Grace Blackwell Superchip | +| CPU | NVIDIA Grace (Arm), 20 cores | +| GPU | Blackwell, CUDA compute capability 12.0 | +| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) | +| AI Performance | 1 PFLOPS FP4 | +| Connectivity | USB-C, 10GbE, Wi-Fi 7 | +| OS | Ubuntu-based DGX OS (Linux) | +| Power | <150 W | + +### Integration Plan + +1. **Network**: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network +2. **Kubernetes**: Join the DGX Spark as a Kubernetes worker node with taints: + ```yaml + taints: + - key: nvidia.com/training + value: "true" + effect: NoSchedule + labels: + node.kubernetes.io/instance-type: dgx-spark + accelerator: blackwell + ``` +3. **KubeRay**: Add a dedicated `training` RayCluster (separate from the inference `RayService`) or submit `RayJob` resources that request the DGX Spark's GPU. The same `TorchTrainer` pattern applies — just change `use_gpu=True` in `ScalingConfig`. +4. **Pipeline**: Create `dgx_spark_training_pipeline.py` mirroring the CPU pipeline but with: + - `set_accelerator_type("nvidia.com/gpu")` + - `set_gpu_limit(1)` + - `node_selector={"accelerator": "blackwell"}` + - BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline) + - bf16 compute dtype + - Larger models: 8B–70B +5. **MLflow**: Same experiment tracking; tag runs with `training_device=dgx-spark` +6. **Scheduling**: Training jobs tolerate the `nvidia.com/training` taint; inference deployments do not. This guarantees workload isolation. + +### Model Capacity on DGX Spark + +| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? 
| +|---|---|---|---|---| +| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) | +| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only | +| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only | +| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA | +| Llama 3.1 405B | 405B | ~230 GB | N/A | No | + +## Migration Strategy + +``` +Phase 1 (now): Distributed CPU via RayJob → small models (≤8B), + ~1–5 h per run, up to 11 workers across 14 nodes + Pis participate for ≤3B models, free + +Phase 2 (DGX): GPU pipelines on DGX Spark → large models (≤70B), + minutes per run, dedicated hardware + +Phase 3 (hybrid): CPU for lightweight experiments + DGX for + production fine-tunes; both report to MLflow + Training cluster can also include DGX as a + Ray worker alongside CPU nodes for mixed runs +``` + +All phases share: +- Same Kubeflow Pipelines UI +- Same S3 data source +- Same Gitea adapter repositories +- Same MLflow experiment tracking +- Same evaluation pipeline +- Same Ray Train `TorchTrainer` API + +## Links + +* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD +* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay Unified GPU Backend +* Related: `kubeflow/qlora_pdf_pipeline.py` — Existing GPU QLoRA pipeline +* Related: `kubeflow/cpu_training_pipeline.py` — New distributed CPU training pipeline (this ADR) +* [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/dgx/spark/) — Product page +* [Ray Train TorchTrainer](https://docs.ray.io/en/latest/train/getting-started-pytorch.html) — Distributed training docs +* [PEFT LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) — LoRA documentation