docs: update node inventory and 70B QLoRA feasibility analysis
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 7s
@@ -115,10 +115,10 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw
 ┌─────────────────────────────────────────────────────────────────────────────┐
 │                               PLATFORM LAYER                                │
 ├─────────────────────────────────────────────────────────────────────────────┤
-│  Talos Linux v1.12.1   │   Kubernetes v1.35.0   │   Cilium CNI              │
+│  Talos Linux v1.12.x   │   Kubernetes v1.35.0   │   Cilium CNI              │
 │                                                                             │
-│  Nodes: storm, bruenor, catti (control) │ elminster, khelben, drizzt,       │
-│                                         │ danilo (workers)                  │
+│  14 nodes: 3 control plane │ 4 GPU workers │ 2 CPU-only x86 workers         │
+│                            │ 5 Raspberry Pi (arm64) workers                 │
 └─────────────────────────────────────────────────────────────────────────────┘
 ```
@@ -134,14 +134,41 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardw
 
 **VIP**: 192.168.100.20 (shared across control plane)
 
-### Worker Nodes
+### Worker Nodes — GPU
 
-| Node | IP | CPU | GPU | GPU Memory | Workload |
-|------|-------|-----|-----|------------|----------|
-| elminster | 192.168.100.31 | Intel | NVIDIA RTX 2070 | 8GB VRAM | Whisper, XTTS |
-| khelben | 192.168.100.32 | AMD Ryzen | AMD Strix Halo | 64GB Unified | vLLM (dedicated) |
-| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H | AMD Radeon 680M | 12GB VRAM | BGE Embeddings |
-| danilo | 192.168.100.41 | Intel Core Ultra 9 | Intel Arc | 16GB Shared | Reranker |
+| Node | IP | CPU | RAM | GPU | GPU Memory | Workload |
+|------|-------|-----|-----|-----|------------|----------|
+| elminster | 192.168.100.31 | Intel (16c) | 62 GB | NVIDIA RTX 2070 | 8 GB VRAM | Whisper, XTTS |
+| khelben | 192.168.100.32 | AMD Ryzen (32c) | 94 GB | AMD Strix Halo | 32 GB Unified | vLLM (dedicated) |
+| drizzt | 192.168.100.40 | AMD Ryzen 7 6800H (16c) | 27 GB | AMD Radeon 680M | 12 GB VRAM | BGE Embeddings |
+| danilo | 192.168.100.41 | Intel Core Ultra 9 (22c) | 62 GB | Intel Arc | 16 GB Shared | Reranker |
+
+### Worker Nodes — CPU-only (x86_64)
+
+| Node | IP | CPU | RAM | Workload |
+|------|-------|-----|-----|----------|
+| regis | 192.168.100.43 | Intel (4c) | 16 GB | General workloads |
+| wulfgar | 192.168.100.42 | Intel (4c) | 31 GB | General workloads |
+
+### Worker Nodes — Raspberry Pi (arm64)
+
+| Node | IP | CPU | RAM | Workload |
+|------|-------|-----|-----|----------|
+| durnan | 192.168.100.54 | Cortex-A72 (4c) | 4 GB | Lightweight services |
+| jarlaxle | 192.168.100.53 | Cortex-A72 (4c) | 4 GB | Lightweight services |
+| mirt | 192.168.100.52 | Cortex-A72 (4c) | 4 GB | Lightweight services |
+| volo | 192.168.100.51 | Cortex-A72 (4c) | 4 GB | Lightweight services |
+| elaith | 192.168.100.55 | Cortex-A72 (4c) | 8 GB | Lightweight services |
+
+### Cluster Totals
+
+| Resource | Total |
+|----------|-------|
+| Nodes | 14 (3 control + 11 worker) |
+| CPU cores | ~126 |
+| System RAM | ~378 GB |
+| Architectures | amd64, arm64 |
+| GPUs | 4 (NVIDIA, AMD, Intel) |
 
 ## Networking
 
decisions/0058-training-strategy-cpu-dgx-spark.md (new file, 383 lines)
@@ -0,0 +1,383 @@

# Training Strategy – Distributed CPU Now, DGX Spark Later

* Status: accepted
* Date: 2026-02-14
* Deciders: Billy
* Technical Story: Enable distributed model fine-tuning on spare CPU capacity without disrupting inference workloads; plan a migration path to dedicated GPU training hardware

## Context and Problem Statement

All GPUs in the homelab cluster are fully allocated to inference serving via KubeRay:

| Node | GPU | Accelerator | Serving |
|---|---|---|---|
| elminster | RTX 2070 8 GB | CUDA | Whisper (0.5) + TTS (0.5) |
| khelben | Strix Halo 128 GB | ROCm | vLLM / LLM (0.95) |
| drizzt | Radeon 680M | ROCm | BGE-Large embeddings (0.8) |
| danilo | Intel Arc | Intel | BGE reranker (0.8) |

Training workloads (QLoRA, LoRA, full fine-tune) cannot share these GPUs without degrading real-time inference latency. However, the cluster has **14 nodes with spare CPU and RAM** that can be pooled for distributed training:

| Node | CPU | RAM | Architecture | Available for Training |
|------|-----|-----|-------------|----------------------|
| storm (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| bruenor (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| catti (cp) | 4c | 16 GB | amd64 | Limited (control plane duties) |
| elminster | 16c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| khelben | 32c | 94 GB | amd64 | Spare CPU (GPU reserved for inference) |
| drizzt | 16c | 27 GB | amd64 | Spare CPU (GPU reserved for inference) |
| danilo | 22c | 62 GB | amd64 | Spare CPU (GPU reserved for inference) |
| regis | 4c | 16 GB | amd64 | Fully available |
| wulfgar | 4c | 31 GB | amd64 | Fully available |
| durnan (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| jarlaxle (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| mirt (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| volo (Pi) | 4c | 4 GB | arm64 | Available (small models only) |
| elaith (Pi) | 4c | 8 GB | arm64 | Available (small models only) |

**Cluster totals: ~126 CPU cores, ~378 GB RAM across 14 nodes.**
Rather than training on a single node, we can **distribute training across all 14 nodes** using Ray Train with data-parallel LoRA, harvesting spare CPU from every machine in the cluster — including Raspberry Pis and control-plane nodes. Each additional worker reduces wall-clock training time roughly linearly.

The NVIDIA **DGX Spark** (GB10 Grace Blackwell, 128 GB unified LPDDR5X, ~1 PFLOPS FP4) is an upcoming desktop-class device specifically designed for local AI development. If purchased, it would provide the first *dedicated* training accelerator in the cluster.

We need a training strategy that:

1. Works **today** with existing hardware (distributed CPU)
2. **Scales horizontally** — add more nodes or cores to reduce training time
3. Extends cleanly to a **DGX Spark** when available
4. Integrates with the existing Kubeflow Pipelines + MLflow + KubeRay stack
5. Does not impact inference serving

## Decision Drivers

* Zero GPU budget available for training today
* Spare CPU/RAM on every node (~126 cores, ~378 GB cluster-wide across 14 nodes)
* Training cadence is low (weekly/monthly, not continuous)
* Small-model fine-tuning (1B–8B) is the primary use case for data-parallel CPU training
* Distributed training recovers time lost from being CPU-only
* Mixed architectures (amd64 + arm64 Raspberry Pis) require multi-arch Ray images
* DGX Spark would unlock larger models (up to ~70B with NF4) and 10–50× faster training
* Must reuse existing pipeline tooling (kfp, MLflow, Gitea adapter repos)

## Considered Options

1. **Distributed CPU LoRA training via Ray Train + KubeRay RayJob**
2. **Single-node CPU training in a Kubeflow pipeline step**
3. **Reserve a GPU fraction for training (time-share with inference)**
4. **Offload training to cloud (Lambda Labs, RunPod, etc.)**
5. **Wait for DGX Spark before doing any training**
## Decision Outcome

Chosen option: **Option 1 — Distributed CPU LoRA training via Ray Train, with a clear upgrade path to DGX Spark**

A new `cpu_training_pipeline.py` Kubeflow pipeline trains small models (Qwen 2.5 3B, Llama 3.2 3B, Phi-3.5 Mini 3.8B, etc.) using LoRA on CPU in float32. The training step submits a **KubeRay RayJob** that creates an ephemeral Ray cluster with N CPU-only workers distributed across all available cluster nodes — including GPU workers (using spare CPU), CPU-only x86 workers, and even Raspberry Pis for small models. Ray Train's `TorchTrainer` handles data-parallel training with gradient synchronisation via AllReduce (Gloo backend).

The pipeline follows an 8-step pattern:

1. Fetch PDFs from Quobjects S3
2. Prepare instruction-tuning dataset
3. Upload prepared data to S3 (shared storage for Ray workers)
4. Submit KubeRay RayJob (N distributed CPU workers)
5. Download trained adapter from S3
6. Sanity evaluation on CPU
7. Push adapter to Gitea
8. Log metrics to MLflow
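Step 4 is where the distributed work happens: every worker computes gradients on its own data shard, and Gloo AllReduce averages those gradients so each replica applies an identical update. A stdlib-only toy of that reduction (the worker count and gradient values here are invented for illustration, not taken from a real run):

```python
def allreduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (what AllReduce computes)."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(g[i] for g in per_worker_grads) / n_workers
        for i in range(n_params)
    ]

# 6 workers, each holding gradients for a toy 3-parameter model
grads = [
    [0.6, -0.3, 0.0],
    [0.2, -0.1, 0.4],
    [0.4, -0.2, 0.2],
    [0.0, -0.4, 0.6],
    [0.8,  0.0, -0.2],
    [0.4, -0.2, 0.2],
]
avg = allreduce_mean(grads)
# every worker now applies the same averaged gradient to its model copy
```

Ray Train's `TorchTrainer` does this for real via `torch.distributed` with the Gloo backend, so the training function itself stays ordinary single-process PyTorch code.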
### Positive Consequences

* Training starts immediately with zero additional hardware cost
* **Scales horizontally**: up to 14 nodes; Raspberry Pis can participate for small models
* GPUs remain 100% dedicated to inference — no latency impact
* Ephemeral Ray cluster: resources released immediately after training
* Adapters are small (10–100 MB) even from CPU training
* MLflow tracks all experiments regardless of compute backend
* DGX Spark upgrade is additive, not a rewrite

### Negative Consequences

* CPU training is slower than GPU even when distributed
* Each worker loads a full copy of the model (data parallelism, not model parallelism)
* Limited to small models (≤8B) per worker due to memory constraints
* float32 training uses ~2× the memory of bf16/fp16
* Requires RBAC setup for pipeline SA to create RayJob/ConfigMap resources
## Distributed Ray CPU Training Design

### Architecture

```
Kubeflow Pipeline (KFP component)
    │
    ├── Creates ConfigMap with training script
    ├── Creates KubeRay RayJob CR
    │        │
    │        ▼
    │   ┌──────────┐
    │   │ Ray Head │  (coordinator, 0 CPUs for training)
    │   └────┬─────┘
    │        │
    │   ┌────┴─────┬─────────┬──────────┬───────────┬────────────┐
    │   ▼          ▼         ▼          ▼           ▼            ▼
    │ ┌────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐┌──────────┐
    │ │Worker 1││Worker 2││Worker 3││ Worker 4 ││ Worker 5 ││Worker N… │
    │ │khelben ││elmins. ││danilo  ││ wulfgar  ││ regis    ││ Pi nodes │
    │ │32c/94Gi││16c/62Gi││22c/62Gi││ 4c/31Gi  ││ 4c/16Gi  ││ 4c/4-8Gi │
    │ └───┬────┘└───┬────┘└───┬────┘└────┬─────┘└────┬─────┘└────┬─────┘
    │     └─────────┴──── AllReduce (Gloo) ──────────┴───────────┘
    │                         │
    │                   Adapter → S3
    │
    ├── Polls RayJob status until SUCCEEDED
    ├── Downloads adapter from S3
    └── Evaluate → Gitea → MLflow
```
### How It Scales

| Workers | Nodes Used | Total CPUs | Est. Time (3B LoRA) | Notes |
|---|---|---|---|---|
| 1 | 1 node | 4–32 | 4–6 h | Baseline, single-worker |
| 4 | 4 nodes (GPU workers) | 86 | 1–1.5 h | Uses spare CPU on GPU nodes |
| 6 | + regis, wulfgar | 94 | 45–60 min | All x86 workers |
| 9 | + control plane | 106 | 30–45 min | Control plane also contributes |
| 11 | + Pis (small models) | ~126 | 20–35 min | Full cluster, arm64 + amd64 |

**Note:** Raspberry Pi workers (4 GB RAM) can only participate in training for models that fit in ~3 GB after OS/system overhead. For ≤3B models with LoRA, this is feasible; for 8B+ models, exclude Pi nodes from the RayJob spec.

Effective batch size = `per_device_batch_size × num_workers × gradient_accumulation_steps`. With 6 workers, an effective batch of `1 × 6 × 16 = 96` is competitive with GPU training batches.
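That formula as a one-line helper, convenient when sizing runs against the table above (the values are the document's own 6-worker example):

```python
def effective_batch_size(per_device_batch_size: int,
                         num_workers: int,
                         gradient_accumulation_steps: int) -> int:
    """Samples contributing to each optimizer step across all workers."""
    return per_device_batch_size * num_workers * gradient_accumulation_steps

# The 6-worker example from the text: 1 × 6 × 16 = 96
batch = effective_batch_size(1, 6, 16)
```

Scaling to the full 11-worker pool while keeping the other knobs fixed grows the effective batch proportionally, which may warrant a matching learning-rate adjustment.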
### RayJob Lifecycle

1. **Create**: KFP component creates a `RayJob` CR + `ConfigMap` (training script)
2. **Schedule**: KubeRay operator allocates head pod + N worker pods across nodes
3. **Install**: Workers install pip deps via `runtimeEnvYAML` (torch, peft, trl, etc.)
4. **Train**: `TorchTrainer` runs `train_func` on each worker with DDP and `prepare_trainer` integration for HuggingFace
5. **Save**: Rank-0 worker saves adapter + metadata to S3
6. **Teardown**: `shutdownAfterJobFinishes: true` destroys all pods on completion
7. **TTL**: RayJob CR auto-deleted after 300 seconds
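A minimal sketch of the RayJob manifest the KFP component creates in step 1, wired for the teardown behaviour in steps 6 and 7. Top-level field names follow the KubeRay RayJob API; the namespace, script path, dependency list, and group name are illustrative assumptions, and pod templates are omitted for brevity:

```python
def build_rayjob(run_id: str, num_workers: int) -> dict:
    """Build a RayJob custom resource as a plain dict (submitted via the k8s API)."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": f"cpu-train-{run_id}", "namespace": "kubeflow"},
        "spec": {
            "entrypoint": "python /home/ray/scripts/train.py",
            "shutdownAfterJobFinishes": True,   # step 6: destroy pods on completion
            "ttlSecondsAfterFinished": 300,     # step 7: auto-delete the CR
            # step 3: workers pip-install training deps at startup
            "runtimeEnvYAML": "pip:\n  - torch\n  - peft\n  - trl\n",
            "rayClusterSpec": {
                # head coordinates only; 0 logical CPUs keeps training off it
                "headGroupSpec": {"rayStartParams": {"num-cpus": "0"}},
                "workerGroupSpecs": [{
                    "groupName": "cpu-workers",
                    "replicas": num_workers,    # step 2: N workers across nodes
                    "rayStartParams": {},
                }],
            },
        },
    }

job = build_rayjob("run-42", 6)
```

In the real pipeline this dict would be applied with the Kubernetes Python client's custom-objects API, and the component then polls the CR status until `SUCCEEDED`.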
### Resource Allocation

Workers are configured per-node based on available resources. The pipeline supports heterogeneous worker specs:

```yaml
# Large x86 workers (khelben, elminster, danilo)
cpu_limit: "8"
memory_limit: "32Gi"

# Medium x86 workers (drizzt, wulfgar)
cpu_limit: "4"
memory_limit: "16Gi"

# Small x86 workers (regis, control plane)
cpu_limit: "2"
memory_limit: "8Gi"

# Raspberry Pi workers (durnan, jarlaxle, mirt, volo, elaith)
cpu_limit: "2"
memory_limit: "3Gi"  # Only for ≤3B models

# Ray head (coordinator only, no training)
cpu_limit: "2"
memory_limit: "4Gi"
```

Each worker requests only spare CPU and RAM — inference GPU allocations are untouched. All pods are ephemeral — zero standing resource cost.
### Data Flow via S3

Training data is shared between KFP and Ray workers through S3:

```
KFP: prepare_data → /tmp/train.json ──upload──→ s3://training-data/ray-training-runs/{run_id}/data/
                                                                  │
Ray Worker 1: ←──download─────────────────────────────────────────┤
Ray Worker 2: ←──download─────────────────────────────────────────┤
Ray Worker N: ←──download─────────────────────────────────────────┘

Ray Worker 0 (rank 0): adapter ──upload──→ s3://training-data/ray-training-runs/{run_id}/adapter/
                                                                  │
KFP: download_adapter ←───────────────────────────────────────────┘
```
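The layout above can live in one small helper that both the KFP components and the Ray training script share, so the two sides never disagree on key names. The function name is hypothetical; the bucket and prefix match the diagram:

```python
def s3_paths(run_id: str, bucket: str = "training-data") -> dict:
    """S3 locations for one training run, as used by KFP and Ray workers."""
    base = f"s3://{bucket}/ray-training-runs/{run_id}"
    return {
        "data": f"{base}/data/",       # KFP uploads; every Ray worker downloads
        "adapter": f"{base}/adapter/", # rank-0 worker uploads; KFP downloads
    }

paths = s3_paths("run-42")
```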
### Model Selection for CPU Training

| Model | Parameters | RAM (float32) | Trainable (LoRA r=16) | Est. Time (4 workers, 16 CPU) |
|---|---|---|---|---|
| Qwen 2.5 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Llama 3.2 3B Instruct | 3B | ~12 GB | ~4 M | 1–1.5 h |
| Phi-3.5 Mini Instruct | 3.8B | ~15 GB | ~5 M | 1.5–2 h |
| Llama 3.1 8B Instruct | 8B | ~32 GB | ~8 M | 3–5 h |
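The "RAM (float32)" column follows from a simple rule of thumb: 4 bytes per parameter for the weights alone (activations, gradients, and optimizer state come on top, which is why the per-worker memory limits sit well above this floor):

```python
def float32_weight_gb(params_billions: float) -> float:
    """Weight memory for a float32 model: 1e9 params × 4 bytes = 4 GB."""
    return params_billions * 4

three_b = float32_weight_gb(3)   # ~12 GB, as in the table
eight_b = float32_weight_gb(8)   # ~32 GB, the practical per-worker ceiling
```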
### Can You QLoRA a 70B Model on CPU by Pooling All Cluster RAM?

**Short answer: not with data parallelism alone; it requires model parallelism (DeepSpeed ZeRO-3 or FSDP).**

With the current data-parallel design (Ray Train `TorchTrainer`), **every worker loads a full copy of the model**. A 70B model at 4-bit NF4 quantisation needs ~40 GB just for weights, plus optimizer states and activations. No single Raspberry Pi (4–8 GB) or small x86 node (16 GB) can hold this.

| Approach | Mechanism | 70B QLoRA Feasible? | Notes |
|----------|-----------|---------------------|-------|
| Data parallelism (current) | Each worker loads full model | No — needs ≥48 GB per worker | Only khelben (94 GB) fits |
| DeepSpeed ZeRO-3 | Model sharded across workers | **Possible** — ~40 GB split across N workers | Needs large x86 nodes only |
| FSDP (Fully Sharded) | Similar to ZeRO-3 | **Possible** — PyTorch native | Same node constraints |

**Feasible pool for 70B model parallelism (x86 nodes with ≥27 GB RAM):**

| Node | RAM | Role in ZeRO-3 Shard |
|------|-----|---------------------|
| khelben | 94 GB | Primary shard host |
| elminster | 62 GB | Shard host |
| danilo | 62 GB | Shard host |
| wulfgar | 31 GB | Shard host |
| drizzt | 27 GB | Shard host |
| **Total** | **~276 GB** | **Sufficient for 70B NF4 + optimizer** |

A 70B model at 4-bit with LoRA optimizer states needs roughly 60–80 GB in total. With ~276 GB available across 5 nodes under ZeRO-3 sharding, this is feasible — but it introduces significant complexity:

1. **DeepSpeed ZeRO-3** or **FSDP** replaces simple DDP data parallelism
2. **Network bandwidth** becomes critical — Gloo AllReduce over 1GbE is very slow for model shards
3. **Training time** would be measured in days, not hours
4. **Raspberry Pis are excluded** — too little RAM for any shard of a 70B model
5. **10GbE or InfiniBand** would be strongly recommended for tolerable training speed

**Recommendation:** 70B training is technically possible with model parallelism across the large x86 nodes, but impractical for regular use. The DGX Spark (128 GB unified memory) is a far better path to 70B fine-tuning. The CPU cluster is best suited for **≤8B models with data parallelism**, which works well across all node types.
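A back-of-envelope check of the sharding arithmetic behind this conclusion: NF4 stores 0.5 bytes per parameter (the ADR's "~40 GB" figure adds quantisation metadata and runtime overhead on top of the raw 35 GB), and ZeRO-3/FSDP divide that footprint across participating workers. These are estimates, not measurements:

```python
def nf4_weight_gb(params_billions: float) -> float:
    """Raw weight memory at 4-bit NF4: 0.5 bytes per parameter."""
    return params_billions * 0.5

def per_node_shard_gb(params_billions: float, num_nodes: int) -> float:
    """Ideal even split of NF4 weights under ZeRO-3/FSDP sharding."""
    return nf4_weight_gb(params_billions) / num_nodes

weights_70b = nf4_weight_gb(70)        # raw weights, before overhead
shard_70b = per_node_shard_gb(70, 5)   # split across the 5 large x86 nodes
```

The per-node weight shard is small; what makes the scheme painful is moving those shards and gradients over 1GbE every step, not fitting them in RAM.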
### Training Configuration (CPU-optimised)

| Parameter | CPU Value | QLoRA (GPU) Value | Rationale |
|---|---|---|---|
| `dtype` | float32 | bf16 + NF4 4-bit | No GPU tensor cores for mixed precision |
| `optim` | adamw_torch | paged_adamw_8bit | No CUDA paging on CPU |
| `batch_size` | 1 | 2 | Memory constrained per worker |
| `gradient_accumulation` | 16 | 8 | Compensate for small batch |
| `max_seq_length` | 1024 | 2048 | Halved to fit in RAM |
| `lora_r` | 16 | 64 | Fewer params, faster |
| `gradient_checkpointing` | true | false | Trade compute for memory |
| `no_cuda` | true | false | Explicit CPU-only |
| `backend` | Gloo (AllReduce) | NCCL | Gloo is the CPU distributed backend |
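The same table expressed as two presets, which is roughly how the pipeline would pass them around. Key names mirror common HuggingFace `TrainingArguments`/PEFT parameter names, but treat this as a sketch of intent rather than a drop-in config:

```python
CPU_CONFIG = {
    "dtype": "float32",
    "optim": "adamw_torch",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,   # compensates for the tiny batch
    "max_seq_length": 1024,              # halved vs GPU to fit in RAM
    "lora_r": 16,
    "gradient_checkpointing": True,      # trade compute for memory
    "no_cuda": True,
    "ddp_backend": "gloo",
}

GPU_QLORA_CONFIG = {
    "dtype": "bf16+nf4",
    "optim": "paged_adamw_8bit",
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "lora_r": 64,
    "gradient_checkpointing": False,
    "no_cuda": False,
    "ddp_backend": "nccl",
}

# Both presets land on an effective batch of 16 per worker
cpu_eff = CPU_CONFIG["per_device_train_batch_size"] * CPU_CONFIG["gradient_accumulation_steps"]
gpu_eff = GPU_QLORA_CONFIG["per_device_train_batch_size"] * GPU_QLORA_CONFIG["gradient_accumulation_steps"]
```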
### Pipeline Integration

```
Kubeflow UI / Argo cron trigger
        │
        ▼
cpu_training_pipeline.yaml
        │
        ├── 1. fetch_pdfs_from_s3        (python:3.13-slim, boto3)
        ├── 2. prepare_training_data     (python:3.13-slim, PyMuPDF)
        ├── 3. upload_data_to_s3         (python:3.13-slim, boto3)
        ├── 4. submit_ray_training_job   (python:3.13-slim, kubernetes + boto3)
        │      └── RayJob: N workers × (4 CPU, 16Gi)
        │          Image: rayproject/ray:2.44.1-py311-cpu
        │          Deps: torch, peft, trl, transformers, accelerate, datasets
        ├── 5. download_adapter_from_s3  (python:3.13-slim, boto3)
        ├── 6. evaluate_adapter_cpu      (python:3.13-slim, torch+peft)
        ├── 7. push_adapter_to_gitea     (python:3.13-slim, requests)
        └── 8. log_training_metrics      (python:3.13-slim, mlflow)
```
### RBAC Requirements

The Kubeflow pipeline service account needs permissions to create RayJob and ConfigMap resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeflow-rayjob-creator
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "delete"]
```
## DGX Spark Upgrade Path

### Device Specifications

| Spec | Value |
|---|---|
| SoC | GB10 Grace Blackwell Superchip |
| CPU | NVIDIA Grace (Arm), 20 cores |
| GPU | Blackwell, CUDA compute capability 12.0 |
| Memory | 128 GB unified LPDDR5X (CPU + GPU shared) |
| AI Performance | 1 PFLOPS FP4 |
| Connectivity | USB-C, 10GbE, Wi-Fi 7 |
| OS | Ubuntu-based DGX OS (Linux) |
| Power | <150 W |
### Integration Plan

1. **Network**: Connect DGX Spark via 10GbE to the lab switch; assign a static IP in the cluster network
2. **Kubernetes**: Join the DGX Spark as a Kubernetes worker node with taints:

   ```yaml
   taints:
     - key: nvidia.com/training
       value: "true"
       effect: NoSchedule
   labels:
     node.kubernetes.io/instance-type: dgx-spark
     accelerator: blackwell
   ```

3. **KubeRay**: Add a dedicated `training` RayCluster (separate from the inference `RayService`) or submit `RayJob` resources that request the DGX Spark's GPU. The same `TorchTrainer` pattern applies — just change `use_gpu=True` in `ScalingConfig`.
4. **Pipeline**: Create `dgx_spark_training_pipeline.py` mirroring the CPU pipeline but with:
   - `set_accelerator_type("nvidia.com/gpu")`
   - `set_gpu_limit(1)`
   - `node_selector={"accelerator": "blackwell"}`
   - BitsAndBytesConfig 4-bit NF4 quantisation (like existing QLoRA pipeline)
   - bf16 compute dtype
   - Larger models: 8B–70B
5. **MLflow**: Same experiment tracking; tag runs with `training_device=dgx-spark`
6. **Scheduling**: Training jobs tolerate the `nvidia.com/training` taint; inference deployments do not. This guarantees workload isolation.
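The upgrade path in one diff: on the training side, the only substantive change is the scaling configuration handed to Ray Train's `TorchTrainer`. Shown here as plain dicts so the comparison is obvious; the field names mirror `ray.train.ScalingConfig` parameters, and the worker counts are illustrative:

```python
# Phase 1 (now): data-parallel CPU workers spread across x86 nodes
CPU_SCALING = {
    "num_workers": 6,
    "use_gpu": False,
    "resources_per_worker": {"CPU": 4},
}

# Phase 2 (DGX): one worker pinned to the dedicated accelerator
DGX_SPARK_SCALING = {
    "num_workers": 1,
    "use_gpu": True,                     # the key switch in ScalingConfig
    "resources_per_worker": {"GPU": 1},
}
```

Everything upstream and downstream of the trainer (dataset prep, S3 layout, Gitea push, MLflow logging) is untouched by this switch, which is what makes the upgrade additive rather than a rewrite.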
### Model Capacity on DGX Spark

| Model | Parameters | Memory (NF4 4-bit) | Memory (bf16) | Fits? |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~6 GB | ~16 GB | Yes (full or QLoRA) |
| Llama 3.1 70B | 70B | ~40 GB | ~140 GB | QLoRA only |
| Qwen 2.5 72B | 72B | ~42 GB | ~144 GB | QLoRA only |
| Mixtral 8×7B | 46.7B | ~28 GB | ~94 GB | QLoRA or full LoRA |
| Llama 3.1 405B | 405B | ~230 GB | N/A | No |
## Migration Strategy

```
Phase 1 (now):     Distributed CPU via RayJob → small models (≤8B),
                   ~1–5 h per run, up to 11 workers across 14 nodes
                   Pis participate for ≤3B models, free

Phase 2 (DGX):     GPU pipelines on DGX Spark → large models (≤70B),
                   minutes per run, dedicated hardware

Phase 3 (hybrid):  CPU for lightweight experiments + DGX for
                   production fine-tunes; both report to MLflow
                   Training cluster can also include DGX as a
                   Ray worker alongside CPU nodes for mixed runs
```

All phases share:

- Same Kubeflow Pipelines UI
- Same S3 data source
- Same Gitea adapter repositories
- Same MLflow experiment tracking
- Same evaluation pipeline
- Same Ray Train `TorchTrainer` API
## Links

* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay Unified GPU Backend
* Related: `kubeflow/qlora_pdf_pipeline.py` — Existing GPU QLoRA pipeline
* Related: `kubeflow/cpu_training_pipeline.py` — New distributed CPU training pipeline (this ADR)
* [NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/dgx/spark/) — Product page
* [Ray Train TorchTrainer](https://docs.ray.io/en/latest/train/getting-started-pytorch.html) — Distributed training docs
* [PEFT LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) — LoRA documentation