From 78c704728f4985e3bcc9061fe36829ec82bfe688 Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Mon, 16 Feb 2026 15:41:52 -0500 Subject: [PATCH] update to homelab design to account for waterdeep. --- decisions/0059-mac-mini-ray-worker.md | 337 ++++++++++++++++++++++++++ 1 file changed, 337 insertions(+) create mode 100644 decisions/0059-mac-mini-ray-worker.md diff --git a/decisions/0059-mac-mini-ray-worker.md b/decisions/0059-mac-mini-ray-worker.md new file mode 100644 index 0000000..39fd5be --- /dev/null +++ b/decisions/0059-mac-mini-ray-worker.md @@ -0,0 +1,337 @@ +# Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker + +* Status: proposed +* Date: 2026-02-16 +* Deciders: Billy +* Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training + +## Context and Problem Statement + +The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes: + +| Node | GPU | Memory | Workload | +|------|-----|--------|----------| +| khelben | Strix Halo (ROCm) | 128 GB unified | vLLM 70B (0.95 GPU) | +| elminster | RTX 2070 (CUDA) | 8 GB VRAM | Whisper (0.5) + TTS (0.5) | +| drizzt | Radeon 680M (ROCm) | 12 GB VRAM | Embeddings (0.8) | +| danilo | Intel Arc (i915) | ~6 GB shared | Reranker (0.8) | + +All GPUs are fully allocated to inference (see [ADR-0005](0005-multi-gpu-strategy.md), [ADR-0011](0011-kuberay-unified-gpu-backend.md)). Training is currently CPU-only and distributed across cluster nodes via Ray Train ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)). + +**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see [ADR-0037](0037-node-naming-conventions.md)). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads — but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node. + +How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure? + +## Decision Drivers + +* 48 GB unified memory is sufficient for medium-large models (e.g., 7B–30B at Q4/Q8 quantisation) +* Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental) +* macOS cannot run Talos Linux — must integrate without Kubernetes +* Ray natively supports heterogeneous clusters with external workers +* Must not impact existing inference serving stability +* Training workloads ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) would benefit from a GPU-accelerated worker +* ARM64 architecture requires compatible Python packages and model formats + +## Considered Options + +1. **External Ray worker on macOS** — run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network +2. **Linux VM on Mac** — run UTM/Parallels VM with Linux, join as a Kubernetes node +3. **K3s agent on macOS** — run K3s directly on macOS via Docker Desktop + +## Decision Outcome + +Chosen option: **Option 1 — External Ray worker on macOS**, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch. + +### Positive Consequences + +* Zero Kubernetes overhead on waterdeep — remains a usable dev workstation +* 48 GB unified memory available for models (vs split VRAM/RAM on discrete GPUs) +* MPS GPU acceleration for both inference and training +* Adds a 5th GPU class to the Ray fleet (Apple MPS alongside ROCm, CUDA, Intel, RDNA2) +* Training jobs ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) gain a GPU-accelerated worker +* Can run a secondary LLM instance for overflow or A/B testing +* Quick to set up — single `ray start` command +* Worker can be stopped/started without affecting the cluster + +### Negative Consequences + +* Not managed by KubeRay or Flux — requires manual or launchd-based lifecycle management +* Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail +* MPS backend has limited operator coverage compared to CUDA/ROCm +* Python environment must be maintained separately (not in a container image) +* No Longhorn storage — model cache managed locally or via NFS mount +* Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway) + +## Pros and Cons of the Options + +### Option 1: External Ray worker on macOS + +* Good, because Ray is designed for heterogeneous multi-node clusters +* Good, because no VM overhead — full access to Metal/MPS and unified memory +* Good, because waterdeep remains a functional dev workstation +* Good, because trivial to start/stop (single process) +* Bad, because not managed by Kubernetes or GitOps +* Bad, because requires manual Python environment management +* Bad, because MPS support in vLLM is experimental + +### Option 2: Linux VM on Mac + +* Good, because would be a standard Kubernetes node +* Good, because managed by KubeRay like other workers +* Bad, because VM overhead reduces available memory (hypervisor, guest OS) +* Bad, because no MPS/Metal GPU passthrough to Linux VMs on Apple Silicon +* Bad, because complex to maintain (VM lifecycle, networking, storage) +* Bad, because wastes the primary advantage (Apple Silicon GPU) + +### Option 3: K3s agent on macOS + +* Good, because Kubernetes-native, managed by Flux +* Bad, because K3s on macOS requires Docker Desktop (resource overhead) +* Bad, because container networking on macOS is fragile +* Bad, because MPS device access from within Docker containers is unreliable +* Bad, because not a supported K3s configuration + +## Architecture + +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ Kubernetes Cluster (Talos) │ +│ │ +│ ┌──────────────────────────────────────────────────────────────────┐ │ +│ │ RayService (ai-inference) — KubeRay managed │ │ +│ │ │ │ +│ │ Head: wulfgar │ │ +│ │ Workers: khelben (ROCm), elminster (CUDA), │ │ +│ │ drizzt (RDNA2), danilo (Intel) │ │ +│ └──────────────────────┬───────────────────────────────────────────┘ │ +│ │ Ray GCS (port 6379) │ +│ │ │ +└─────────────────────────┼────────────────────────────────────────────────┘ + │ Home network (LAN) + │ +┌─────────────────────────┼────────────────────────────────────────────────┐ +│ waterdeep (Mac Mini M4 Pro) │ +│ │ │ +│ ┌──────────────────────▼───────────────────────────────────────────┐ │ +│ │ External Ray Worker (ray start --address=...) │ │ +│ │ │ │ +│ │ • 12-core CPU (8P + 4E) + 16-core Neural Engine │ │ +│ │ • 48 GB unified memory (shared CPU/GPU) │ │ +│ │ • MPS (Metal) GPU backend via PyTorch │ │ +│ │ • Custom resource: gpu_apple_mps: 1 │ │ +│ │ │ │ +│ │ Workloads: │ │ +│ │ ├── Inference: secondary LLM (7B–30B), overflow serving │ │ +│ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │ +│ └──────────────────────────────────────────────────────────────────┘ │ +│ │ +│ Model cache: ~/Library/Caches/huggingface + NFS mount │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +## Updated GPU Fleet + +| Node | GPU | Backend | Memory | Custom Resource | Workload | +|------|-----|---------|--------|-----------------|----------| +| khelben | Strix Halo | ROCm | 128 GB unified | `gpu_strixhalo: 1` | vLLM 70B | +| elminster | RTX 2070 | CUDA | 8 GB VRAM | `gpu_nvidia: 1` | Whisper + TTS | +| drizzt | Radeon 680M | ROCm | 12 GB VRAM | `gpu_rdna2: 1` | Embeddings | +| danilo | Intel Arc | i915/IPEX | ~6 GB shared | `gpu_intel: 1` | Reranker | +| **waterdeep** | **M4 Pro** | **MPS (Metal)** | **48 GB unified** | **`gpu_apple_mps: 1`** | **LLM (7B–30B) + Training** | + +## Implementation Plan + +### 1. Network Prerequisites + +waterdeep must be able to reach the Ray head node's GCS port: + +```bash +# From waterdeep, verify connectivity +nc -zv 6379 +``` + +The Ray head service (`ai-inference-raycluster-head-svc`) is ClusterIP-only. Options to expose it: + +| Approach | Complexity | Recommended | +|----------|-----------|-------------| +| NodePort service on port 6379 | Low | For initial setup | +| Envoy Gateway TCPRoute | Medium | For production use | +| Tailscale/WireGuard mesh | Medium | If already in use | + +### 2. Python Environment on waterdeep + +```bash +# Install uv (per ADR-0012) +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Create Ray worker environment +uv venv ~/ray-worker --python 3.12 +source ~/ray-worker/bin/activate + +# Install Ray with ML dependencies +uv pip install "ray[default]==2.53.0" torch torchvision torchaudio \ + transformers accelerate peft bitsandbytes \ + ray-serve-apps # internal package from Gitea PyPI + +# Verify MPS availability +python -c "import torch; print(torch.backends.mps.is_available())" +``` + +### 3. Start Ray Worker + +```bash +# Join the cluster with custom resources +ray start \ + --address=":6379" \ + --num-cpus=12 \ + --num-gpus=1 \ + --resources='{"gpu_apple_mps": 1}' \ + --block +``` + +### 4. launchd Service (Persistent) + +```xml + + + + + + Label + io.ray.worker + ProgramArguments + + /Users/billy/ray-worker/bin/ray + start + --address=RAY_HEAD_IP:6379 + --num-cpus=12 + --num-gpus=1 + --resources={"gpu_apple_mps": 1} + --block + + RunAtLoad + + KeepAlive + + StandardOutPath + /tmp/ray-worker.log + StandardErrorPath + /tmp/ray-worker-error.log + EnvironmentVariables + + PATH + /Users/billy/ray-worker/bin:/usr/local/bin:/usr/bin:/bin + + + +``` + +```bash +launchctl load ~/Library/LaunchAgents/io.ray.worker.plist +``` + +### 5. Model Cache via NFS + +Mount the NAS model cache on waterdeep so models are shared with the cluster: + +```bash +# Mount candlekeep NFS share +sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \ + /Volumes/model-cache + +# Or add to /etc/fstab for persistence +# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0 + +# Symlink to HuggingFace cache location +ln -s /Volumes/model-cache ~/.cache/huggingface/hub +``` + +### 6. Ray Serve Deployment Targeting + +To schedule a deployment specifically on waterdeep, use the `gpu_apple_mps` custom resource in the RayService config: + +```yaml +# In rayservice.yaml serveConfigV2 +- name: llm-secondary + route_prefix: /llm-secondary + import_path: ray_serve.serve_llm:app + runtime_env: + env_vars: + MODEL_ID: "Qwen/Qwen2.5-32B-Instruct-AWQ" + DEVICE: "mps" + MAX_MODEL_LEN: "4096" + deployments: + - name: LLMDeployment + num_replicas: 1 + ray_actor_options: + num_gpus: 0.95 + resources: + gpu_apple_mps: 1 +``` + +### 7. Training Integration + +Ray Train jobs from [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) will automatically discover waterdeep as an available worker. To prefer it for GPU-accelerated training: + +```python +# In cpu_training_pipeline.py — updated to prefer MPS when available +trainer = TorchTrainer( + train_func, + scaling_config=ScalingConfig( + num_workers=1, + use_gpu=True, + resources_per_worker={"gpu_apple_mps": 1}, + ), +) +``` + +## Monitoring + +Since waterdeep is not a Kubernetes node, standard Prometheus scraping won't reach it. Options: + +| Approach | Notes | +|----------|-------| +| Prometheus push gateway | Ray worker pushes metrics periodically | +| Node-exporter on macOS | Homebrew `node_exporter`, scraped by Prometheus via static target | +| Ray Dashboard | Already shows all connected workers (ray-serve.lab.daviestechlabs.io) | + +The Ray Dashboard at `ray-serve.lab.daviestechlabs.io` will automatically show waterdeep as a connected node with its resources, tasks, and memory usage — no additional configuration needed. + +## Power Management + +To prevent macOS from sleeping and disconnecting the Ray worker: + +```bash +# Disable sleep when on power adapter +sudo pmset -c sleep 0 displaysleep 0 disksleep 0 + +# Or use caffeinate for the Ray process +caffeinate -s ray start --address=... --block +``` + +## Security Considerations + +* Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only +* The Ray worker has no RBAC — it executes whatever tasks the head assigns +* Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible) +* Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments + +## Future Considerations + +* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): When acquired, waterdeep can shift to secondary inference while DGX Spark handles training +* **vLLM MPS maturity**: As vLLM's MPS backend matures, waterdeep could serve larger models more efficiently +* **MLX backend**: Apple's MLX framework may provide better performance than PyTorch MPS for some workloads — worth evaluating as an alternative serving backend +* **Second Mac Mini**: If another Apple Silicon node is added, the external-worker pattern scales trivially + +## Links + +* [Ray Clusters — Adding External Workers](https://docs.ray.io/en/latest/cluster/vms/getting-started.html) +* [PyTorch MPS Backend](https://pytorch.org/docs/stable/notes/mps.html) +* [vLLM Apple Silicon Support](https://docs.vllm.ai/en/latest/) +* Related: [ADR-0005](0005-multi-gpu-strategy.md) — Multi-GPU strategy +* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend +* Related: [ADR-0024](0024-ray-repository-structure.md) — Ray repository structure +* Related: [ADR-0035](0035-arm64-worker-strategy.md) — ARM64 worker strategy +* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions +* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) — Training strategy