16 KiB
Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker
- Status: proposed
- Date: 2026-02-16
- Deciders: Billy
- Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training
Context and Problem Statement
The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes:
| Node | GPU | Memory | Workload |
|---|---|---|---|
| khelben | Strix Halo (ROCm) | 128 GB unified | vLLM 70B (0.95 GPU) |
| elminster | RTX 2070 (CUDA) | 8 GB VRAM | Whisper (0.5) + TTS (0.5) |
| drizzt | Radeon 680M (ROCm) | 12 GB VRAM | Embeddings (0.8) |
| danilo | Intel Arc (i915) | ~6 GB shared | Reranker (0.8) |
All GPUs are fully allocated to inference (see ADR-0005, ADR-0011). Training is currently CPU-only and distributed across cluster nodes via Ray Train (ADR-0058).
waterdeep is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see ADR-0037). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads — but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node.
How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure?
Decision Drivers
- 48 GB unified memory is sufficient for medium-large models (e.g., 7B–30B at Q4/Q8 quantisation)
- Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental)
- macOS cannot run Talos Linux — must integrate without Kubernetes
- Ray natively supports heterogeneous clusters with external workers
- Must not impact existing inference serving stability
- Training workloads (ADR-0058) would benefit from a GPU-accelerated worker
- ARM64 architecture requires compatible Python packages and model formats
Considered Options
- External Ray worker on macOS — run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network
- Linux VM on Mac — run UTM/Parallels VM with Linux, join as a Kubernetes node
- K3s agent on macOS — run K3s directly on macOS via Docker Desktop
Decision Outcome
Chosen option: Option 1 — External Ray worker on macOS, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch.
Positive Consequences
- Zero Kubernetes overhead on waterdeep — remains a usable dev workstation
- 48 GB unified memory available for models (vs split VRAM/RAM on discrete GPUs)
- MPS GPU acceleration for both inference and training
- Adds a 5th GPU class to the Ray fleet (Apple MPS alongside ROCm, CUDA, Intel, RDNA2)
- Training jobs (ADR-0058) gain a GPU-accelerated worker
- Can run a secondary LLM instance for overflow or A/B testing
- Quick to set up — single
ray startcommand - Worker can be stopped/started without affecting the cluster
Negative Consequences
- Not managed by KubeRay or Flux — requires manual or launchd-based lifecycle management
- Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
- MPS backend has limited operator coverage compared to CUDA/ROCm
- Python environment must be maintained separately (not in a container image)
- No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
- Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)
Pros and Cons of the Options
Option 1: External Ray worker on macOS
- Good, because Ray is designed for heterogeneous multi-node clusters
- Good, because no VM overhead — full access to Metal/MPS and unified memory
- Good, because waterdeep remains a functional dev workstation
- Good, because trivial to start/stop (single process)
- Bad, because not managed by Kubernetes or GitOps
- Bad, because requires manual Python environment management
- Bad, because MPS support in vLLM is experimental
Option 2: Linux VM on Mac
- Good, because would be a standard Kubernetes node
- Good, because managed by KubeRay like other workers
- Bad, because VM overhead reduces available memory (hypervisor, guest OS)
- Bad, because no MPS/Metal GPU passthrough to Linux VMs on Apple Silicon
- Bad, because complex to maintain (VM lifecycle, networking, storage)
- Bad, because wastes the primary advantage (Apple Silicon GPU)
Option 3: K3s agent on macOS
- Good, because Kubernetes-native, managed by Flux
- Bad, because K3s on macOS requires Docker Desktop (resource overhead)
- Bad, because container networking on macOS is fragile
- Bad, because MPS device access from within Docker containers is unreliable
- Bad, because not a supported K3s configuration
Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster (Talos) │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ RayService (ai-inference) — KubeRay managed │ │
│ │ │ │
│ │ Head: wulfgar │ │
│ │ Workers: khelben (ROCm), elminster (CUDA), │ │
│ │ drizzt (RDNA2), danilo (Intel) │ │
│ └──────────────────────┬───────────────────────────────────────────┘ │
│ │ Ray GCS (port 6379) │
│ │ │
└─────────────────────────┼────────────────────────────────────────────────┘
│ Home network (LAN)
│
┌─────────────────────────┼────────────────────────────────────────────────┐
│ waterdeep (Mac Mini M4 Pro) │
│ │ │
│ ┌──────────────────────▼───────────────────────────────────────────┐ │
│ │ External Ray Worker (ray start --address=...) │ │
│ │ │ │
│ │ • 12-core CPU (8P + 4E) + 16-core Neural Engine │ │
│ │ • 48 GB unified memory (shared CPU/GPU) │ │
│ │ • MPS (Metal) GPU backend via PyTorch │ │
│ │ • Custom resource: gpu_apple_mps: 1 │ │
│ │ │ │
│ │ Workloads: │ │
│ │ ├── Inference: secondary LLM (7B–30B), overflow serving │ │
│ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow) │
└──────────────────────────────────────────────────────────────────────────┘
Updated GPU Fleet
| Node | GPU | Backend | Memory | Custom Resource | Workload |
|---|---|---|---|---|---|
| khelben | Strix Halo | ROCm | 128 GB unified | gpu_strixhalo: 1 |
vLLM 70B |
| elminster | RTX 2070 | CUDA | 8 GB VRAM | gpu_nvidia: 1 |
Whisper + TTS |
| drizzt | Radeon 680M | ROCm | 12 GB VRAM | gpu_rdna2: 1 |
Embeddings |
| danilo | Intel Arc | i915/IPEX | ~6 GB shared | gpu_intel: 1 |
Reranker |
| waterdeep | M4 Pro | MPS (Metal) | 48 GB unified | gpu_apple_mps: 1 |
LLM (7B–30B) + Training |
Implementation Plan
1. Network Prerequisites
waterdeep must be able to reach the Ray head node's GCS port:
# From waterdeep, verify connectivity
nc -zv <ray-head-ip> 6379
The Ray head service (ai-inference-raycluster-head-svc) is ClusterIP-only. Options to expose it:
| Approach | Complexity | Recommended |
|---|---|---|
| NodePort service on port 6379 | Low | For initial setup |
| Envoy Gateway TCPRoute | Medium | For production use |
| Tailscale/WireGuard mesh | Medium | If already in use |
2. Python Environment on waterdeep
# Install uv (per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Ray worker environment
uv venv ~/ray-worker --python 3.12
source ~/ray-worker/bin/activate
# Install Ray with ML dependencies
uv pip install "ray[default]==2.53.0" torch torchvision torchaudio \
transformers accelerate peft bitsandbytes \
ray-serve-apps # internal package from Gitea PyPI
# Verify MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"
3. Start Ray Worker
# Join the cluster with custom resources
ray start \
--address="<ray-head-ip>:6379" \
--num-cpus=12 \
--num-gpus=1 \
--resources='{"gpu_apple_mps": 1}' \
--block
4. launchd Service (Persistent)
<!-- ~/Library/LaunchAgents/io.ray.worker.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>io.ray.worker</string>
<key>ProgramArguments</key>
<array>
<string>/Users/billy/ray-worker/bin/ray</string>
<string>start</string>
<string>--address=RAY_HEAD_IP:6379</string>
<string>--num-cpus=12</string>
<string>--num-gpus=1</string>
<string>--resources={"gpu_apple_mps": 1}</string>
<string>--block</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ray-worker.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ray-worker-error.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/Users/billy/ray-worker/bin:/usr/local/bin:/usr/bin:/bin</string>
</dict>
</dict>
</plist>
launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
5. Model Cache via NFS
Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:
# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
/Volumes/model-cache
# Or add to /etc/fstab for persistence
# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0
# Symlink to HuggingFace cache location
ln -s /Volumes/model-cache ~/.cache/huggingface/hub
6. Ray Serve Deployment Targeting
To schedule a deployment specifically on waterdeep, use the gpu_apple_mps custom resource in the RayService config:
# In rayservice.yaml serveConfigV2
- name: llm-secondary
route_prefix: /llm-secondary
import_path: ray_serve.serve_llm:app
runtime_env:
env_vars:
MODEL_ID: "Qwen/Qwen2.5-32B-Instruct-AWQ"
DEVICE: "mps"
MAX_MODEL_LEN: "4096"
deployments:
- name: LLMDeployment
num_replicas: 1
ray_actor_options:
num_gpus: 0.95
resources:
gpu_apple_mps: 1
7. Training Integration
Ray Train jobs from ADR-0058 will automatically discover waterdeep as an available worker. To prefer it for GPU-accelerated training:
# In cpu_training_pipeline.py — updated to prefer MPS when available
trainer = TorchTrainer(
train_func,
scaling_config=ScalingConfig(
num_workers=1,
use_gpu=True,
resources_per_worker={"gpu_apple_mps": 1},
),
)
Monitoring
Since waterdeep is not a Kubernetes node, standard Prometheus scraping won't reach it. Options:
| Approach | Notes |
|---|---|
| Prometheus push gateway | Ray worker pushes metrics periodically |
| Node-exporter on macOS | Homebrew node_exporter, scraped by Prometheus via static target |
| Ray Dashboard | Already shows all connected workers (ray-serve.lab.daviestechlabs.io) |
The Ray Dashboard at ray-serve.lab.daviestechlabs.io will automatically show waterdeep as a connected node with its resources, tasks, and memory usage — no additional configuration needed.
Power Management
To prevent macOS from sleeping and disconnecting the Ray worker:
# Disable sleep when on power adapter
sudo pmset -c sleep 0 displaysleep 0 disksleep 0
# Or use caffeinate for the Ray process
caffeinate -s ray start --address=... --block
Security Considerations
- Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
- The Ray worker has no RBAC — it executes whatever tasks the head assigns
- Model weights on NFS are read-only from waterdeep (mount with
rooption if possible) - NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
- Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
Future Considerations
- DGX Spark (ADR-0058): When acquired, waterdeep can shift to secondary inference while DGX Spark handles training
- vLLM MPS maturity: As vLLM's MPS backend matures, waterdeep could serve larger models more efficiently
- MLX backend: Apple's MLX framework may provide better performance than PyTorch MPS for some workloads — worth evaluating as an alternative serving backend
- Second Mac Mini: If another Apple Silicon node is added, the external-worker pattern scales trivially
Links
- Ray Clusters — Adding External Workers
- PyTorch MPS Backend
- vLLM Apple Silicon Support
- Related: ADR-0005 — Multi-GPU strategy
- Related: ADR-0011 — KubeRay unified GPU backend
- Related: ADR-0024 — Ray repository structure
- Related: ADR-0035 — ARM64 worker strategy
- Related: ADR-0037 — Node naming conventions
- Related: ADR-0058 — Training strategy