# Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker

* Status: proposed
* Date: 2026-02-16
* Deciders: Billy
* Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training

## Context and Problem Statement

The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes:

| Node | GPU | Memory | Workload (GPU fraction) |
|------|-----|--------|-------------------------|
| khelben | Strix Halo (ROCm) | 128 GB unified | vLLM 70B (0.95) |
| elminster | RTX 2070 (CUDA) | 8 GB VRAM | Whisper (0.5) + TTS (0.5) |
| drizzt | Radeon 680M (ROCm) | 12 GB VRAM | Embeddings (0.8) |
| danilo | Intel Arc (i915) | ~6 GB shared | Reranker (0.8) |

All GPUs are fully allocated to inference (see [ADR-0005](0005-multi-gpu-strategy.md), [ADR-0011](0011-kuberay-unified-gpu-backend.md)). Training is currently CPU-only and distributed across cluster nodes via Ray Train ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)).

**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see [ADR-0037](0037-node-naming-conventions.md)). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads, but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node.

How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure?

## Decision Drivers

* 48 GB unified memory is sufficient for medium-large models (e.g., 7B–30B at Q4/Q8 quantisation)
* Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental)
* macOS cannot run Talos Linux, so waterdeep must integrate without Kubernetes
* Ray natively supports heterogeneous clusters with external workers
* Must not impact existing inference serving stability
* Training workloads ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) would benefit from a GPU-accelerated worker
* ARM64 architecture requires compatible Python packages and model formats

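
The memory driver above can be sanity-checked with rough arithmetic. The sketch below estimates the size of the weights alone from parameter count and bits per weight. The `reserved_gib` allowance for macOS, the KV cache, and activations is a hypothetical figure chosen for illustration, not a measurement, and real quantisation formats carry some overhead beyond their nominal bit width.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         total_gib: float = 48.0, reserved_gib: float = 12.0) -> bool:
    """True if the weights fit in the memory left after the reserve.

    reserved_gib is a hypothetical allowance for macOS, the KV cache and
    activations; it is not a measured figure.
    """
    return weight_gib(params_billion, bits_per_weight) <= total_gib - reserved_gib


# Print a small table for a few model sizes and bit widths.
for params, bits in [(7, 8), (30, 4), (30, 8), (70, 8)]:
    print(f"{params}B @ {bits}-bit: {weight_gib(params, bits):.1f} GiB, "
          f"fits={fits(params, bits)}")
```

Under these assumptions a 30B model at 8 bits per weight still leaves headroom, while a 70B model at the same precision does not.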
## Considered Options

1. **External Ray worker on macOS**: run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network
2. **Linux VM on Mac**: run a UTM/Parallels VM with Linux and join it as a Kubernetes node
3. **K3s agent on macOS**: run K3s directly on macOS via Docker Desktop

## Decision Outcome

Chosen option: **Option 1, External Ray worker on macOS**, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch.

### Positive Consequences

* Zero Kubernetes overhead on waterdeep, which remains a usable dev workstation
* 48 GB unified memory available for models (vs split VRAM/RAM on discrete GPUs)
* MPS GPU acceleration for both inference and training
* Adds a 5th GPU class to the Ray fleet (Apple MPS alongside ROCm, CUDA, Intel, RDNA2)
* Training jobs ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) gain a GPU-accelerated worker
* Can run a secondary LLM instance for overflow or A/B testing
* Quick to set up: a single `ray start` command
* Worker can be stopped/started without affecting the cluster

### Negative Consequences

* Not managed by KubeRay or Flux, so lifecycle management is manual or launchd-based
* Network dependency: if waterdeep sleeps or disconnects, Ray tasks on it fail
* MPS backend has limited operator coverage compared to CUDA/ROCm
* Python environment must be maintained separately (not in a container image)
* No Longhorn storage: the model cache is managed locally or via NFS mount from gravenhollow (nfs-fast)
* Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)

## Pros and Cons of the Options

### Option 1: External Ray worker on macOS

* Good, because Ray is designed for heterogeneous multi-node clusters
* Good, because there is no VM overhead: full access to Metal/MPS and unified memory
* Good, because waterdeep remains a functional dev workstation
* Good, because trivial to start/stop (single process)
* Bad, because not managed by Kubernetes or GitOps
* Bad, because requires manual Python environment management
* Bad, because MPS support in vLLM is experimental

### Option 2: Linux VM on Mac

* Good, because would be a standard Kubernetes node
* Good, because managed by KubeRay like other workers
* Bad, because VM overhead reduces available memory (hypervisor, guest OS)
* Bad, because no MPS/Metal GPU passthrough to Linux VMs on Apple Silicon
* Bad, because complex to maintain (VM lifecycle, networking, storage)
* Bad, because wastes the primary advantage (Apple Silicon GPU)

### Option 3: K3s agent on macOS

* Good, because Kubernetes-native, managed by Flux
* Bad, because K3s on macOS requires Docker Desktop (resource overhead)
* Bad, because container networking on macOS is fragile
* Bad, because MPS device access from within Docker containers is unreliable
* Bad, because not a supported K3s configuration

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                        Kubernetes Cluster (Talos)                        │
│                                                                          │
│   ┌──────────────────────────────────────────────────────────────────┐   │
│   │  RayService (ai-inference), KubeRay managed                      │   │
│   │                                                                  │   │
│   │  Head: wulfgar                                                   │   │
│   │  Workers: khelben (ROCm), elminster (CUDA),                      │   │
│   │           drizzt (RDNA2), danilo (Intel)                         │   │
│   └──────────────────────┬───────────────────────────────────────────┘   │
│                          │ Ray GCS (port 6379)                           │
│                          │                                               │
└──────────────────────────┼───────────────────────────────────────────────┘
                           │ Home network (LAN)
                           │
┌──────────────────────────┼───────────────────────────────────────────────┐
│                       waterdeep (Mac Mini M4 Pro)                        │
│                          │                                               │
│   ┌──────────────────────▼───────────────────────────────────────────┐   │
│   │  External Ray Worker (ray start --address=...)                   │   │
│   │                                                                  │   │
│   │  • 12-core CPU (8P + 4E) + 16-core Neural Engine                 │   │
│   │  • 48 GB unified memory (shared CPU/GPU)                         │   │
│   │  • MPS (Metal) GPU backend via PyTorch                           │   │
│   │  • Custom resource: gpu_apple_mps: 1                             │   │
│   │                                                                  │   │
│   │  Workloads:                                                      │   │
│   │  ├── Inference: secondary LLM (7B–30B), overflow serving         │   │
│   │  └── Training: LoRA/QLoRA fine-tuning via Ray Train              │   │
│   └──────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)    │
└──────────────────────────────────────────────────────────────────────────┘
```

## Updated GPU Fleet

| Node | GPU | Backend | Memory | Custom Resource | Workload |
|------|-----|---------|--------|-----------------|----------|
| khelben | Strix Halo | ROCm | 128 GB unified | `gpu_strixhalo: 1` | vLLM 70B |
| elminster | RTX 2070 | CUDA | 8 GB VRAM | `gpu_nvidia: 1` | Whisper + TTS |
| drizzt | Radeon 680M | ROCm | 12 GB VRAM | `gpu_rdna2: 1` | Embeddings |
| danilo | Intel Arc | i915/IPEX | ~6 GB shared | `gpu_intel: 1` | Reranker |
| **waterdeep** | **M4 Pro** | **MPS (Metal)** | **48 GB unified** | **`gpu_apple_mps: 1`** | **LLM (7B–30B) + Training** |

## Implementation Plan

### 1. Network Prerequisites

waterdeep must be able to reach the Ray head node's GCS port:

```bash
# From waterdeep, verify connectivity
nc -zv <ray-head-ip> 6379
```

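
The same reachability check can be scripted, for example as a pre-flight step before the worker starts. This is a minimal sketch using only the Python standard library. The helper name `can_connect` is illustrative, and a successful TCP connect only shows that the port is open, not that a compatible Ray head is listening behind it.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `can_connect("<ray-head-ip>", 6379)` from waterdeep mirrors the `nc -zv` check and returns a boolean that a wrapper script can act on.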
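The Ray head service (`ai-inference-raycluster-head-svc`) is ClusterIP-only, so it must be exposed before waterdeep can join. Ray nodes also communicate directly with each other on additional ports (node manager, object manager, and worker ports), so waterdeep and the in-cluster Ray pods need to reach each other on more than port 6379. Options to expose the head:

| Approach | Complexity | Recommended |
|----------|-----------|-------------|
| NodePort service on port 6379 | Low | For initial setup |
| Envoy Gateway TCPRoute | Medium | For production use |
| Tailscale/WireGuard mesh | Medium | If already in use |
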
### 2. Python Environment on waterdeep

```bash
# Install uv (per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create Ray worker environment
# (the Python and Ray versions should match those running on the Ray head)
uv venv ~/ray-worker --python 3.12
source ~/ray-worker/bin/activate

# Install Ray with ML dependencies
uv pip install "ray[default]==2.53.0" torch torchvision torchaudio \
  transformers accelerate peft bitsandbytes \
  ray-serve-apps  # internal package from Gitea PyPI

# Verify MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"
```

### 3. Start Ray Worker

```bash
# Join the cluster with custom resources.
# Ray officially supports multi-node clusters on Linux only; a macOS node
# has to opt in by setting RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1.
RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 ray start \
  --address="<ray-head-ip>:6379" \
  --num-cpus=12 \
  --num-gpus=1 \
  --resources='{"gpu_apple_mps": 1}' \
  --block
```

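
The `--resources` flag takes a JSON object, which is easy to mis-quote when the command is copied between a shell, a plist, and a script. One way to avoid that is to build the argument list programmatically. The sketch below does so with the standard library only; the function name `ray_start_argv` is hypothetical, and the flag values are the ones used in this ADR.

```python
import json
import shlex


def ray_start_argv(head_address: str, num_cpus: int, num_gpus: int,
                   resources: dict[str, float]) -> list[str]:
    """Build the argv for `ray start`, JSON-encoding the custom resources."""
    return [
        "ray", "start",
        f"--address={head_address}",
        f"--num-cpus={num_cpus}",
        f"--num-gpus={num_gpus}",
        f"--resources={json.dumps(resources)}",
        "--block",
    ]


argv = ray_start_argv("<ray-head-ip>:6379", 12, 1, {"gpu_apple_mps": 1})
print(shlex.join(argv))  # shell-quoted form of the same command
```

The list form can be passed directly to `subprocess.run` without a shell, while `shlex.join` gives a correctly quoted one-liner for a terminal.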
### 4. launchd Service (Persistent)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/io.ray.worker.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.ray.worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/ray-worker/bin/ray</string>
    <string>start</string>
    <string>--address=RAY_HEAD_IP:6379</string>
    <string>--num-cpus=12</string>
    <string>--num-gpus=1</string>
    <string>--resources={"gpu_apple_mps": 1}</string>
    <string>--block</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/ray-worker.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/ray-worker-error.log</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key>
    <string>/Users/billy/ray-worker/bin:/usr/local/bin:/usr/bin:/bin</string>
    <!-- Opt in to joining a multi-node Ray cluster from macOS -->
    <key>RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER</key>
    <string>1</string>
  </dict>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
```

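
Hand-written plists are easy to break with a stray tag. As an alternative, the file can be generated from a dictionary with Python's standard `plistlib` module. This is a sketch only: it mirrors the main keys of the plist above, and `build_launch_agent` and `RAY_BIN` are illustrative names, not part of any existing tooling.

```python
import plistlib

RAY_BIN = "/Users/billy/ray-worker/bin"  # venv path used in this ADR


def build_launch_agent(head_address: str) -> dict:
    """Return the launchd job description as a plain dictionary."""
    return {
        "Label": "io.ray.worker",
        "ProgramArguments": [
            f"{RAY_BIN}/ray", "start",
            f"--address={head_address}",
            "--num-cpus=12",
            "--num-gpus=1",
            '--resources={"gpu_apple_mps": 1}',
            "--block",
        ],
        "RunAtLoad": True,
        "KeepAlive": True,
        "StandardOutPath": "/tmp/ray-worker.log",
        "StandardErrorPath": "/tmp/ray-worker-error.log",
        "EnvironmentVariables": {
            "PATH": f"{RAY_BIN}:/usr/local/bin:/usr/bin:/bin",
        },
    }


# plistlib.dumps emits XML by default and returns bytes.
plist_bytes = plistlib.dumps(build_launch_agent("RAY_HEAD_IP:6379"))
print(plist_bytes.decode())
```

Writing `plist_bytes` to `~/Library/LaunchAgents/io.ray.worker.plist` would produce a well-formed file with the head address filled in from one place.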
### 5. Model Cache via NFS

Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:

```bash
# Create the mount point, then mount the gravenhollow NFS share (all-SSD, dual 10GbE)
sudo mkdir -p /Volumes/model-cache
sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
  /Volumes/model-cache

# Or add to /etc/fstab for persistence
# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0

# Symlink to HuggingFace cache location
# (the parent directory must exist and `hub` must not already exist)
mkdir -p ~/.cache/huggingface
ln -s /Volumes/model-cache ~/.cache/huggingface/hub
```

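
If the share is not mounted when the worker starts, model downloads would silently land in an empty local directory. A small pre-flight check can catch that. The sketch below uses only `os.path`; the function names are hypothetical, and the paths passed in would be the ones from the commands above.

```python
import os


def cache_points_at(cache_link: str, mount_point: str) -> bool:
    """True if cache_link is a symlink that resolves to mount_point."""
    return (os.path.islink(cache_link)
            and os.path.realpath(cache_link) == os.path.realpath(mount_point))


def check_model_cache(cache_link: str, mount_point: str) -> list[str]:
    """Return a list of human-readable problems; empty means the cache looks fine."""
    problems = []
    if not os.path.ismount(mount_point):
        problems.append(f"{mount_point} is not a mount point (NFS share not mounted?)")
    if not cache_points_at(cache_link, mount_point):
        problems.append(f"{cache_link} does not resolve to {mount_point}")
    return problems
```

For this setup the call would be `check_model_cache(os.path.expanduser("~/.cache/huggingface/hub"), "/Volumes/model-cache")`.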
### 6. Ray Serve Deployment Targeting

To schedule a deployment specifically on waterdeep, use the `gpu_apple_mps` custom resource in the RayService config:

```yaml
# In rayservice.yaml serveConfigV2
- name: llm-secondary
  route_prefix: /llm-secondary
  import_path: ray_serve.serve_llm:app
  runtime_env:
    env_vars:
      MODEL_ID: "Qwen/Qwen2.5-32B-Instruct-AWQ"
      DEVICE: "mps"
      MAX_MODEL_LEN: "4096"
  deployments:
    - name: LLMDeployment
      num_replicas: 1
      ray_actor_options:
        num_gpus: 0.95
        resources:
          gpu_apple_mps: 1
```

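
The targeting works because Ray only places an actor on a node that advertises every resource the actor requests. The sketch below is a deliberately simplified model of that rule, not Ray's actual scheduler. The CPU count for khelben is a hypothetical placeholder, and the model ignores resources already consumed by other deployments.

```python
def fits_on_node(request: dict[str, float], available: dict[str, float]) -> bool:
    """True if every requested resource is available in sufficient quantity."""
    return all(available.get(name, 0.0) >= amount
               for name, amount in request.items())


# Resources each node advertises (khelben's CPU count is a placeholder).
nodes = {
    "khelben": {"CPU": 16, "GPU": 1, "gpu_strixhalo": 1},
    "waterdeep": {"CPU": 12, "GPU": 1, "gpu_apple_mps": 1},
}

# What the LLMDeployment replica asks for in ray_actor_options.
request = {"GPU": 0.95, "gpu_apple_mps": 1}

candidates = [name for name, res in nodes.items() if fits_on_node(request, res)]
print(candidates)  # ['waterdeep']
```

Because no Kubernetes-managed worker advertises `gpu_apple_mps`, the custom resource acts as a node selector for waterdeep.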
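### 7. Training Integration

Ray Train jobs from [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) will automatically discover waterdeep as an available worker. To prefer it for GPU-accelerated training:

```python
# In cpu_training_pipeline.py, updated to prefer MPS when available
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_func,  # defined elsewhere in the pipeline; it should select the "mps" device itself
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"gpu_apple_mps": 1},
    ),
    # Assumption: with use_gpu=True Ray Train defaults to the NCCL backend,
    # which macOS PyTorch builds lack, so gloo is requested explicitly.
    torch_config=TorchConfig(backend="gloo"),
)
```
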
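
The training function itself has to place tensors on the MPS device, since Ray's GPU handling is built around CUDA. The sketch below shows one way that could look. The tiny linear model and random batch are placeholders rather than the pipeline's real training step, and `pick_device` is a hypothetical helper.

```python
def pick_device(mps_available: bool, cuda_available: bool = False) -> str:
    """Prefer MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def train_func(config: dict) -> None:
    # torch is imported lazily so this module also loads where torch is absent.
    import torch

    device = torch.device(pick_device(torch.backends.mps.is_available(),
                                      torch.cuda.is_available()))
    in_features = config.get("in_features", 8)

    # Placeholder model and batch; a real job would load data and an optimizer.
    model = torch.nn.Linear(in_features, 1).to(device)
    batch = torch.randn(4, in_features, device=device)
    loss = model(batch).pow(2).mean()
    loss.backward()
```

Keeping the device choice in a pure function makes the same training code usable on waterdeep, on a CUDA node, or on the CPU-only workers.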
## Monitoring

Since waterdeep is not a Kubernetes node, standard Prometheus scraping won't reach it. Options:

| Approach | Notes |
|----------|-------|
| Prometheus push gateway | Ray worker pushes metrics periodically |
| Node-exporter on macOS | Homebrew `node_exporter`, scraped by Prometheus via static target |
| Ray Dashboard | Already shows all connected workers (ray-serve.lab.daviestechlabs.io) |

The Ray Dashboard at `ray-serve.lab.daviestechlabs.io` will automatically show waterdeep as a connected node with its resources, tasks, and memory usage, with no additional configuration needed.

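
For the node-exporter option, one way to register the static target is Prometheus file-based service discovery, which reads a JSON list of target groups. The sketch below renders such a file. The hostname `waterdeep.lab.daviestechlabs.io` and the label names are assumptions for illustration, and 9100 is node_exporter's default port.

```python
import json


def file_sd_targets(host: str, port: int = 9100, **labels: str) -> str:
    """Render a Prometheus file_sd target list for one static host."""
    groups = [{"targets": [f"{host}:{port}"], "labels": labels}]
    return json.dumps(groups, indent=2)


# Hostname and labels are hypothetical; adjust to the real DNS name.
print(file_sd_targets("waterdeep.lab.daviestechlabs.io",
                      node="waterdeep", role="ray-external-worker"))
```

The output would be saved to a file that a `file_sd_configs` entry in the Prometheus scrape configuration points at.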
## Power Management

To prevent macOS from sleeping and disconnecting the Ray worker:

```bash
# Disable sleep when on power adapter
sudo pmset -c sleep 0 displaysleep 0 disksleep 0

# Or use caffeinate for the Ray process
caffeinate -s ray start --address=... --block
```

## Security Considerations

* Ray's GCS port (6379) will be exposed outside the cluster; restrict access with firewall rules to waterdeep's IP only
* The Ray worker has no RBAC: it executes whatever tasks the head assigns
* Model weights on NFS should be read-only from waterdeep (mount with the `ro` option if possible)
* NFS traffic to gravenhollow traverses the LAN; ensure dual 10GbE links are active
* Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments

## Future Considerations

* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): When acquired, waterdeep can shift to secondary inference while DGX Spark handles training
* **vLLM MPS maturity**: As vLLM's MPS backend matures, waterdeep could serve larger models more efficiently
* **MLX backend**: Apple's MLX framework may provide better performance than PyTorch MPS for some workloads, so it is worth evaluating as an alternative serving backend
* **Second Mac Mini**: If another Apple Silicon node is added, the external-worker pattern scales trivially

## Links

* [Ray Clusters: Adding External Workers](https://docs.ray.io/en/latest/cluster/vms/getting-started.html)
* [PyTorch MPS Backend](https://pytorch.org/docs/stable/notes/mps.html)
* [vLLM Apple Silicon Support](https://docs.vllm.ai/en/latest/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) (Multi-GPU strategy)
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) (KubeRay unified GPU backend)
* Related: [ADR-0024](0024-ray-repository-structure.md) (Ray repository structure)
* Related: [ADR-0035](0035-arm64-worker-strategy.md) (ARM64 worker strategy)
* Related: [ADR-0037](0037-node-naming-conventions.md) (Node naming conventions)
* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) (Training strategy)