# Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker
* Status: proposed
* Date: 2026-02-16
* Deciders: Billy
* Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training
## Context and Problem Statement
The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes:
| Node | GPU | Memory | Workload |
|------|-----|--------|----------|
| khelben | Strix Halo (ROCm) | 128 GB unified | vLLM 70B (0.95 GPU) |
| elminster | RTX 2070 (CUDA) | 8 GB VRAM | Whisper (0.5) + TTS (0.5) |
| drizzt | Radeon 680M (ROCm) | 12 GB VRAM | Embeddings (0.8) |
| danilo | Intel Arc (i915) | ~6 GB shared | Reranker (0.8) |
All GPUs are fully allocated to inference (see [ADR-0005](0005-multi-gpu-strategy.md), [ADR-0011](0011-kuberay-unified-gpu-backend.md)). Training is currently CPU-only and distributed across cluster nodes via Ray Train ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)).
**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see [ADR-0037](0037-node-naming-conventions.md)). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads — but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node.
How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure?
## Decision Drivers
* 48 GB unified memory is sufficient for medium-large models (e.g., 7B-30B at Q4/Q8 quantisation)
* Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental)
* macOS cannot run Talos Linux — must integrate without Kubernetes
* Ray natively supports heterogeneous clusters with external workers
* Must not impact existing inference serving stability
* Training workloads ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) would benefit from a GPU-accelerated worker
* ARM64 architecture requires compatible Python packages and model formats
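The memory-headroom claim is easy to sanity-check with rough arithmetic (weights only, ignoring KV cache and activation overhead):

```python
def quantised_weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB for a model with params_b
    billion parameters at the given quantisation width (weights only)."""
    return params_b * 1e9 * bits / 8 / 1e9

print(quantised_weight_gb(30, 4))  # 15.0 GB for a 30B model at Q4
print(quantised_weight_gb(30, 8))  # 30.0 GB at Q8, still under 48 GB
```

Even a 30B model at Q8 leaves room for KV cache inside 48 GB of unified memory, which is what makes the 7B-30B range practical.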
## Considered Options
1. **External Ray worker on macOS** — run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network
2. **Linux VM on Mac** — run UTM/Parallels VM with Linux, join as a Kubernetes node
3. **K3s agent on macOS** — run a K3s agent on macOS via Docker Desktop
## Decision Outcome
Chosen option: **Option 1 — External Ray worker on macOS**, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch.
### Positive Consequences
* Zero Kubernetes overhead on waterdeep — remains a usable dev workstation
* 48 GB unified memory available for models (vs split VRAM/RAM on discrete GPUs)
* MPS GPU acceleration for both inference and training
* Adds a 5th GPU class to the Ray fleet (Apple MPS alongside ROCm, CUDA, Intel, RDNA2)
* Training jobs ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) gain a GPU-accelerated worker
* Can run a secondary LLM instance for overflow or A/B testing
* Quick to set up — single `ray start` command
* Worker can be stopped/started without affecting the cluster
### Negative Consequences
* Not managed by KubeRay or Flux — requires manual or launchd-based lifecycle management
* Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
* MPS backend has limited operator coverage compared to CUDA/ROCm
* Python environment must be maintained separately (not in a container image)
* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
* Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)
## Pros and Cons of the Options
### Option 1: External Ray worker on macOS
* Good, because Ray is designed for heterogeneous multi-node clusters
* Good, because no VM overhead — full access to Metal/MPS and unified memory
* Good, because waterdeep remains a functional dev workstation
* Good, because trivial to start/stop (single process)
* Bad, because not managed by Kubernetes or GitOps
* Bad, because requires manual Python environment management
* Bad, because MPS support in vLLM is experimental
### Option 2: Linux VM on Mac
* Good, because would be a standard Kubernetes node
* Good, because managed by KubeRay like other workers
* Bad, because VM overhead reduces available memory (hypervisor, guest OS)
* Bad, because no MPS/Metal GPU passthrough to Linux VMs on Apple Silicon
* Bad, because complex to maintain (VM lifecycle, networking, storage)
* Bad, because wastes the primary advantage (Apple Silicon GPU)
### Option 3: K3s agent on macOS
* Good, because Kubernetes-native, managed by Flux
* Bad, because K3s on macOS requires Docker Desktop (resource overhead)
* Bad, because container networking on macOS is fragile
* Bad, because MPS device access from within Docker containers is unreliable
* Bad, because not a supported K3s configuration
## Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster (Talos)                                               │
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │ RayService (ai-inference) — KubeRay managed                      │    │
│  │                                                                  │    │
│  │ Head: wulfgar                                                    │    │
│  │ Workers: khelben (ROCm), elminster (CUDA),                       │    │
│  │          drizzt (RDNA2), danilo (Intel)                          │    │
│  └──────────────────────┬───────────────────────────────────────────┘    │
│                         │ Ray GCS (port 6379)                            │
│                         │                                                │
└─────────────────────────┼────────────────────────────────────────────────┘
                          │ Home network (LAN)
┌─────────────────────────┼────────────────────────────────────────────────┐
│ waterdeep (Mac Mini M4 Pro)                                              │
│                         │                                                │
│  ┌──────────────────────▼───────────────────────────────────────────┐    │
│  │ External Ray Worker (ray start --address=...)                    │    │
│  │                                                                  │    │
│  │ • 12-core CPU (8P + 4E) + 16-core Neural Engine                  │    │
│  │ • 48 GB unified memory (shared CPU/GPU)                          │    │
│  │ • MPS (Metal) GPU backend via PyTorch                            │    │
│  │ • Custom resource: gpu_apple_mps: 1                              │    │
│  │                                                                  │    │
│  │ Workloads:                                                       │    │
│  │ ├── Inference: secondary LLM (7B-30B), overflow serving          │    │
│  │ └── Training: LoRA/QLoRA fine-tuning via Ray Train               │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)    │
└──────────────────────────────────────────────────────────────────────────┘
```
## Updated GPU Fleet
| Node | GPU | Backend | Memory | Custom Resource | Workload |
|------|-----|---------|--------|-----------------|----------|
| khelben | Strix Halo | ROCm | 128 GB unified | `gpu_strixhalo: 1` | vLLM 70B |
| elminster | RTX 2070 | CUDA | 8 GB VRAM | `gpu_nvidia: 1` | Whisper + TTS |
| drizzt | Radeon 680M | ROCm | 12 GB VRAM | `gpu_rdna2: 1` | Embeddings |
| danilo | Intel Arc | i915/IPEX | ~6 GB shared | `gpu_intel: 1` | Reranker |
| **waterdeep** | **M4 Pro** | **MPS (Metal)** | **48 GB unified** | **`gpu_apple_mps: 1`** | **LLM (7B-30B) + Training** |
## Implementation Plan
### 1. Network Prerequisites
waterdeep must be able to reach the Ray head node's GCS port:
```bash
# From waterdeep, verify connectivity
nc -zv <ray-head-ip> 6379
```
The Ray head service (`ai-inference-raycluster-head-svc`) is ClusterIP-only. Options to expose it:
| Approach | Complexity | Recommended |
|----------|-----------|-------------|
| NodePort service on port 6379 | Low | For initial setup |
| Envoy Gateway TCPRoute | Medium | For production use |
| Tailscale/WireGuard mesh | Medium | If already in use |
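For the initial NodePort approach, a minimal Service might look like the sketch below. The name, namespace, and `nodePort` value are assumptions for illustration, not taken from the repo; KubeRay labels head pods with `ray.io/node-type: head`, which the selector relies on.

```yaml
# Hypothetical NodePort exposing the Ray head GCS port outside the cluster
apiVersion: v1
kind: Service
metadata:
  name: ray-head-gcs-external   # assumed name
  namespace: ai-inference       # assumed namespace
spec:
  type: NodePort
  selector:
    ray.io/node-type: head
  ports:
    - name: gcs
      port: 6379
      targetPort: 6379
      nodePort: 30379           # assumed; must fall in 30000-32767
```

waterdeep would then join via `<any-node-ip>:30379` rather than the ClusterIP.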
### 2. Python Environment on waterdeep
```bash
# Install uv (per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Ray worker environment
uv venv ~/ray-worker --python 3.12
source ~/ray-worker/bin/activate
# Install Ray with ML dependencies
uv pip install "ray[default]==2.53.0" torch torchvision torchaudio \
    transformers accelerate peft bitsandbytes \
    ray-serve-apps  # internal package from Gitea PyPI
# Caveat: bitsandbytes targets CUDA; its MPS support is limited, so
# QLoRA quantisation may not work on Apple Silicon without an alternative
# Verify MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"
```
### 3. Start Ray Worker
```bash
# Join the cluster with custom resources
ray start \
--address="<ray-head-ip>:6379" \
--num-cpus=12 \
--num-gpus=1 \
--resources='{"gpu_apple_mps": 1}' \
--block
```
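Once the worker has joined, the custom resource should appear in the cluster-wide resource view. A small sketch of the check; the live calls are commented out because they need a reachable head:

```python
def has_mps_worker(resources: dict) -> bool:
    """True when at least one gpu_apple_mps unit is registered
    in a ray.cluster_resources()-style dict."""
    return resources.get("gpu_apple_mps", 0) >= 1

# Against the live cluster (needs `ray` installed and the head reachable):
#   import ray
#   ray.init(address="auto")
#   assert has_mps_worker(ray.cluster_resources())

# Shape of the dict Ray returns, given the flags above:
example = {"CPU": 12.0, "GPU": 1.0, "gpu_apple_mps": 1.0}
print(has_mps_worker(example))  # True
```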
### 4. launchd Service (Persistent)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- ~/Library/LaunchAgents/io.ray.worker.plist -->
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.ray.worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/ray-worker/bin/ray</string>
    <string>start</string>
    <string>--address=RAY_HEAD_IP:6379</string>
    <string>--num-cpus=12</string>
    <string>--num-gpus=1</string>
    <string>--resources={"gpu_apple_mps": 1}</string>
    <string>--block</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/ray-worker.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/ray-worker-error.log</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key>
    <string>/Users/billy/ray-worker/bin:/usr/local/bin:/usr/bin:/bin</string>
  </dict>
</dict>
</plist>
```
```bash
launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
# (launchctl load is legacy; on newer macOS, prefer:
#  launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/io.ray.worker.plist)
```
### 5. Model Cache via NFS
Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:
```bash
# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
# macOS NFS clients may need -o resvport if the server only accepts
# requests from privileged source ports
sudo mkdir -p /Volumes/model-cache
sudo mount -t nfs -o resvport \
    gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
    /Volumes/model-cache
# Or add to /etc/fstab for persistence
# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0
# Symlink to the HuggingFace cache location
mkdir -p ~/.cache/huggingface
ln -s /Volumes/model-cache ~/.cache/huggingface/hub
```
### 6. Ray Serve Deployment Targeting
To schedule a deployment specifically on waterdeep, use the `gpu_apple_mps` custom resource in the RayService config:
```yaml
# In rayservice.yaml serveConfigV2
- name: llm-secondary
route_prefix: /llm-secondary
import_path: ray_serve.serve_llm:app
runtime_env:
env_vars:
MODEL_ID: "Qwen/Qwen2.5-32B-Instruct-AWQ"
DEVICE: "mps"
MAX_MODEL_LEN: "4096"
deployments:
- name: LLMDeployment
num_replicas: 1
ray_actor_options:
num_gpus: 0.95
resources:
gpu_apple_mps: 1
```
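The `serve_llm` app is an internal package, so its internals are not shown here. A hypothetical sketch of how such a deployment could honour the `DEVICE` env var, falling back to CPU when MPS is requested but unavailable:

```python
def pick_device(env: dict, mps_available: bool) -> str:
    """Resolve the torch device string from a DEVICE env var,
    falling back to cpu when mps is requested but unavailable."""
    requested = env.get("DEVICE", "cpu")
    if requested == "mps" and not mps_available:
        return "cpu"
    return requested

# In the deployment this would be driven by the real environment:
#   device = torch.device(
#       pick_device(os.environ, torch.backends.mps.is_available()))
print(pick_device({"DEVICE": "mps"}, mps_available=True))   # mps
print(pick_device({"DEVICE": "mps"}, mps_available=False))  # cpu
```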
### 7. Training Integration
Ray Train jobs from [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) will automatically discover waterdeep as an available worker. To prefer it for GPU-accelerated training:
```python
# In cpu_training_pipeline.py — updated to prefer MPS when available
trainer = TorchTrainer(
train_func,
scaling_config=ScalingConfig(
num_workers=1,
use_gpu=True,
resources_per_worker={"gpu_apple_mps": 1},
),
)
```
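One caveat worth encoding inside `train_func` itself: Ray Train's torch device helpers have historically targeted CUDA, so the worker should select MPS explicitly. A sketch of the selection logic (pure logic here; the torch calls are shown only in comments):

```python
def resolve_train_device(mps_available: bool, cuda_available: bool) -> str:
    """Device preference for a Ray Train worker: MPS on waterdeep,
    CUDA on elminster, CPU everywhere else."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"

# Inside train_func:
#   device = torch.device(resolve_train_device(
#       torch.backends.mps.is_available(), torch.cuda.is_available()))
print(resolve_train_device(True, False))   # mps
print(resolve_train_device(False, False))  # cpu
```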
## Monitoring
Since waterdeep is not a Kubernetes node, standard Prometheus scraping won't reach it. Options:
| Approach | Notes |
|----------|-------|
| Prometheus push gateway | Ray worker pushes metrics periodically |
| Node-exporter on macOS | Homebrew `node_exporter`, scraped by Prometheus via static target |
| Ray Dashboard | Already shows all connected workers (ray-serve.lab.daviestechlabs.io) |
The Ray Dashboard will automatically show waterdeep as a connected node, with its resources, tasks, and memory usage visible; no additional configuration is needed.
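For the node-exporter route, the static scrape target might look like the fragment below. The hostname is an assumption following the lab naming convention, and 9100 is node_exporter's default port:

```yaml
# Hypothetical addition to the Prometheus scrape config
scrape_configs:
  - job_name: waterdeep-node
    static_configs:
      - targets: ["waterdeep.lab.daviestechlabs.io:9100"]
        labels:
          role: ray-external-worker
```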
## Power Management
To prevent macOS from sleeping and disconnecting the Ray worker:
```bash
# Disable sleep when on power adapter
sudo pmset -c sleep 0 displaysleep 0 disksleep 0
# Or use caffeinate for the Ray process
caffeinate -s ray start --address=... --block
```
## Security Considerations
* Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
* The Ray worker has no RBAC — it executes whatever tasks the head assigns
* Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible)
* NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
* Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
## Future Considerations
* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): When acquired, waterdeep can shift to secondary inference while DGX Spark handles training
* **vLLM MPS maturity**: As vLLM's MPS backend matures, waterdeep could serve larger models more efficiently
* **MLX backend**: Apple's MLX framework may provide better performance than PyTorch MPS for some workloads — worth evaluating as an alternative serving backend
* **Second Mac Mini**: If another Apple Silicon node is added, the external-worker pattern scales trivially
## Links
* [Ray Clusters — Adding External Workers](https://docs.ray.io/en/latest/cluster/vms/getting-started.html)
* [PyTorch MPS Backend](https://pytorch.org/docs/stable/notes/mps.html)
* [vLLM Apple Silicon Support](https://docs.vllm.ai/en/latest/)
* Related: [ADR-0005](0005-multi-gpu-strategy.md) — Multi-GPU strategy
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend
* Related: [ADR-0024](0024-ray-repository-structure.md) — Ray repository structure
* Related: [ADR-0035](0035-arm64-worker-strategy.md) — ARM64 worker strategy
* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions
* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) — Training strategy