daviestechlabs/homelab-design

Fork 0

Files

Billy D. b4e608f002

Update README with ADR Index / update-readme (push) Successful in 6s

Details

updates to finish nfs-fast implementation.

2026-02-16 18:08:38 -05:00

16 KiB

Raw Blame History

Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker

Status: proposed
Date: 2026-02-16
Deciders: Billy
Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training

Context and Problem Statement

The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes:

Node	GPU	Memory	Workload
khelben	Strix Halo (ROCm)	128 GB unified	vLLM 70B (0.95 GPU)
elminster	RTX 2070 (CUDA)	8 GB VRAM	Whisper (0.5) + TTS (0.5)
drizzt	Radeon 680M (ROCm)	12 GB VRAM	Embeddings (0.8)
danilo	Intel Arc (i915)	~6 GB shared	Reranker (0.8)

All GPUs are fully allocated to inference (see ADR-0005, ADR-0011). Training is currently CPU-only and distributed across cluster nodes via Ray Train (ADR-0058).

waterdeep is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see ADR-0037). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads — but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node.

How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure?

Decision Drivers

48 GB unified memory is sufficient for medium-large models (e.g., 7B–30B at Q4/Q8 quantisation)
Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental)
macOS cannot run Talos Linux — must integrate without Kubernetes
Ray natively supports heterogeneous clusters with external workers
Must not impact existing inference serving stability
Training workloads (ADR-0058) would benefit from a GPU-accelerated worker
ARM64 architecture requires compatible Python packages and model formats

Considered Options

External Ray worker on macOS — run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network
Linux VM on Mac — run UTM/Parallels VM with Linux, join as a Kubernetes node
K3s agent on macOS — run K3s directly on macOS via Docker Desktop

Decision Outcome

Chosen option: Option 1 — External Ray worker on macOS, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch.

Positive Consequences

Zero Kubernetes overhead on waterdeep — remains a usable dev workstation
48 GB unified memory available for models (vs split VRAM/RAM on discrete GPUs)
MPS GPU acceleration for both inference and training
Adds a 5th GPU class to the Ray fleet (Apple MPS alongside ROCm, CUDA, Intel, RDNA2)
Training jobs (ADR-0058) gain a GPU-accelerated worker
Can run a secondary LLM instance for overflow or A/B testing
Quick to set up — single ray start command
Worker can be stopped/started without affecting the cluster

Negative Consequences

Not managed by KubeRay or Flux — requires manual or launchd-based lifecycle management
Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
MPS backend has limited operator coverage compared to CUDA/ROCm
Python environment must be maintained separately (not in a container image)
No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)

Pros and Cons of the Options

Option 1: External Ray worker on macOS

Good, because Ray is designed for heterogeneous multi-node clusters
Good, because no VM overhead — full access to Metal/MPS and unified memory
Good, because waterdeep remains a functional dev workstation
Good, because trivial to start/stop (single process)
Bad, because not managed by Kubernetes or GitOps
Bad, because requires manual Python environment management
Bad, because MPS support in vLLM is experimental

Option 2: Linux VM on Mac

Good, because would be a standard Kubernetes node
Good, because managed by KubeRay like other workers
Bad, because VM overhead reduces available memory (hypervisor, guest OS)
Bad, because no MPS/Metal GPU passthrough to Linux VMs on Apple Silicon
Bad, because complex to maintain (VM lifecycle, networking, storage)
Bad, because wastes the primary advantage (Apple Silicon GPU)

Option 3: K3s agent on macOS

Good, because Kubernetes-native, managed by Flux
Bad, because K3s on macOS requires Docker Desktop (resource overhead)
Bad, because container networking on macOS is fragile
Bad, because MPS device access from within Docker containers is unreliable
Bad, because not a supported K3s configuration

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster (Talos)                        │
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │            RayService (ai-inference) — KubeRay managed           │    │
│  │                                                                   │    │
│  │  Head: wulfgar                                                    │    │
│  │  Workers: khelben (ROCm), elminster (CUDA),                       │    │
│  │           drizzt (RDNA2), danilo (Intel)                          │    │
│  └──────────────────────┬───────────────────────────────────────────┘    │
│                         │ Ray GCS (port 6379)                            │
│                         │                                                │
└─────────────────────────┼────────────────────────────────────────────────┘
                          │ Home network (LAN)
                          │
┌─────────────────────────┼────────────────────────────────────────────────┐
│  waterdeep (Mac Mini M4 Pro)                                             │
│                         │                                                │
│  ┌──────────────────────▼───────────────────────────────────────────┐    │
│  │           External Ray Worker (ray start --address=...)          │    │
│  │                                                                   │    │
│  │  • 12-core CPU (8P + 4E) + 16-core Neural Engine                 │    │
│  │  • 48 GB unified memory (shared CPU/GPU)                          │    │
│  │  • MPS (Metal) GPU backend via PyTorch                            │    │
│  │  • Custom resource: gpu_apple_mps: 1                              │    │
│  │                                                                   │    │
│  │  Workloads:                                                       │    │
│  │  ├── Inference: secondary LLM (7B–30B), overflow serving          │    │
│  │  └── Training: LoRA/QLoRA fine-tuning via Ray Train               │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)    │
└──────────────────────────────────────────────────────────────────────────┘

Updated GPU Fleet

Node	GPU	Backend	Memory	Custom Resource	Workload
khelben	Strix Halo	ROCm	128 GB unified	`gpu_strixhalo: 1`	vLLM 70B
elminster	RTX 2070	CUDA	8 GB VRAM	`gpu_nvidia: 1`	Whisper + TTS
drizzt	Radeon 680M	ROCm	12 GB VRAM	`gpu_rdna2: 1`	Embeddings
danilo	Intel Arc	i915/IPEX	~6 GB shared	`gpu_intel: 1`	Reranker
waterdeep	M4 Pro	MPS (Metal)	48 GB unified	`gpu_apple_mps: 1`	LLM (7B–30B) + Training

Implementation Plan

1. Network Prerequisites

waterdeep must be able to reach the Ray head node's GCS port:

# From waterdeep, verify connectivity
nc -zv <ray-head-ip> 6379

The Ray head service (ai-inference-raycluster-head-svc) is ClusterIP-only. Options to expose it:

Approach	Complexity	Recommended
NodePort service on port 6379	Low	For initial setup
Envoy Gateway TCPRoute	Medium	For production use
Tailscale/WireGuard mesh	Medium	If already in use

2. Python Environment on waterdeep

# Install uv (per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create Ray worker environment
uv venv ~/ray-worker --python 3.12
source ~/ray-worker/bin/activate

# Install Ray with ML dependencies
uv pip install "ray[default]==2.53.0" torch torchvision torchaudio \
    transformers accelerate peft bitsandbytes \
    ray-serve-apps  # internal package from Gitea PyPI

# Verify MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"

3. Start Ray Worker

# Join the cluster with custom resources
ray start \
    --address="<ray-head-ip>:6379" \
    --num-cpus=12 \
    --num-gpus=1 \
    --resources='{"gpu_apple_mps": 1}' \
    --block

4. launchd Service (Persistent)

<!-- ~/Library/LaunchAgents/io.ray.worker.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>io.ray.worker</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/billy/ray-worker/bin/ray</string>
        <string>start</string>
        <string>--address=RAY_HEAD_IP:6379</string>
        <string>--num-cpus=12</string>
        <string>--num-gpus=1</string>
        <string>--resources={"gpu_apple_mps": 1}</string>
        <string>--block</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ray-worker.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ray-worker-error.log</string>
    <key>EnvironmentVariables</key>
    <dict>
        <key>PATH</key>
        <string>/Users/billy/ray-worker/bin:/usr/local/bin:/usr/bin:/bin</string>
    </dict>
</dict>
</plist>

launchctl load ~/Library/LaunchAgents/io.ray.worker.plist

5. Model Cache via NFS

Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:

# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
    /Volumes/model-cache

# Or add to /etc/fstab for persistence
# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0

# Symlink to HuggingFace cache location
ln -s /Volumes/model-cache ~/.cache/huggingface/hub

6. Ray Serve Deployment Targeting

To schedule a deployment specifically on waterdeep, use the gpu_apple_mps custom resource in the RayService config:

# In rayservice.yaml serveConfigV2
- name: llm-secondary
  route_prefix: /llm-secondary
  import_path: ray_serve.serve_llm:app
  runtime_env:
    env_vars:
      MODEL_ID: "Qwen/Qwen2.5-32B-Instruct-AWQ"
      DEVICE: "mps"
      MAX_MODEL_LEN: "4096"
  deployments:
    - name: LLMDeployment
      num_replicas: 1
      ray_actor_options:
        num_gpus: 0.95
        resources:
          gpu_apple_mps: 1

7. Training Integration

Ray Train jobs from ADR-0058 will automatically discover waterdeep as an available worker. To prefer it for GPU-accelerated training:

# In cpu_training_pipeline.py — updated to prefer MPS when available
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"gpu_apple_mps": 1},
    ),
)

Monitoring

Since waterdeep is not a Kubernetes node, standard Prometheus scraping won't reach it. Options:

Approach	Notes
Prometheus push gateway	Ray worker pushes metrics periodically
Node-exporter on macOS	Homebrew `node_exporter`, scraped by Prometheus via static target
Ray Dashboard	Already shows all connected workers (ray-serve.lab.daviestechlabs.io)

The Ray Dashboard at ray-serve.lab.daviestechlabs.io will automatically show waterdeep as a connected node with its resources, tasks, and memory usage — no additional configuration needed.

Power Management

To prevent macOS from sleeping and disconnecting the Ray worker:

# Disable sleep when on power adapter
sudo pmset -c sleep 0 displaysleep 0 disksleep 0

# Or use caffeinate for the Ray process
caffeinate -s ray start --address=... --block

Security Considerations

Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
The Ray worker has no RBAC — it executes whatever tasks the head assigns
Model weights on NFS are read-only from waterdeep (mount with ro option if possible)
NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments

Future Considerations

DGX Spark (ADR-0058): When acquired, waterdeep can shift to secondary inference while DGX Spark handles training
vLLM MPS maturity: As vLLM's MPS backend matures, waterdeep could serve larger models more efficiently
MLX backend: Apple's MLX framework may provide better performance than PyTorch MPS for some workloads — worth evaluating as an alternative serving backend
Second Mac Mini: If another Apple Silicon node is added, the external-worker pattern scales trivially

16 KiB Raw Blame History Unescape Escape