updates to finish nfs-fast implementation.

2026-02-16 18:08:32 -05:00
parent 7685b2b757
commit b4e608f002
5 changed files with 134 additions and 37 deletions


@@ -59,7 +59,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
* Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
* MPS backend has limited operator coverage compared to CUDA/ROCm
* Python environment must be maintained separately (not in a container image)
-* No Longhorn storage — model cache managed locally or via NFS mount
+* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
* Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)
## Pros and Cons of the Options
@@ -125,7 +125,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
│ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
-│ Model cache: ~/Library/Caches/huggingface + NFS mount
+│ Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)
└──────────────────────────────────────────────────────────────────────────┘
```
@@ -233,15 +233,15 @@ launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
### 5. Model Cache via NFS
-Mount the NAS model cache on waterdeep so models are shared with the cluster:
+Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:
```bash
-# Mount candlekeep NFS share
-sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \
+# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
+sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
/Volumes/model-cache
# Or add to /etc/fstab for persistence
-# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0
+# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0
# Symlink to HuggingFace cache location
ln -s /Volumes/model-cache ~/.cache/huggingface/hub
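# A quick sanity check (editor's sketch, not from the commit): confirm the
# share is actually mounted before relying on the symlink above.
if mount | grep -q "/Volumes/model-cache"; then
    echo "model-cache is mounted"
fi

# Alternative to the symlink: huggingface_hub also honors the HF_HOME
# environment variable, so the cache root can point at the NFS mount
# directly (assumes a recent huggingface_hub version).
export HF_HOME=/Volumes/model-cache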
@@ -315,6 +315,7 @@ caffeinate -s ray start --address=... --block
* Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
* The Ray worker has no RBAC — it executes whatever tasks the head assigns
* Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible)
+* NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
* Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
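The GCS-port restriction in the first bullet could be sketched as below. This is a hypothetical iptables fragment for the Ray head node's host firewall; the source address `10.0.20.5` is a placeholder for waterdeep's IP, which this document does not specify.

```shell
# Hypothetical: accept Ray GCS traffic (TCP 6379) only from waterdeep
# (placeholder address), and drop it from everywhere else.
iptables -A INPUT -p tcp --dport 6379 -s 10.0.20.5 -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP
```

Rule order matters here: the ACCEPT must precede the catch-all DROP, since iptables evaluates rules top to bottom.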
## Future Considerations