updates to finish nfs-fast implementation.

2026-02-16 18:08:32 -05:00
parent 7685b2b757
commit b4e608f002
5 changed files with 134 additions and 37 deletions


@@ -59,7 +59,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
* Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
* MPS backend has limited operator coverage compared to CUDA/ROCm
* Python environment must be maintained separately (not in a container image)
-* No Longhorn storage — model cache managed locally or via NFS mount
+* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
* Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)
## Pros and Cons of the Options
@@ -125,7 +125,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
│ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
-│ Model cache: ~/Library/Caches/huggingface + NFS mount
+│ Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)
└──────────────────────────────────────────────────────────────────────────┘
```
@@ -233,15 +233,15 @@ launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
### 5. Model Cache via NFS
-Mount the NAS model cache on waterdeep so models are shared with the cluster:
+Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:
```bash
-# Mount candlekeep NFS share
-sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \
+# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
+sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
/Volumes/model-cache
# Or add to /etc/fstab for persistence
-# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0
+# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0
# Symlink to HuggingFace cache location
ln -s /Volumes/model-cache ~/.cache/huggingface/hub
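# A quick sanity check (editor's sketch, not from the commit): confirm the
# share is actually mounted before relying on the symlink above.
if mount | grep -q "/Volumes/model-cache"; then
    echo "model-cache is mounted"
fi

# Alternative to the symlink: huggingface_hub also honors the HF_HOME
# environment variable, so the cache root can point at the NFS mount
# directly (assumes a recent huggingface_hub version).
export HF_HOME=/Volumes/model-cache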
@@ -315,6 +315,7 @@ caffeinate -s ray start --address=... --block
* Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
* The Ray worker has no RBAC — it executes whatever tasks the head assigns
* Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible)
+* NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
* Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
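The GCS-port restriction in the first bullet could be sketched as below. This is a hypothetical iptables fragment for the Ray head node's host firewall; the source address `10.0.20.5` is a placeholder for waterdeep's IP, which this document does not specify.

```shell
# Hypothetical: accept Ray GCS traffic (TCP 6379) only from waterdeep
# (placeholder address), and drop it from everywhere else.
iptables -A INPUT -p tcp --dport 6379 -s 10.0.20.5 -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP
```

Rule order matters here: the ACCEPT must precede the catch-all DROP, since iptables evaluates rules top to bottom.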
## Future Considerations