Updates to finish nfs-fast implementation.
All checks were successful
Update README with ADR Index / update-readme (push) Successful in 6s
@@ -59,7 +59,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
 * Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
 * MPS backend has limited operator coverage compared to CUDA/ROCm
 * Python environment must be maintained separately (not in a container image)
-* No Longhorn storage — model cache managed locally or via NFS mount
+* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
 * Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)

 ## Pros and Cons of the Options
@@ -125,7 +125,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
 │ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │
 │ └──────────────────────────────────────────────────────────────────┘ │
 │ │
-│ Model cache: ~/Library/Caches/huggingface + NFS mount │
+│ Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow) │
 └──────────────────────────────────────────────────────────────────────────┘
 ```

@@ -233,15 +233,15 @@ launchctl load ~/Library/LaunchAgents/io.ray.worker.plist

 ### 5. Model Cache via NFS

-Mount the NAS model cache on waterdeep so models are shared with the cluster:
+Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:

 ```bash
-# Mount candlekeep NFS share
-sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \
+# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
+sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
   /Volumes/model-cache

 # Or add to /etc/fstab for persistence
-# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0
+# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0

 # Symlink to HuggingFace cache location
 ln -s /Volumes/model-cache ~/.cache/huggingface/hub
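As an aside on the symlink step in this hunk: if `~/.cache/huggingface/hub` already exists, or a symlink is undesirable, an alternative sketch is to point Hugging Face at the mount through its `HF_HOME` cache-root environment variable (respected by `huggingface_hub`/`transformers`); the `/Volumes/model-cache` path is the mount point from the commands above.

```bash
# Alternative to the symlink: set the Hugging Face cache root to the NFS mount.
# Add to ~/.zshrc so shells that launch the Ray worker inherit it.
export HF_HOME=/Volumes/model-cache
echo "HF cache root: $HF_HOME"
```

Models would then be cached under `$HF_HOME/hub`, shared with the rest of the cluster via the same NFS export.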
@@ -315,6 +315,7 @@ caffeinate -s ray start --address=... --block
 * Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
 * The Ray worker has no RBAC — it executes whatever tasks the head assigns
 * Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible)
+* NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
 * Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
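The GCS firewall restriction in the first bullet could look like the following nftables fragment on a Linux head node. This is only a hypothetical sketch: the table/chain names and the `10.0.0.50` address standing in for waterdeep are assumptions, not taken from the ADR.

```
# Hypothetical /etc/nftables.d/ray.conf fragment — allow Ray GCS (6379)
# only from waterdeep (placeholder IP 10.0.0.50), drop it from everywhere else
table inet ray {
  chain input {
    type filter hook input priority 0; policy accept;
    tcp dport 6379 ip saddr 10.0.0.50 accept
    tcp dport 6379 drop
  }
}
```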

 ## Future Considerations
