diff --git a/decisions/0026-storage-strategy.md b/decisions/0026-storage-strategy.md index 3c82195..2993823 100644 --- a/decisions/0026-storage-strategy.md +++ b/decisions/0026-storage-strategy.md @@ -37,9 +37,10 @@ How do we provide tiered storage that balances performance, reliability, and cap Chosen option: **Option 1 - Longhorn + NFS dual-tier storage** -Two storage tiers optimized for different use cases: +Three storage tiers optimized for different use cases: - **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads -- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage +- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS +- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage ### Positive Consequences @@ -90,7 +91,7 @@ Two storage tiers optimized for different use cases: │ │ │ ┌────────────────────────────────────────────────────────────────┐ │ │ │ candlekeep.lab.daviestechlabs.io │ │ -│ │ (External NAS) │ │ +│ │ (QNAP NAS) │ │ │ │ │ │ │ │ /kubernetes │ │ │ │ ├── jellyfin-media/ (1TB+ media library) │ │ @@ -113,6 +114,38 @@ Two storage tiers optimized for different use cases: │ │ PVC │ │ PVC │ │ PVC │ │ PVC │ │ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ └────────────────────────────────────────────────────────────────────────────┘ + +┌────────────────────────────────────────────────────────────────────────────┐ +│ TIER 3: NFS-FAST │ +│ (High-Performance SSD NFS + S3 Storage) │ +│ │ +│ ┌────────────────────────────────────────────────────────────────┐ │ +│ │ gravenhollow.lab.daviestechlabs.io │ │ +│ │ (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB) │ │ +│ │ │ │ +│ │ NFS: /mnt/gravenhollow/kubernetes │ │ +│ │ ├── ray-model-cache/ (AI model weights - hot) │ │ +│ │ 
├── mlflow-artifacts/ (ML experiment tracking) │ │ +│ │ └── training-data/ (datasets for fine-tuning) │ │ +│ │ │ │ +│ │ S3 (RustFS): http://gravenhollow.lab.daviestechlabs.io:30292 │ │ +│ │ ├── kubeflow-pipelines (pipeline artifacts) │ │ +│ │ ├── training-data (large dataset staging) │ │ +│ │ └── longhorn-backups (off-cluster backup target) │ │ +│ └────────────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────┐ │ +│ │ NFS CSI Driver │ │ +│ │ (csi-driver-nfs) │ │ +│ └───────────┬───────────┘ │ +│ ▼ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │Ray Model │ │ MLflow │ │ Training │ │ +│ │ Cache │ │ Artifact │ │ Data │ │ +│ │ PVC │ │ PVC │ │ PVC │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +└────────────────────────────────────────────────────────────────────────────┘ ``` ## Tier 1: Longhorn Configuration @@ -179,19 +212,79 @@ The naming is intentional - it sets correct expectations: - **Throughput:** Adequate for streaming media, not for databases - **Benefit:** Massive capacity without consuming cluster disk space +## Tier 3: NFS-Fast Configuration + +### Helm Values (second csi-driver-nfs installation) + +A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation. 
+ +```yaml +controller: + replicas: 0 +node: + enabled: false +storageClass: + create: true + name: nfs-fast + parameters: + server: gravenhollow.lab.daviestechlabs.io + share: /mnt/gravenhollow/kubernetes + mountOptions: + - nfsvers=4.2 # Server-side copy, fallocate, SEEK_HOLE + - nconnect=16 # 16 TCP connections across bonded 10GbE + - rsize=1048576 # 1 MB read block size + - wsize=1048576 # 1 MB write block size + - hard # Retry indefinitely on timeout + - noatime # Skip access-time updates + - nodiratime # Skip directory access-time updates + - nocto # Disable close-to-open consistency (read-heavy workloads) + - actimeo=600 # Cache attributes for 10 min + - max_connect=16 # Allow connections to up to 16 different IPs of the same server (session trunking) + reclaimPolicy: Delete + volumeBindingMode: Immediate +``` + +### Performance Tuning Rationale + +| Option | Why | +|--------|-----| +| `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively | +| `nconnect=16` | Opens 16 parallel TCP connections per mount, spreading I/O across both 10GbE bond members | +| `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead | +| `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many | +| `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content | +| `nodiratime` | Skips directory access-time updates; Linux `noatime` already implies this, so it is redundant but harmless | + +### Why "nfs-fast"? 
+ +Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS): +- **All-SSD:** No spinning disk latency — suitable for random I/O workloads like model loading +- **Dual 10GbE:** 2× 10 Gbps network links via link aggregation +- **12.2 TB capacity:** Enough for model cache, artifacts, and training data +- **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups +- **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe + +### S3 Endpoint (RustFS) + +Gravenhollow also provides S3-compatible object storage via RustFS: +- **Endpoint:** `http://gravenhollow.lab.daviestechlabs.io:30292` +- **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging +- **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`) + ## Storage Tier Selection Guide | Workload Type | Storage Class | Rationale | |---------------|---------------|-----------| -| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality | +| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency | | Prometheus/ClickHouse | `longhorn` | High write IOPS required | | Vault | `longhorn` | Security-critical, needs HA | +| AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed | +| MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads | +| Training data | `nfs-fast` | Dataset staging for fine-tuning | | Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads | | Photos (Immich) | `nfs-slow` | Bulk storage for photos | | User files (Nextcloud) | `nfs-slow` | Capacity over speed | -| AI/ML models (Ray) | `nfs-slow` | Large model weights | | Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large | -| MLflow artifacts | `nfs-slow` | Model artifacts storage | ## Volume Usage by Tier @@ -296,14 +389,15 @@ spec: ### When to Choose Each Tier 
-| Requirement | Longhorn | NFS-Slow | -|-------------|----------|----------| -| Low latency | ✅ | ❌ | -| High IOPS | ✅ | ❌ | -| Large capacity | ❌ | ✅ | -| ReadWriteMany (RWX) | Limited | ✅ | -| Node failure survival | ✅ | ✅ (NAS HA) | -| Kubernetes-native | ✅ | ✅ | +| Requirement | Longhorn | NFS-Fast | NFS-Slow | +|-------------|----------|----------|----------| +| Low latency | ✅ | ⚡ | ❌ | +| High IOPS | ✅ | ⚡ | ❌ | +| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ | +| ReadWriteMany (RWX) | Limited | ✅ | ✅ | +| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) | +| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) | +| Kubernetes-native | ✅ | ✅ | ✅ | ## Monitoring @@ -320,11 +414,13 @@ spec: ## Future Enhancements -1. **NAS high availability** - Second NAS with replication -2. **Dedicated storage network** - Separate VLAN for storage traffic +1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS +2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful) 3. **NVMe-oF** - Network NVMe for lower latency 4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn -5. **S3 tier** - MinIO for object storage workloads +5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3 +6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast +7. 
**Longhorn backups to gravenhollow S3** - Use RustFS as off-cluster backup target ## References diff --git a/decisions/0037-node-naming-conventions.md b/decisions/0037-node-naming-conventions.md index c6db68f..0311658 100644 --- a/decisions/0037-node-naming-conventions.md +++ b/decisions/0037-node-naming-conventions.md @@ -82,8 +82,8 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil | Node | Character/Location | Role | Notes | |------|-------------------|------|-------| -| `candlekeep` | Candlekeep | Primary NAS (Synology) | Library fortress, knowledge storage | -| `neverwinter` | Neverwinter | Fast NAS (TrueNAS Scale) | Jewel of the North, all-SSD, nfs-fast | +| `candlekeep` | Candlekeep | Primary NAS (QNAP) | Library fortress, knowledge storage | +| `gravenhollow` | Gravenhollow | Fast NAS (TrueNAS Scale) | Living memory of the Underdark, all-SSD, dual 10GbE, nfs-fast | | `waterdeep` | Waterdeep | Mac Mini dev workstation | City of Splendors, primary city | ### Future Expansion @@ -139,11 +139,11 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil ┌───────────────────────────────────────────────────────────────────────────────┐ │ 🏰 Locations (Off-Cluster Infrastructure) │ │ │ -│ 📚 candlekeep ❄️ neverwinter 🏙️ waterdeep │ -│ Synology NAS TrueNAS Scale (SSD) Mac Mini │ -│ nfs-default nfs-fast Dev workstation │ -│ High capacity High speed Primary dev box │ -│ "Library Fortress" "Jewel of the North" "City of Splendors" │ +│ 📚 candlekeep 🪨 gravenhollow 🏙️ waterdeep │ +│ QNAP NAS TrueNAS Scale (SSD) Mac Mini │ +│ nfs-slow nfs-fast Dev workstation │ +│ High capacity High speed, 12.2TB Primary dev box │ +│ "Library Fortress" "Living Memory" "City of Splendors" │ └───────────────────────────────────────────────────────────────────────────────┘ ``` @@ -152,7 +152,7 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil | Location | Storage Class | Speed | Capacity | Use Case | 
|----------|--------------|-------|----------|----------| | Candlekeep | `nfs-default` | HDD | High | Backups, archives, media | -| Neverwinter | `nfs-fast` | SSD | Medium | Database WAL, hot data | +| Gravenhollow | `nfs-fast` | SSD (12.2 TB) | Medium-High | Database WAL, hot data, model cache | | Longhorn | `longhorn` | Local SSD | Distributed | Replicated app data | ## Node Labels @@ -182,6 +182,6 @@ All nodes are resolvable via: * [Khelben Arunsun](https://forgottenrealms.fandom.com/wiki/Khelben_Arunsun) * [Elminster](https://forgottenrealms.fandom.com/wiki/Elminster_Aumar) * [Candlekeep](https://forgottenrealms.fandom.com/wiki/Candlekeep) -* [Neverwinter](https://forgottenrealms.fandom.com/wiki/Neverwinter) +* [Gravenhollow](https://forgottenrealms.fandom.com/wiki/Gravenhollow) * Related: [ADR-0035](0035-arm64-worker-strategy.md) - ARM64 Worker Strategy * Related: [ADR-0011](0011-kuberay-unified-serving.md) - KubeRay Unified Serving diff --git a/decisions/0059-mac-mini-ray-worker.md b/decisions/0059-mac-mini-ray-worker.md index 39fd5be..c3e188d 100644 --- a/decisions/0059-mac-mini-ray-worker.md +++ b/decisions/0059-mac-mini-ray-worker.md @@ -59,7 +59,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native * Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail * MPS backend has limited operator coverage compared to CUDA/ROCm * Python environment must be maintained separately (not in a container image) -* No Longhorn storage — model cache managed locally or via NFS mount +* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast) * Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway) ## Pros and Cons of the Options @@ -125,7 +125,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native │ │ └── Training: LoRA/QLoRA fine-tuning via Ray Train │ │ │ 
└──────────────────────────────────────────────────────────────────┘ │ │ │ -│ Model cache: ~/Library/Caches/huggingface + NFS mount │ +│ Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow) │ └──────────────────────────────────────────────────────────────────────────┘ ``` @@ -233,15 +233,15 @@ launchctl load ~/Library/LaunchAgents/io.ray.worker.plist ### 5. Model Cache via NFS -Mount the NAS model cache on waterdeep so models are shared with the cluster: +Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS: ```bash -# Mount candlekeep NFS share -sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \ +# Mount gravenhollow NFS share (all-SSD, dual 10GbE) +sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \ /Volumes/model-cache # Or add to /etc/fstab for persistence -# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0 +# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0 # Symlink to HuggingFace cache location ln -s /Volumes/model-cache ~/.cache/huggingface/hub @@ -315,6 +315,7 @@ caffeinate -s ray start --address=... 
--block * Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only * The Ray worker has no RBAC — it executes whatever tasks the head assigns * Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible) +* NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active * Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments ## Future Considerations diff --git a/diagrams/node-naming.mmd b/diagrams/node-naming.mmd index 9c04eb7..85d60bf 100644 --- a/diagrams/node-naming.mmd +++ b/diagrams/node-naming.mmd @@ -31,8 +31,8 @@ flowchart TB end subgraph Infrastructure["🏰 Locations (Off-Cluster Infrastructure)"] - Candlekeep["📚 candlekeep
Synology NAS
nfs-default
Library Fortress"] - Neverwinter["❄️ neverwinter
TrueNAS Scale (SSD)
nfs-fast
Jewel of the North"] + Candlekeep["📚 candlekeep
QNAP NAS
nfs-slow
Library Fortress"] + Gravenhollow["🪨 gravenhollow
TrueNAS Scale (SSD)
nfs-fast · 12.2 TB
Living Memory"] Waterdeep["🏙️ waterdeep
Mac Mini
Dev Workstation
City of Splendors"] end @@ -44,7 +44,7 @@ flowchart TB end ControlPlane -.->|"etcd"| ControlPlane - Wizards -.->|"Fast Storage"| Neverwinter + Wizards -.->|"Fast Storage"| Gravenhollow Wizards -.->|"Backups"| Candlekeep Rogues -.->|"NFS Mounts"| Candlekeep Fighters -.->|"NFS Mounts"| Candlekeep @@ -60,5 +60,5 @@ flowchart TB class Khelben,Elminster,Drizzt,Danilo,Regis wizard class Durnan,Elaith,Jarlaxle,Mirt,Volo rogue class Wulfgar fighter - class Candlekeep,Neverwinter,Waterdeep location + class Candlekeep,Gravenhollow,Waterdeep location class AI,Edge,Compute,Storage workload diff --git a/diagrams/velero-backup.mmd b/diagrams/velero-backup.mmd index 17abe8d..e9f6f33 100644 --- a/diagrams/velero-backup.mmd +++ b/diagrams/velero-backup.mmd @@ -23,7 +23,7 @@ flowchart TB MinIO["MinIO
On-premises S3"] end subgraph Secondary["Secondary: NFS"] - NAS["Synology NAS
Long-term retention"] + NAS["QNAP NAS
Long-term retention"] end end