updates to finish nfs-fast implementation.

2026-02-16 18:08:32 -05:00
parent 7685b2b757
commit b4e608f002
5 changed files with 134 additions and 37 deletions
--- a/decisions/0026-storage-strategy.md
+++ b/decisions/0026-storage-strategy.md
@@ -37,9 +37,10 @@ How do we provide tiered storage that balances performance, reliability, and cap
 Chosen option: **Option 1 - Longhorn + NFS dual-tier storage**
-Two storage tiers optimized for different use cases:
+Three storage tiers optimized for different use cases:
 - **`longhorn`** (default): Fast distributed block storage on NVMe/SSDs for databases and critical workloads
-- **`nfs-slow`**: High-capacity NFS storage on external NAS for media, datasets, and bulk storage
+- **`nfs-fast`**: High-performance NFS + S3 storage on gravenhollow (all-SSD TrueNAS Scale, dual 10GbE, 12.2 TB) for AI model cache, hot data, and S3-compatible object storage via RustFS
+- **`nfs-slow`**: High-capacity NFS storage on candlekeep (QNAP HDD NAS) for media, datasets, and bulk storage
 ### Positive Consequences
@@ -90,7 +91,7 @@ Two storage tiers optimized for different use cases:
 │                                                                            │
 │  ┌────────────────────────────────────────────────────────────────┐        │
 │  │                  candlekeep.lab.daviestechlabs.io              │        │
-│  │                        (External NAS)                           │        │
+│  │                         (QNAP NAS)                              │        │
 │  │                                                                 │        │
 │  │   /kubernetes                                                   │        │
 │  │   ├── jellyfin-media/     (1TB+ media library)                 │        │
@@ -113,6 +114,38 @@ Two storage tiers optimized for different use cases:
 │     │   PVC    │  │   PVC    │  │   PVC    │  │   PVC    │                 │
 │     └──────────┘  └──────────┘  └──────────┘  └──────────┘                 │
 └────────────────────────────────────────────────────────────────────────────┘
 ┌────────────────────────────────────────────────────────────────────────────┐
 │                              TIER 3: NFS-FAST                              │
 │                     (High-Performance SSD NFS + S3 Storage)                │
 │                                                                            │
 │  ┌────────────────────────────────────────────────────────────────┐        │
 │  │                gravenhollow.lab.daviestechlabs.io              │        │
 │  │          (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)     │        │
 │  │                                                                │        │
 │  │   NFS: /mnt/gravenhollow/kubernetes                            │        │
 │  │   ├── ray-model-cache/    (AI model weights - hot)             │        │
 │  │   ├── mlflow-artifacts/   (ML experiment tracking)             │        │
 │  │   └── training-data/      (datasets for fine-tuning)           │        │
 │  │                                                                │        │
 │  │   S3 (RustFS): http://gravenhollow.lab.daviestechlabs.io:30292  │        │
 │  │   ├── kubeflow-pipelines   (pipeline artifacts)                │        │
 │  │   ├── training-data        (large dataset staging)             │        │
 │  │   └── longhorn-backups     (off-cluster backup target)         │        │
 │  └────────────────────────────────────────────────────────────────┘        │
 │                          │                                                  │
 │                          ▼                                                  │
 │              ┌───────────────────────┐                                      │
 │              │   NFS CSI Driver      │                                      │
 │              │  (csi-driver-nfs)     │                                      │
 │              └───────────┬───────────┘                                      │
 │                          ▼                                                  │
 │     ┌──────────┐  ┌──────────┐  ┌──────────┐                               │
 │     │Ray Model │  │  MLflow  │  │ Training │                               │
 │     │  Cache   │  │ Artifact │  │   Data   │                               │
 │     │   PVC    │  │   PVC    │  │   PVC    │                               │
 │     └──────────┘  └──────────┘  └──────────┘                               │
 └────────────────────────────────────────────────────────────────────────────┘
 ```
 ## Tier 1: Longhorn Configuration
@@ -179,19 +212,79 @@ The naming is intentional - it sets correct expectations:
 - **Throughput:** Adequate for streaming media, not for databases
 - **Benefit:** Massive capacity without consuming cluster disk space
 ## Tier 3: NFS-Fast Configuration
 ### Helm Values (second csi-driver-nfs installation)
 A second HelmRelease (`csi-driver-nfs-fast`) references the same OCI chart but only creates the StorageClass — the CSI driver pods are already running from the nfs-slow installation.
 ```yaml
 controller:
  replicas: 0
 node:
  enabled: false
 storageClass:
  create: true
  name: nfs-fast
  parameters:
    server: gravenhollow.lab.daviestechlabs.io
    share: /mnt/gravenhollow/kubernetes
  mountOptions:
    - nfsvers=4.2        # Server-side copy, fallocate, seekhole
    - nconnect=16        # 16 TCP connections across bonded 10GbE
    - rsize=1048576      # 1 MB read block size
    - wsize=1048576      # 1 MB write block size
    - hard               # Retry indefinitely on timeout
    - noatime            # Skip access-time updates
    - nodiratime         # Skip directory access-time updates
    - nocto              # Disable close-to-open consistency (read-heavy workloads)
    - actimeo=600        # Cache attributes for 10 min
    - max_connect=16     # Up to 16 connections across the server's IPs (NFSv4.1+ session trunking)
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
 ```
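 A workload would then claim space from this class through an ordinary PVC. The manifest below is a minimal sketch, not taken from the repository: the claim name mirrors the `ray-model-cache/` directory in the diagram above, while the namespace and the requested size are placeholder assumptions.
 ```yaml
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
   name: ray-model-cache        # assumed name, mirrors the diagram's directory
   namespace: ai-ml             # hypothetical namespace
 spec:
   accessModes:
     - ReadWriteMany            # NFS allows several pods to mount the same volume
   storageClassName: nfs-fast   # the StorageClass created by the values above
   resources:
     requests:
       storage: 500Gi           # assumed size; NFS itself does not enforce it
 ```
 `ReadWriteMany` is chosen here because a shared model cache is read by many pods at once, which fits the write-once/read-many tuning of this class.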
 ### Performance Tuning Rationale
 | Option | Why |
 |--------|-----|
 | `nfsvers=4.2` | Enables server-side copy, hole punch, and fallocate — TrueNAS Scale supports NFSv4.2 natively |
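 | `nconnect=16` | Opens 16 parallel TCP connections per mount so I/O can spread across both 10GbE bond members (this needs a bond transmit hash policy that includes L4 ports, such as `layer3+4`; otherwise all connections between the same two hosts share one link) |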
 | `rsize/wsize=1048576` | 1 MB block sizes maximise throughput per operation — jumbo frames (MTU 9000) carry each 1 MB payload in fewer packets, reducing per-packet overhead |
 | `nocto` | Skips close-to-open consistency checks — safe because model weights and artifacts are write-once/read-many |
 | `actimeo=600` | Caches file and directory attributes for 10 minutes, reducing metadata round-trips for static content |
 | `nodiratime` | Skips directory access-time updates; on Linux `noatime` already implies it, so it is kept only for explicitness |
 ### Why "nfs-fast"?
 Gravenhollow addresses the performance gap between Longhorn (local) and candlekeep (HDD NAS):
 - **All-SSD:** No spinning disk latency — suitable for random I/O workloads like model loading
 - **Dual 10GbE:** 2× 10 Gbps network links via link aggregation
 - **12.2 TB capacity:** Enough for model cache, artifacts, and training data
 - **RustFS S3:** S3-compatible object storage endpoint for pipeline artifacts and backups
 - **Use case:** AI/ML model cache, MLflow artifacts, training data — workloads that need better than HDD but don't require local NVMe
 ### S3 Endpoint (RustFS)
 Gravenhollow also provides S3-compatible object storage via RustFS:
 - **Endpoint:** `http://gravenhollow.lab.daviestechlabs.io:30292`
 - **Use cases:** Kubeflow pipeline artifacts, Longhorn off-cluster backups, training dataset staging
 - **Credentials:** Managed via Vault ExternalSecret (`/kv/data/gravenhollow` → `access_key`, `secret_key`)
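 The credential wiring described above might look roughly like the following ExternalSecret. This is a sketch under assumptions: the store name `vault-backend`, the resource names, and the namespace are hypothetical, the `apiVersion` depends on the installed External Secrets Operator release, and the exact `remoteRef.key` depends on how the store's KV mount path is configured.
 ```yaml
 apiVersion: external-secrets.io/v1beta1   # newer operator releases use v1
 kind: ExternalSecret
 metadata:
   name: gravenhollow-s3          # hypothetical name
   namespace: kubeflow            # hypothetical namespace
 spec:
   refreshInterval: 1h
   secretStoreRef:
     kind: ClusterSecretStore
     name: vault-backend          # hypothetical store pointing at the Vault KV mount
   target:
     name: gravenhollow-s3        # Kubernetes Secret that will be created
   data:
     - secretKey: AWS_ACCESS_KEY_ID
       remoteRef:
         key: gravenhollow        # assumed to resolve to /kv/data/gravenhollow
         property: access_key
     - secretKey: AWS_SECRET_ACCESS_KEY
       remoteRef:
         key: gravenhollow
         property: secret_key
 ```
 Mapping the Vault fields to the conventional `AWS_*` key names lets most S3 clients pick them up as environment variables without extra configuration.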
 ## Storage Tier Selection Guide
 | Workload Type | Storage Class | Rationale |
 |---------------|---------------|-----------|
-| PostgreSQL (CNPG) | `longhorn` or `nfs-slow` | Depends on criticality |
+| PostgreSQL (CNPG) | `longhorn` | HA with replication, low latency |
 | Prometheus/ClickHouse | `longhorn` | High write IOPS required |
 | Vault | `longhorn` | Security-critical, needs HA |
 | AI/ML models (Ray) | `nfs-fast` | Large model weights, SSD speed |
 | MLflow artifacts | `nfs-fast` | Experiment tracking, frequent reads |
 | Training data | `nfs-fast` | Dataset staging for fine-tuning |
 | Media (Jellyfin, Kavita) | `nfs-slow` | Large files, sequential reads |
 | Photos (Immich) | `nfs-slow` | Bulk storage for photos |
 | User files (Nextcloud) | `nfs-slow` | Capacity over speed |
-| AI/ML models (Ray) | `nfs-slow` | Large model weights |
 | Build caches (Gitea runner) | `nfs-slow` | Ephemeral, large |
-| MLflow artifacts | `nfs-slow` | Model artifacts storage |
 ## Volume Usage by Tier
@@ -296,14 +389,15 @@ spec:
 ### When to Choose Each Tier
-| Requirement | Longhorn | NFS-Slow |
+| Requirement | Longhorn | NFS-Fast | NFS-Slow |
-|-------------|----------|----------|
+|-------------|----------|----------|----------|
-| Low latency | ✅ | ❌ |
+| Low latency | ✅ | ⚡ | ❌ |
-| High IOPS | ✅ | ❌ |
+| High IOPS | ✅ | ⚡ | ❌ |
-| Large capacity | ❌ | ✅ |
+| Large capacity | ❌ | ✅ (12.2 TB) | ✅✅ |
-| ReadWriteMany (RWX) | Limited | ✅ |
+| ReadWriteMany (RWX) | Limited | ✅ | ✅ |
-| Node failure survival | ✅ | ✅ (NAS HA) |
+| S3 compatible | ❌ | ✅ (RustFS) | ✅ (QuObjects) |
-| Kubernetes-native | ✅ | ✅ |
+| Node failure survival | ✅ | ✅ (NAS) | ✅ (NAS) |
 | Kubernetes-native | ✅ | ✅ | ✅ |
 ## Monitoring
@@ -320,11 +414,13 @@ spec:
 ## Future Enhancements
-1. **NAS high availability** - Second NAS with replication
+1. ~~**NAS high availability** - Second NAS with replication~~ ✅ Done — gravenhollow adds a second NAS
-2. **Dedicated storage network** - Separate VLAN for storage traffic
+2. **Dedicated storage network** - Separate VLAN for storage traffic (gravenhollow's dual 10GbE makes this more impactful)
 3. **NVMe-oF** - Network NVMe for lower latency
 4. **Tiered Longhorn** - Hot (NVMe) and warm (SSD) within Longhorn
-5. **S3 tier** - MinIO for object storage workloads
+5. ~~**S3 tier** - MinIO for object storage workloads~~ ✅ Done — gravenhollow RustFS provides S3
 6. **Migrate AI/ML PVCs to nfs-fast** - Move ray-model-cache and mlflow-artifacts from nfs-slow to nfs-fast
 7. **Longhorn backups to gravenhollow S3** - Use RustFS as off-cluster backup target
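 Item 7 could be wired up roughly as follows. Longhorn reads S3 credentials from a Secret in `longhorn-system`, and for non-AWS endpoints it also expects an `AWS_ENDPOINTS` key. This is a sketch only: the Secret name and the region in the backup target URL are placeholder assumptions, and the bucket name is the `longhorn-backups` bucket shown in the Tier 3 diagram.
 ```yaml
 apiVersion: v1
 kind: Secret
 metadata:
   name: longhorn-backup-s3       # hypothetical name
   namespace: longhorn-system
 type: Opaque
 stringData:
   AWS_ACCESS_KEY_ID: "<access_key from Vault>"
   AWS_SECRET_ACCESS_KEY: "<secret_key from Vault>"
   AWS_ENDPOINTS: "http://gravenhollow.lab.daviestechlabs.io:30292"
 # Longhorn settings that would reference it (region is a placeholder):
 #   backup target:                   s3://longhorn-backups@us-east-1/
 #   backup target credential secret: longhorn-backup-s3
 ```
 In practice the credential values would come from the Vault ExternalSecret rather than being written into a manifest.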
 ## References
--- a/decisions/0037-node-naming-conventions.md
+++ b/decisions/0037-node-naming-conventions.md
@@ -82,8 +82,8 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil
 | Node | Character/Location | Role | Notes |
 |------|-------------------|------|-------|
-| `candlekeep` | Candlekeep | Primary NAS (Synology) | Library fortress, knowledge storage |
+| `candlekeep` | Candlekeep | Primary NAS (QNAP) | Library fortress, knowledge storage |
-| `neverwinter` | Neverwinter | Fast NAS (TrueNAS Scale) | Jewel of the North, all-SSD, nfs-fast |
+| `gravenhollow` | Gravenhollow | Fast NAS (TrueNAS Scale) | Living memory of the Underdark, all-SSD, dual 10GbE, nfs-fast |
 | `waterdeep` | Waterdeep | Mac Mini dev workstation | City of Splendors, primary city |
 ### Future Expansion
@@ -139,11 +139,11 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil
 ┌───────────────────────────────────────────────────────────────────────────────┐
 │                    🏰 Locations (Off-Cluster Infrastructure)                   │
 │                                                                                │
-│  📚 candlekeep              ❄️ neverwinter              🏙️ waterdeep           │
+│  📚 candlekeep              🪨 gravenhollow              🏙️ waterdeep           │
-│  Synology NAS               TrueNAS Scale (SSD)         Mac Mini               │
+│  QNAP NAS                  TrueNAS Scale (SSD)          Mac Mini               │
-│  nfs-default                nfs-fast                    Dev workstation        │
+│  nfs-slow                   nfs-fast                     Dev workstation        │
-│  High capacity              High speed                  Primary dev box        │
+│  High capacity              High speed, 12.2TB           Primary dev box        │
-│  "Library Fortress"         "Jewel of the North"        "City of Splendors"    │
+│  "Library Fortress"         "Living Memory"              "City of Splendors"    │
 └───────────────────────────────────────────────────────────────────────────────┘
 ```
@@ -152,7 +152,7 @@ Fighters are the workhorses, handling general compute without magical (GPU) abil
 | Location | Storage Class | Speed | Capacity | Use Case |
 |----------|--------------|-------|----------|----------|
 | Candlekeep | `nfs-default` | HDD | High | Backups, archives, media |
-| Neverwinter | `nfs-fast` | SSD | Medium | Database WAL, hot data |
+| Gravenhollow | `nfs-fast` | SSD (12.2 TB) | Medium-High | Database WAL, hot data, model cache |
 | Longhorn | `longhorn` | Local SSD | Distributed | Replicated app data |
 ## Node Labels
@@ -182,6 +182,6 @@ All nodes are resolvable via:
 * [Khelben Arunsun](https://forgottenrealms.fandom.com/wiki/Khelben_Arunsun)
 * [Elminster](https://forgottenrealms.fandom.com/wiki/Elminster_Aumar)
 * [Candlekeep](https://forgottenrealms.fandom.com/wiki/Candlekeep)
-* [Neverwinter](https://forgottenrealms.fandom.com/wiki/Neverwinter)
+* [Gravenhollow](https://forgottenrealms.fandom.com/wiki/Gravenhollow)
 * Related: [ADR-0035](0035-arm64-worker-strategy.md) - ARM64 Worker Strategy
 * Related: [ADR-0011](0011-kuberay-unified-serving.md) - KubeRay Unified Serving
--- a/decisions/0059-mac-mini-ray-worker.md
+++ b/decisions/0059-mac-mini-ray-worker.md
@@ -59,7 +59,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
 * Network dependency — if waterdeep sleeps or disconnects, Ray tasks on it fail
 * MPS backend has limited operator coverage compared to CUDA/ROCm
 * Python environment must be maintained separately (not in a container image)
-* No Longhorn storage — model cache managed locally or via NFS mount
+* No Longhorn storage — model cache managed locally or via NFS mount from gravenhollow (nfs-fast)
 * Monitoring not automatically scraped by Prometheus (needs node-exporter or push gateway)
 ## Pros and Cons of the Options
@@ -125,7 +125,7 @@ Chosen option: **Option 1 — External Ray worker on macOS**, because Ray native
 │  │  └── Training: LoRA/QLoRA fine-tuning via Ray Train               │    │
 │  └──────────────────────────────────────────────────────────────────┘    │
 │                                                                          │
-│  Model cache: ~/Library/Caches/huggingface + NFS mount                   │
+│  Model cache: ~/Library/Caches/huggingface + NFS mount (gravenhollow)    │
 └──────────────────────────────────────────────────────────────────────────┘
 ```
@@ -233,15 +233,15 @@ launchctl load ~/Library/LaunchAgents/io.ray.worker.plist
 ### 5. Model Cache via NFS
-Mount the NAS model cache on waterdeep so models are shared with the cluster:
+Mount the gravenhollow NFS share on waterdeep so models are shared with the cluster via the fast all-SSD NAS:
 ```bash
-# Mount candlekeep NFS share
+# Mount gravenhollow NFS share (all-SSD, dual 10GbE)
-sudo mount -t nfs candlekeep.lab.daviestechlabs.io:/volume1/models \
+sudo mount -t nfs gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models \
    /Volumes/model-cache
 # Or add to /etc/fstab for persistence
-# candlekeep.lab.daviestechlabs.io:/volume1/models /Volumes/model-cache nfs rw 0 0
+# gravenhollow.lab.daviestechlabs.io:/mnt/gravenhollow/kubernetes/models /Volumes/model-cache nfs rw 0 0
 # Symlink to HuggingFace cache location
 ln -s /Volumes/model-cache ~/.cache/huggingface/hub
@@ -315,6 +315,7 @@ caffeinate -s ray start --address=... --block
 * Ray's GCS port (6379) will be exposed outside the cluster — restrict with firewall rules to waterdeep's IP only
 * The Ray worker has no RBAC — it executes whatever tasks the head assigns
 * Model weights on NFS are read-only from waterdeep (mount with `ro` option if possible)
 * NFS traffic to gravenhollow traverses the LAN — ensure dual 10GbE links are active
 * Consider Tailscale or WireGuard for encrypted transport if the Ray GCS traffic crosses untrusted network segments
 ## Future Considerations
--- a/diagrams/node-naming.mmd
+++ b/diagrams/node-naming.mmd
@@ -31,8 +31,8 @@ flowchart TB
    end
    subgraph Infrastructure["🏰 Locations (Off-Cluster Infrastructure)"]
-        Candlekeep["📚 candlekeep<br/>Synology NAS<br/>nfs-default<br/><i>Library Fortress</i>"]
+        Candlekeep["📚 candlekeep<br/>QNAP NAS<br/>nfs-slow<br/><i>Library Fortress</i>"]
-        Neverwinter["❄️ neverwinter<br/>TrueNAS Scale (SSD)<br/>nfs-fast<br/><i>Jewel of the North</i>"]
+        Gravenhollow["🪨 gravenhollow<br/>TrueNAS Scale (SSD)<br/>nfs-fast · 12.2 TB<br/><i>Living Memory</i>"]
        Waterdeep["🏙️ waterdeep<br/>Mac Mini<br/>Dev Workstation<br/><i>City of Splendors</i>"]
    end
@@ -44,7 +44,7 @@ flowchart TB
    end
    ControlPlane -.->|"etcd"| ControlPlane
-    Wizards -.->|"Fast Storage"| Neverwinter
+    Wizards -.->|"Fast Storage"| Gravenhollow
    Wizards -.->|"Backups"| Candlekeep
    Rogues -.->|"NFS Mounts"| Candlekeep
    Fighters -.->|"NFS Mounts"| Candlekeep
@@ -60,5 +60,5 @@ flowchart TB
    class Khelben,Elminster,Drizzt,Danilo,Regis wizard
    class Durnan,Elaith,Jarlaxle,Mirt,Volo rogue
    class Wulfgar fighter
-    class Candlekeep,Neverwinter,Waterdeep location
+    class Candlekeep,Gravenhollow,Waterdeep location
    class AI,Edge,Compute,Storage workload
--- a/diagrams/velero-backup.mmd
+++ b/diagrams/velero-backup.mmd
@@ -23,7 +23,7 @@ flowchart TB
            MinIO["MinIO<br/>On-premises S3"]
        end
        subgraph Secondary["Secondary: NFS"]
-            NAS["Synology NAS<br/>Long-term retention"]
+            NAS["QNAP NAS<br/>Long-term retention"]
        end
    end