feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 28s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 26s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
On Strix Halo, PyTorch reports GTT pool (128 GiB) as device memory instead of real VRAM (96 GiB from BIOS). vLLM uses mem_get_info() to pre-allocate and refuses to start when free GTT (29 GiB) < requested. The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python startup to read real VRAM total/used from /sys/class/drm sysfs. Only activates when PyTorch total differs >10% from sysfs VRAM.
@@ -93,6 +93,16 @@ COPY --chown=1000:100 amdsmi-shim /tmp/amdsmi-shim
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system /tmp/amdsmi-shim && rm -rf /tmp/amdsmi-shim

# FIX: Patch torch.cuda.mem_get_info for unified memory APUs.
# On Strix Halo, PyTorch reports GTT (128 GiB) instead of real VRAM (96 GiB)
# from sysfs. vLLM uses mem_get_info to pre-allocate, so wrong numbers cause
# OOM or "insufficient GPU memory" at startup. The .pth file auto-patches
# mem_get_info on Python startup to return sysfs VRAM values.
COPY --chown=1000:100 amdsmi-shim/strixhalo_vram_fix.py \
    /home/ray/anaconda3/lib/python3.11/site-packages/strixhalo_vram_fix.py
RUN echo "import strixhalo_vram_fix" > \
    /home/ray/anaconda3/lib/python3.11/site-packages/strixhalo_vram_fix.pth

# Pre-download common models for faster cold starts (optional, increases image size)
# RUN python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-large-en-v1.5')"