KubeRay's auto-injected wait-gcs-ready init container has only a 256Mi
memory limit. The .pth hook was unconditionally importing torch+ROCm,
which needs more than 256Mi, so the init container was OOMKilled.
The hook now checks the cgroup memory limit first: if it is under 512Mi,
the expensive torch import is skipped entirely. The VRAM patch is only
needed by the main Ray worker process, not by health-check init
containers.
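A minimal sketch of the limit check described above, not the hook's
actual code. The function names are hypothetical, and the two paths are
the standard cgroup v2 and v1 locations, which the hook is assumed to
read; only the 512Mi threshold comes from this message.

```python
from pathlib import Path
from typing import Iterable, Optional

MIB = 1024 * 1024
# 512Mi threshold, as stated in the description above.
SKIP_BELOW_BYTES = 512 * MIB

# Assumption: standard cgroup v2 and v1 limit files, tried in this order.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Return the limit in bytes, or None for "max" or unparseable text."""
    value = text.strip()
    if not value.isdigit():  # cgroup v2 writes "max" when unlimited
        return None
    return int(value)


def read_cgroup_limit(paths: Iterable[str] = CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the first numeric limit found, or None if none is readable."""
    for path in paths:
        try:
            limit = parse_cgroup_limit(Path(path).read_text())
        except OSError:
            continue  # file missing or unreadable on this cgroup version
        if limit is not None:
            return limit
    return None


def should_skip_torch_import(limit_bytes: Optional[int],
                             threshold: int = SKIP_BELOW_BYTES) -> bool:
    """Skip the import only when a limit is known and below the threshold."""
    return limit_bytes is not None and limit_bytes < threshold
```

With a 256Mi limit, should_skip_torch_import() returns True and the
hook would return before touching torch. An unknown or unlimited cgroup
returns False, so the main worker still gets the patch.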

On Strix Halo, PyTorch reports the GTT pool (128 GiB) as device memory
instead of the real VRAM (96 GiB, set in BIOS). vLLM uses mem_get_info()
to size its pre-allocation and refuses to start when free GTT (29 GiB)
is less than the requested amount.
The strixhalo_vram_fix.pth hook patches mem_get_info at Python startup
to read the real VRAM total/used from sysfs under /sys/class/drm. It
only activates when the PyTorch total differs by more than 10% from the
sysfs VRAM total.
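The sysfs read and the 10% activation check could look roughly like
this. It is a sketch under assumptions, not the hook's code: the helper
names are hypothetical, the device directory (for example
/sys/class/drm/card0/device) is a guess at which card is used, and the
difference is assumed to be measured relative to the sysfs total. The
mem_info_vram_total and mem_info_vram_used files are the byte counters
exposed by the amdgpu driver.

```python
from pathlib import Path
from typing import Callable, Tuple

# 10% tolerance, as stated in the description above.
TOLERANCE = 0.10


def read_sysfs_vram(device_dir: str) -> Tuple[int, int]:
    """Return (total, used) VRAM in bytes from amdgpu sysfs counters."""
    base = Path(device_dir)
    total = int((base / "mem_info_vram_total").read_text().strip())
    used = int((base / "mem_info_vram_used").read_text().strip())
    return total, used


def needs_patch(torch_total: int, sysfs_total: int,
                tolerance: float = TOLERANCE) -> bool:
    """True when PyTorch's total is off from sysfs by more than tolerance.

    Assumption: the difference is taken relative to the sysfs total.
    """
    if sysfs_total <= 0:
        return False
    return abs(torch_total - sysfs_total) / sysfs_total > tolerance


def make_mem_get_info(device_dir: str) -> Callable[..., Tuple[int, int]]:
    """Build a replacement with the (free, total) return shape of
    torch.cuda.mem_get_info, backed by sysfs instead of the GTT pool."""
    def mem_get_info(device=None) -> Tuple[int, int]:
        total, used = read_sysfs_vram(device_dir)
        return max(total - used, 0), total
    return mem_get_info
```

With the numbers above, needs_patch() compares 128 GiB against 96 GiB,
a difference of about 33%, so the patch would activate. Installing it
would then be a single assignment such as
torch.cuda.mem_get_info = make_mem_get_info(device_dir), done only
after the cgroup check has allowed the torch import.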