kuberay-images/amdsmi-shim/strixhalo_vram_fix.py
Billy D. d1b6d78c66
fix(strixhalo): skip VRAM patch in low-memory init containers
KubeRay's auto-injected wait-gcs-ready init container has only a 256Mi
memory limit. The .pth hook unconditionally imported torch+ROCm, which
needs more than 256Mi, so the container was OOMKilled.

The hook now checks the cgroup memory limit first: if it is under 512Mi,
it skips the expensive torch import entirely. The VRAM patch is only
needed by the main Ray worker process, not by health-check init
containers.
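
The guard described above could look roughly like the following. This is a minimal sketch, not the contents of strixhalo_vram_fix.py: the function names, the `apply_patch` callback, and the choice to treat an unknown or unlimited cgroup limit as "apply the patch" are all assumptions made for illustration. The two paths are the usual cgroup v2 and v1 locations for the memory limit; the 512Mi threshold comes from the commit message.

```python
# Hypothetical sketch of a cgroup memory guard; names are illustrative only.

# Threshold from the commit message: skip the patch below 512Mi.
MIN_MEMORY_BYTES = 512 * 1024 * 1024

# Usual locations of the memory limit inside a container (assumed here).
CGROUP_LIMIT_PATHS = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_limit(raw):
    """Return the limit in bytes, or None if unlimited or unparseable."""
    text = raw.strip()
    if text == "max":  # cgroup v2 spelling of "no limit"
        return None
    try:
        return int(text)
    except ValueError:
        return None


def read_cgroup_limit(paths=CGROUP_LIMIT_PATHS):
    """Read the first readable limit file; None if none can be read."""
    for path in paths:
        try:
            with open(path) as fh:
                return parse_limit(fh.read())
        except OSError:
            continue
    return None


def should_apply_patch(limit_bytes, minimum=MIN_MEMORY_BYTES):
    """Apply the patch unless a known limit is below the threshold."""
    # Assumption: an unknown or unlimited cgroup means a normal worker.
    return limit_bytes is None or limit_bytes >= minimum


def maybe_apply(apply_patch, limit_bytes):
    """Call apply_patch() only when the memory limit allows it."""
    if not should_apply_patch(limit_bytes):
        return False  # low-memory container: never touch torch
    apply_patch()  # the expensive torch import would live inside here
    return True
```

Keeping the torch import inside the callback means a 256Mi init container never pays for it, for example `maybe_apply(patch_fn, read_cgroup_limit())` where `patch_fn` is a placeholder for the real patch routine.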
2026-02-06 19:15:49 -05:00

4.1 KiB