kuberay-images

Author	SHA1	Message	Date
Billy D.	d1b6d78c66	fix(strixhalo): skip VRAM patch in low-memory init containers. KubeRay's auto-injected wait-gcs-ready init container has only a 256Mi memory limit. The .pth hook was unconditionally importing torch+ROCm, which requires >256Mi, causing an OOMKill. The hook now checks the cgroup memory limit first: if it is under 512Mi, it skips the expensive torch import entirely. The VRAM patch is only needed by the main Ray worker process, not by health-check init containers.	2026-02-06 19:15:49 -05:00
Billy D.	e7642b86dd	feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU. On Strix Halo, PyTorch reports GTT pool (128 GiB) as device memory instead of real VRAM (96 GiB from BIOS). vLLM uses mem_get_info() to pre-allocate and refuses to start when free GTT (29 GiB) < requested. The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python startup to read real VRAM total/used from /sys/class/drm sysfs. Only activates when PyTorch total differs >10% from sysfs VRAM.	2026-02-06 16:29:46 -05:00
Billy D.	300582a520	feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement. The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04 (glibc 2.35). This caused vLLM to fail ROCm platform detection with 'No module named amdsmi' / GLIBC_2.38 not found errors. Solution: Pure-Python amdsmi shim that reads GPU info from sysfs (/sys/class/drm/*) instead of the native library. Provides the full API surface used by both vLLM (platform detection, device info) and PyTorch (device counting, memory/power/temp monitoring). Tested in-container: vLLM detects RocmPlatform, PyTorch sees GPU (Radeon 8060S, 128GB, HIP 7.3), DeviceConfig resolves to 'cuda'. Changes: - Add amdsmi-shim/ package with sysfs-backed implementation - Update Dockerfile to install shim after vLLM/torch - Add amdsmi-shim/ to .dockerignore explicit includes	2026-02-06 08:28:07 -05:00
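
The low-memory guard described in d1b6d78c66 can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names, the cgroup file paths (the standard cgroup v2 and v1 locations), and the handling of an unreadable limit are all assumptions. Only the 512Mi threshold comes from the commit message.

```python
from pathlib import Path
from typing import Optional

# Threshold named in the commit message: below this, skip the torch import.
MIN_BYTES_FOR_TORCH = 512 * 1024 * 1024

# Assumed locations: standard cgroup v2 file first, then the cgroup v1 file.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when unlimited or unparseable."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling of "no limit"
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def read_cgroup_limit(paths=CGROUP_LIMIT_FILES) -> Optional[int]:
    """Read the first readable limit file; None means no known limit."""
    for path in paths:
        try:
            return parse_limit(Path(path).read_text())
        except OSError:
            continue
    return None


def should_apply_patch(limit: Optional[int],
                       threshold: int = MIN_BYTES_FOR_TORCH) -> bool:
    """Allow the heavy import only when the container has enough memory."""
    return limit is None or limit >= threshold


if __name__ == "__main__":
    if should_apply_patch(read_cgroup_limit()):
        print("enough memory: would import torch and apply the VRAM patch")
    else:
        print("low-memory container: skipping the VRAM patch")
```

Treating an unknown limit as "no limit" keeps the patch active on hosts without cgroup files; a more conservative hook could choose to skip in that case instead.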

3 Commits
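
For the sysfs-based VRAM reading described in e7642b86dd, a rough sketch under stated assumptions: the file names `mem_info_vram_total` and `mem_info_vram_used` are the counters the amdgpu driver exposes under a card's `device/` directory, but the commit only says "/sys/class/drm sysfs", so the exact files the hook reads are assumed here. The helper names are hypothetical, and measuring the 10% difference relative to the sysfs total is one possible reading of the commit message.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_sysfs_vram(device_dir: str) -> Optional[Tuple[int, int]]:
    """Return (free, total) in bytes from amdgpu sysfs counters, or None.

    device_dir is assumed to look like /sys/class/drm/card0/device.
    """
    base = Path(device_dir)
    try:
        total = int((base / "mem_info_vram_total").read_text())
        used = int((base / "mem_info_vram_used").read_text())
    except (OSError, ValueError):
        return None
    return max(total - used, 0), total


def differs_by_more_than(reported_total: int, sysfs_total: int,
                         tolerance: float = 0.10) -> bool:
    """True when the reported total is off from sysfs by more than tolerance."""
    if sysfs_total <= 0:
        return False
    return abs(reported_total - sysfs_total) / sysfs_total > tolerance


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Figures from the commit message: 128 GiB reported vs 96 GiB real VRAM.
    print(differs_by_more_than(128 * GIB, 96 * GIB))
    print(read_sysfs_vram("/sys/class/drm/card0/device"))
```

A hook built on these helpers would presumably wrap `torch.cuda.mem_get_info` and return the sysfs `(free, total)` pair only when `differs_by_more_than` is true, leaving the original function untouched otherwise.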