Commit Graph

7 Commits

8902f6e616 v1.0.26: dynamic VRAM via GTT for 32GB carve-out
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
strixhalo_vram_fix.py: compute effective VRAM as
min(GTT_pool, physical_RAM) - 4GB OS reserve instead of
raw sysfs VRAM. Prevents OOM when carve-out < model size
and prevents kernel OOM when GTT > physical RAM.
2026-02-10 18:38:09 -05:00
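The effective-VRAM rule described in this commit can be sketched as follows. This is an illustrative reconstruction, not the actual `strixhalo_vram_fix.py` code; the function and parameter names are assumptions, while the `min(GTT_pool, physical_RAM) - 4GB` formula comes from the commit message.

```python
GIB = 1024 ** 3
OS_RESERVE = 4 * GIB  # headroom left for the kernel and OS, per the commit


def effective_vram(gtt_pool_bytes: int, physical_ram_bytes: int) -> int:
    """Clamp the GTT pool to physical RAM, then subtract the OS reserve.

    Using raw sysfs VRAM would OOM when the BIOS carve-out is smaller
    than the model; using raw GTT would let allocations exceed physical
    RAM and trigger a kernel OOM. Clamping to min() avoids both.
    """
    return min(gtt_pool_bytes, physical_ram_bytes) - OS_RESERVE
```

For a 32 GiB carve-out on a 128 GiB machine this yields 28 GiB usable; an oversized 200 GiB GTT pool on the same machine is clamped to 124 GiB.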
a20a5d2ccd mo fixes.
Some checks failed
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 49s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 1m25s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
2026-02-09 11:46:10 -05:00
f297deca9d fixing vllm.
Some checks failed
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 39s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 42s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 20s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 23s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-08 16:53:16 -05:00
9042460736 fix(strixhalo): add re-entry guard to prevent offload-arch fork bomb
Some checks failed
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 27s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 25s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
torch init calls offload-arch (a Python script) which re-enters the
.pth hook, triggering another import torch, creating an infinite fork
storm (1000+ processes). Set _STRIXHALO_VRAM_FIX_ACTIVE env var before
importing torch so child processes skip the patch.
2026-02-07 08:47:06 -05:00
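The re-entry guard amounts to setting a marker environment variable before the expensive import, so any child process (such as `offload-arch`) that re-runs the `.pth` hook bails out immediately. `_STRIXHALO_VRAM_FIX_ACTIVE` is the variable named in the commit; the rest of this sketch (function name, return values) is illustrative.

```python
import os


def run_vram_fix_once() -> bool:
    """Apply the VRAM patch at most once per process tree.

    Children inherit the parent's environment, so a subprocess spawned
    during `import torch` sees the marker and skips the hook, breaking
    the torch -> offload-arch -> .pth hook -> torch fork loop.
    """
    if os.environ.get("_STRIXHALO_VRAM_FIX_ACTIVE"):
        return False  # child (or repeat) invocation: skip the patch
    os.environ["_STRIXHALO_VRAM_FIX_ACTIVE"] = "1"
    # ... import torch and apply the mem_get_info patch here ...
    return True
```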
d1b6d78c66 fix(strixhalo): skip VRAM patch in low-memory init containers
Some checks failed
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 24s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 27s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 24s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
KubeRay's auto-injected wait-gcs-ready init container has only a 256Mi
memory limit. The .pth hook was unconditionally importing torch+ROCm,
which needs more than 256Mi, causing an OOMKill.

The hook now checks the cgroup memory limit first — if it is under
512Mi, the expensive torch import is skipped entirely. The VRAM patch
is only needed by the main Ray worker process, not by health-check
init containers.
2026-02-06 19:15:49 -05:00
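The low-memory guard can be sketched as reading the container's cgroup memory limit and skipping the patch below a 512 MiB threshold (the threshold and the 256Mi init-container limit come from the commit message; the function names and the exact fallback behavior here are assumptions).

```python
MIB = 1024 ** 2

# cgroup v2 exposes the limit at memory.max ("max" means unlimited);
# cgroup v1 uses memory/memory.limit_in_bytes.
CGROUP_LIMIT_PATHS = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def cgroup_memory_limit(paths=CGROUP_LIMIT_PATHS):
    """Return the container memory limit in bytes, or None if unlimited/unknown."""
    for path in paths:
        try:
            raw = open(path).read().strip()
        except OSError:
            continue
        if raw != "max":
            return int(raw)
    return None


def should_apply_vram_patch(limit_bytes) -> bool:
    # Init containers (e.g. wait-gcs-ready at 256 MiB) fall under the
    # 512 MiB threshold and must not import torch+ROCm.
    return limit_bytes is None or limit_bytes >= 512 * MIB
```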
e7642b86dd feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 28s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 26s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
On Strix Halo, PyTorch reports GTT pool (128 GiB) as device memory
instead of real VRAM (96 GiB from BIOS). vLLM uses mem_get_info() to
pre-allocate and refuses to start when free GTT (29 GiB) < requested.

The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python
startup to read real VRAM total/used from /sys/class/drm sysfs.
Only activates when PyTorch total differs >10% from sysfs VRAM.
2026-02-06 16:29:46 -05:00
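The override described here can be sketched as reading real VRAM figures from amdgpu's sysfs counters (`mem_info_vram_total` / `mem_info_vram_used`) and only activating when PyTorch's total diverges from sysfs by more than 10%. The `read` injection point and function names are illustrative, not the actual `.pth` hook code; the return shape matches `torch.cuda.mem_get_info`'s `(free, total)` tuple.

```python
GIB = 1024 ** 3


def sysfs_vram(card="card0", read=lambda p: int(open(p).read())):
    """Return (free, total) VRAM bytes from amdgpu sysfs counters."""
    base = f"/sys/class/drm/{card}/device"
    total = read(f"{base}/mem_info_vram_total")
    used = read(f"{base}/mem_info_vram_used")
    return total - used, total


def needs_patch(torch_total: int, sysfs_total: int, threshold=0.10) -> bool:
    # Only activate when PyTorch's reported total (e.g. the 128 GiB GTT
    # pool) differs >10% from the sysfs VRAM figure (e.g. 96 GiB).
    return abs(torch_total - sysfs_total) / sysfs_total > threshold
```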
300582a520 feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement
Some checks failed
Build and Push Images / determine-version (push) Successful in 58s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against
glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04
(glibc 2.35). This caused vLLM to fail ROCm platform detection with
'No module named amdsmi' / GLIBC_2.38 not found errors.

Solution: Pure-Python amdsmi shim that reads GPU info from sysfs
(/sys/class/drm/*) instead of the native library. Provides the full
API surface used by both vLLM (platform detection, device info) and
PyTorch (device counting, memory/power/temp monitoring).

Tested in-container: vLLM detects RocmPlatform, PyTorch sees GPU
(Radeon 8060S, 128GB, HIP 7.3), DeviceConfig resolves to 'cuda'.

Changes:
- Add amdsmi-shim/ package with sysfs-backed implementation
- Update Dockerfile to install shim after vLLM/torch
- Add amdsmi-shim/ to .dockerignore explicit includes
2026-02-06 08:28:07 -05:00