Commit Graph

24 Commits

8adaef62a2 fix(strixhalo): remove apt-get layer that corrupts vendor hipcc
Some checks failed
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
The vendor image (rocm/pytorch:rocm7.0.2) ships all needed runtime
packages. Any apt-get install triggers ROCm repo dependency resolution
that upgrades vendor hipcc 1.1.1.70002 to Ubuntu's 5.7.1, whose
hipconfig.pl reports HIP version 0.0.0 → cmake can't find HIP.

Changes:
- Remove entire apt-get layer (git, ccache, runtime libs all pre-installed)
- Keep only ray user creation from that RUN block
- Add detailed comments explaining why apt-get must never be used

Combined with cmake<4 (downgrade from 4.0.0) and HIP_ROOT_DIR=/opt/rocm
from prior commits, this produces a successful build (attempt 5).

Verified: torch 2.9.1+rocm7.0.2 (vendor), vllm 0.15.2.dev0 (source-built)
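The surviving RUN block might look roughly like this (the user/group IDs follow the ray-user convention used elsewhere in this history; the exact commands are an assumption, not copied from the Dockerfile):

```dockerfile
# Deliberately no apt-get in this image: any apt-get install resolves
# dependencies against the ROCm repo and upgrades the vendor hipcc
# (1.1.1.70002) to Ubuntu's 5.7.1, whose hipconfig.pl reports HIP 0.0.0.
# git, ccache, and the runtime libs are already present in the base.
RUN useradd -m -u 1000 -g 100 -s /bin/bash ray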
2026-02-09 18:24:50 -05:00
2e3fbb8c60 feat(strixhalo): full source build of vLLM for gfx1151 (v1.0.20)
Some checks failed
Build and Push Images / determine-version (push) Successful in 7s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
- Build vLLM v0.15.1 from source against vendor torch 2.9.1
- Preserve AMD's vendor PyTorch from rocm/pytorch:rocm7.0.2 base
- Run use_existing_torch.py --prefix to strip torch from build requirements
- Compile C++/HIP extensions for gfx1100 (mapped from gfx1151)
- Install triton/flash-attn from wheels.vllm.ai/rocm with --no-deps
- Add torch vendor verification step to catch accidental overwrites
- Fix GPU_RESOURCE default to match cluster (gpu_strixhalo)
- Remove unsupported expandable_segments from PYTORCH_ALLOC_CONF
- AITER is gfx9-only; gfx11 uses TRITON_ATTN backend by default
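The bullets above can be sketched as a Dockerfile fragment like the following. This is a rough illustration only: the base image, vLLM tag, and wheel index come from this commit log, but the clone URL, flag spellings, and layer layout are assumptions (the commit also mentions a `--prefix` argument to use_existing_torch.py that is not reproduced here).

```dockerfile
FROM rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.9.1

# Compile the C++/HIP extensions for gfx1100 (gfx1151 maps onto it).
ENV PYTORCH_ROCM_ARCH=gfx1100

RUN git clone --branch v0.15.1 https://github.com/vllm-project/vllm.git /opt/vllm && \
    cd /opt/vllm && \
    # Strip torch from build requirements so the vendor wheel survives.
    python use_existing_torch.py && \
    pip install --no-build-isolation .

# Pre-built ROCm wheels; --no-deps so they cannot drag in a new torch.
RUN pip install --no-deps --index-url https://wheels.vllm.ai/rocm triton flash-attn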
2026-02-09 15:46:25 -05:00
ab2a7f486e fix(strixhalo): switch base to ROCm 7.0.2 to fix libhsa segfault
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
ROCm 7.1 system libraries (libhsa-runtime64.so.1.18.70100) are ABI-
incompatible with the torch/vLLM ROCm 7.0 wheels from wheels.vllm.ai.
This caused SIGSEGV at 0x34 in libhsa-runtime64 on every GPU operation.

Switch to rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
which provides matching ROCm 7.0.2 system libraries while keeping
Ubuntu 24.04 (glibc 2.38) and Python 3.12.
2026-02-09 14:37:05 -05:00
3a33ed387f fixing strixhalo builds.
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 12:49:39 -05:00
65de596212 big refactor.
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 12:17:12 -05:00
a20a5d2ccd mo fixes.
Some checks failed
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 49s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 1m25s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
2026-02-09 11:46:10 -05:00
b0c58b98a0 fix
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
2026-02-09 11:31:18 -05:00
5f2d167ba0 fixing build problem.
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 11:12:34 -05:00
fcc9781d42 different rocm
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
2026-02-09 11:08:33 -05:00
c9cf143821 more fixes.
Some checks failed
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 40s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 43s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
2026-02-09 10:56:51 -05:00
2c38cce20c fix.
Some checks failed
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
2026-02-09 10:43:56 -05:00
2e3e014b80 fixing nvidia and strixhalo
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
2026-02-09 10:24:32 -05:00
6aad7ad38a fix: update to python 3.12.
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 21s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 19s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 23s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-09 08:52:32 -05:00
64585dac7e fixing numpy pin.
Some checks failed
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 21s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 24s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 34s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
2026-02-08 21:39:11 -05:00
e7642b86dd feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU
Some checks failed
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 28s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 26s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
On Strix Halo, PyTorch reports GTT pool (128 GiB) as device memory
instead of real VRAM (96 GiB from BIOS). vLLM uses mem_get_info() to
pre-allocate and refuses to start when free GTT (29 GiB) < requested.

The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python
startup to read real VRAM total/used from /sys/class/drm sysfs.
Only activates when PyTorch total differs >10% from sysfs VRAM.
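The sysfs-reading side of such a hook can be sketched as follows. The file names are the standard amdgpu sysfs nodes; the function names and the exact shape of the activation check are illustrative, not lifted from strixhalo_vram_fix.pth.

```python
import os

def sysfs_vram(card_dir="/sys/class/drm/card0/device"):
    """Read real VRAM (free, total) in bytes from amdgpu sysfs."""
    def read_int(name):
        with open(os.path.join(card_dir, name)) as f:
            return int(f.read().strip())
    total = read_int("mem_info_vram_total")
    used = read_int("mem_info_vram_used")
    return total - used, total

def should_patch(torch_total, sysfs_total, tolerance=0.10):
    """Patch only when PyTorch's reported total differs by more than
    10% from sysfs VRAM, i.e. when the GTT pool is being misreported
    as device memory."""
    return abs(torch_total - sysfs_total) > tolerance * sysfs_total

# In the real .pth hook, torch.cuda.mem_get_info would be wrapped to
# return sysfs_vram() whenever should_patch(...) is true.
```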
2026-02-06 16:29:46 -05:00
300582a520 feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement
Some checks failed
Build and Push Images / determine-version (push) Successful in 58s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against
glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04
(glibc 2.35). This caused vLLM to fail ROCm platform detection with
'No module named amdsmi' / GLIBC_2.38 not found errors.

Solution: Pure-Python amdsmi shim that reads GPU info from sysfs
(/sys/class/drm/*) instead of the native library. Provides the full
API surface used by both vLLM (platform detection, device info) and
PyTorch (device counting, memory/power/temp monitoring).

Tested in-container: vLLM detects RocmPlatform, PyTorch sees GPU
(Radeon 8060S, 128GB, HIP 7.3), DeviceConfig resolves to 'cuda'.

Changes:
- Add amdsmi-shim/ package with sysfs-backed implementation
- Update Dockerfile to install shim after vLLM/torch
- Add amdsmi-shim/ to .dockerignore explicit includes
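The core of a sysfs-backed shim like this is GPU enumeration. A minimal sketch — the function name mimics the real amdsmi API, but the body is illustrative, not the shim's actual implementation:

```python
import glob
import os

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID for AMD GPUs

def amdsmi_get_processor_handles(drm_root="/sys/class/drm"):
    """Enumerate AMD GPUs by scanning DRM card nodes. Pure Python over
    sysfs, so no native libamd_smi.so and no glibc 2.38 requirement."""
    handles = []
    for card in sorted(glob.glob(os.path.join(drm_root, "card[0-9]*"))):
        if "-" in os.path.basename(card):
            continue  # skip connector nodes such as card0-DP-1
        try:
            with open(os.path.join(card, "device", "vendor")) as f:
                vendor = f.read().strip()
        except OSError:
            continue
        if vendor == AMD_VENDOR_ID:
            handles.append(card)
    return handles
```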
2026-02-06 08:28:07 -05:00
5f1873908f overhaul image builds.
Some checks failed
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build-nvidia (push) Failing after 21s
Build and Push Images / build-rdna2 (push) Failing after 21s
Build and Push Images / build-strixhalo (push) Failing after 12s
Build and Push Images / build-intel (push) Failing after 19s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-06 07:47:37 -05:00
38784f3a04 fix: use correct UID:GID 1000:100 for ray user
Some checks failed
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Ray official images use uid=1000(ray) gid=100(users).
Using numeric IDs for podman compatibility.
2026-02-05 17:32:27 -05:00
5768af76bf fix: use fully-qualified image names for podman compatibility
Some checks failed
Build and Push Images / determine-version (push) Successful in 27s
Build and Push Images / build-nvidia (push) Has started running
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Podman requires docker.io/ prefix for Docker Hub images when
unqualified-search registries are not configured.
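For example (the tag is illustrative, not the one used in these Dockerfiles):

```dockerfile
# Unqualified: docker assumes Docker Hub, but podman refuses to guess
# unless unqualified-search registries are configured.
# FROM rayproject/ray:2.40.0-py312

# Fully qualified: resolves identically under docker and podman.
FROM docker.io/rayproject/ray:2.40.0-py312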
2026-02-05 17:25:17 -05:00
40c544ba0a fix: remove COPY ray-serve/ - now installed from PyPI
Some checks failed
Build and Push Images / build-nvidia (push) Failing after 13s
Build and Push Images / build-strixhalo (push) Failing after 1m56s
Build and Push Images / build-rdna2 (push) Failing after 2m8s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
ray-serve-apps package is now installed from Gitea PyPI registry
at runtime by the RayService configuration, not bundled in image.
2026-02-03 22:23:05 -05:00
cb7dad96c1 fix: PATH variable expansion in ROCm worker Dockerfiles
Some checks failed
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Split ENV ROCM_HOME and ENV PATH into separate commands to fix variable
expansion issue. When ROCM_HOME and PATH were in the same ENV line,
${ROCM_HOME} expanded to empty string since it wasn't defined yet.

This was causing 'ray: command not found' in init containers.
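Docker resolves variables in an ENV instruction against the environment as it existed before that instruction, which is what the split fixes. A minimal illustration (paths are the ROCm defaults, not copied from the actual Dockerfiles):

```dockerfile
# Broken: ${ROCM_HOME} is not yet defined when this single ENV line
# is evaluated, so PATH gets an empty prefix.
# ENV ROCM_HOME=/opt/rocm PATH=${ROCM_HOME}/bin:${PATH}

# Fixed: define ROCM_HOME first, then reference it in a later ENV.
ENV ROCM_HOME=/opt/rocm
ENV PATH=${ROCM_HOME}/bin:${PATH}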
2026-02-03 21:07:00 -05:00
3c788fe2b6 fix(strixhalo): upgrade pandas for numpy 2.x compatibility
Some checks failed
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Ray base image has pandas 1.5.3 compiled against numpy 1.x, but TheRock
PyTorch ROCm wheels require numpy 2.x. This causes:
  ValueError: numpy.dtype size changed, may indicate binary incompatibility

Fix by installing pandas 2.x which is compatible with numpy 2.x.
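The compatibility rule being applied can be sketched as a small stdlib-only helper (hypothetical, not part of the image; it encodes the commit's claim that pandas 2.x works under numpy 2.x while pandas 1.x wheels do not):

```python
def pandas_numpy_compatible(pandas_version: str, numpy_version: str) -> bool:
    """pandas wheels built against numpy 1.x break under numpy 2.x
    ('numpy.dtype size changed'); pandas 2.x handles either."""
    pd_major = int(pandas_version.split(".")[0])
    np_major = int(numpy_version.split(".")[0])
    return pd_major >= 2 or np_major < 2
```

The failing pairing above — pandas 1.5.3 under a numpy 2.x wheel — is exactly the case this rejects.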
2026-02-02 13:25:28 -05:00
cb80709d3d build: optimize Dockerfiles for production
Some checks failed
Build and Push Images / build-rdna2 (push) Failing after 4m3s
Build and Push Images / build-nvidia (push) Failing after 4m6s
Build and Push Images / build-strixhalo (push) Failing after 18s
Build and Push Images / build-intel (push) Failing after 21s
- Use BuildKit syntax 1.7 with cache mounts for apt/uv
- Switch from pip to uv for 10-100x faster installs (ADR-0014)
- Add OCI Image Spec labels for container metadata
- Add HEALTHCHECK directives for orchestration
- Add .dockerignore to reduce context size
- Update Makefile with buildx and lint target
- Add retry logic to ray-entrypoint.sh

Refs: ADR-0012 (uv), ADR-0014 (Docker best practices)
2026-02-02 07:26:27 -05:00
a16ffff73f feat: Add GPU-specific Ray worker images with CI/CD
Some checks failed
Build and Push Images / build-nvidia (push) Failing after 1s
Build and Push Images / build-rdna2 (push) Failing after 1s
Build and Push Images / build-strixhalo (push) Failing after 1s
Build and Push Images / build-intel (push) Failing after 1s
- Add Dockerfiles for nvidia, rdna2, strixhalo, and intel GPU targets
- Add ray-serve modules (embeddings, whisper, tts, llm, reranker)
- Add Gitea Actions workflow for automated builds
- Add Makefile for local development
- Update README with comprehensive documentation
2026-02-01 15:04:31 -05:00