kuberay-images

Author	SHA1	Message	Date
Billy D.	e7642b86dd	feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU. On Strix Halo, PyTorch reports the GTT pool (128 GiB) as device memory instead of the real VRAM (96 GiB, set in BIOS). vLLM uses mem_get_info() to pre-allocate memory and refuses to start when free GTT (29 GiB) is less than the requested amount. The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python startup to read the real VRAM total/used from /sys/class/drm sysfs. It only activates when the PyTorch total differs by more than 10% from the sysfs VRAM total.	2026-02-06 16:29:46 -05:00
Billy D.	300582a520	feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement. The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04 (glibc 2.35). This caused vLLM to fail ROCm platform detection with 'No module named amdsmi' / GLIBC_2.38 not found errors. Solution: a pure-Python amdsmi shim that reads GPU info from sysfs (/sys/class/drm/*) instead of the native library. It provides the full API surface used by both vLLM (platform detection, device info) and PyTorch (device counting, memory/power/temp monitoring). Tested in-container: vLLM detects RocmPlatform, PyTorch sees the GPU (Radeon 8060S, 128GB, HIP 7.3), and DeviceConfig resolves to 'cuda'. Changes: add amdsmi-shim/ package with sysfs-backed implementation; update Dockerfile to install the shim after vLLM/torch; add amdsmi-shim/ to the .dockerignore explicit includes.	2026-02-06 08:28:07 -05:00
Billy D.	5f1873908f	overhaul image builds.	2026-02-06 07:47:37 -05:00
Billy D.	38784f3a04	fix: use correct UID:GID 1000:100 for ray user. Ray official images use uid=1000(ray) gid=100(users). Using numeric IDs for podman compatibility.	2026-02-05 17:32:27 -05:00
Billy D.	5768af76bf	fix: use fully-qualified image names for podman compatibility. Podman requires the docker.io/ prefix for Docker Hub images when unqualified-search registries are not configured.	2026-02-05 17:25:17 -05:00
Billy D.	40c544ba0a	fix: remove COPY ray-serve/ - now installed from PyPI. The ray-serve-apps package is now installed from the Gitea PyPI registry at runtime by the RayService configuration, not bundled in the image.	2026-02-03 22:23:05 -05:00
Billy D.	cb7dad96c1	fix: PATH variable expansion in ROCm worker Dockerfiles. Split ENV ROCM_HOME and ENV PATH into separate instructions to fix a variable expansion issue. When ROCM_HOME and PATH were set in the same ENV line, ${ROCM_HOME} expanded to an empty string because it was not defined yet. This was causing 'ray: command not found' in init containers.	2026-02-03 21:07:00 -05:00
Billy D.	3c788fe2b6	fix(strixhalo): upgrade pandas for numpy 2.x compatibility. The Ray base image has pandas 1.5.3 compiled against numpy 1.x, but TheRock PyTorch ROCm wheels require numpy 2.x. This causes "ValueError: numpy.dtype size changed, may indicate binary incompatibility". Fixed by installing pandas 2.x, which is compatible with numpy 2.x.	2026-02-02 13:25:28 -05:00
Billy D.	cb80709d3d	build: optimize Dockerfiles for production. Changes: use BuildKit syntax 1.7 with cache mounts for apt/uv; switch from pip to uv for 10-100x faster installs (ADR-0012); add OCI Image Spec labels for container metadata; add HEALTHCHECK directives for orchestration; add .dockerignore to reduce context size; update Makefile with buildx and lint target; add retry logic to ray-entrypoint.sh. Refs: ADR-0012 (uv), ADR-0014 (Docker best practices)	2026-02-02 07:26:27 -05:00
Billy D.	a16ffff73f	feat: Add GPU-specific Ray worker images with CI/CD. Changes: add Dockerfiles for nvidia, rdna2, strixhalo, and intel GPU targets; add ray-serve modules (embeddings, whisper, tts, llm, reranker); add Gitea Actions workflow for automated builds; add Makefile for local development; update README with comprehensive documentation.	2026-02-01 15:04:31 -05:00
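
Commit 300582a520 describes a pure-Python shim that reads GPU information from /sys/class/drm instead of calling libamd_smi.so. The shim's source is not shown here, so the following is only a rough sketch of how sysfs-backed device counting could work. The helper names `list_amd_cards` and `count_amd_gpus` are hypothetical and are not the amdsmi API; the sketch assumes the kernel's usual layout, where each `cardN/device/vendor` file holds the PCI vendor ID.

```python
import re
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD
_CARD_RE = re.compile(r"^card(\d+)$")  # skips connector entries such as card0-DP-1


def list_amd_cards(drm_root="/sys/class/drm"):
    """Return the sysfs device directories of AMD GPUs (hypothetical helper)."""
    root = Path(drm_root)
    if not root.is_dir():
        return []  # no DRM subsystem visible, e.g. outside a GPU container
    found = []
    for entry in root.iterdir():
        match = _CARD_RE.match(entry.name)
        if match is None:
            continue
        try:
            vendor = (entry / "device" / "vendor").read_text().strip().lower()
        except OSError:
            continue  # unreadable or missing vendor file: ignore this card
        if vendor == AMD_VENDOR_ID:
            found.append((int(match.group(1)), entry / "device"))
    # Sort numerically so card10 comes after card2.
    return [device for _, device in sorted(found, key=lambda item: item[0])]


def count_amd_gpus(drm_root="/sys/class/drm"):
    """Count AMD GPUs by vendor ID (hypothetical helper, not an amdsmi call)."""
    return len(list_amd_cards(drm_root))
```

Taking the DRM root as a parameter lets the logic be exercised against a temporary directory tree without GPU hardware.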

10 Commits
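
The mem_get_info patch described in commit e7642b86dd can be sketched as follows. This is not the contents of strixhalo_vram_fix.pth, which is not shown here; it is a minimal illustration under two assumptions: that the VRAM counters come from the amdgpu sysfs files `mem_info_vram_total` and `mem_info_vram_used`, and that the 10% difference is measured relative to the sysfs total. All function names are hypothetical.

```python
from pathlib import Path


def read_sysfs_vram(device_dir):
    """Return (total, used) VRAM in bytes from amdgpu sysfs counters."""
    device = Path(device_dir)
    total = int((device / "mem_info_vram_total").read_text().strip())
    used = int((device / "mem_info_vram_used").read_text().strip())
    return total, used


def should_patch(torch_total, sysfs_total, threshold=0.10):
    """True when the totals differ by more than `threshold`.

    Assumption: the difference is taken relative to the sysfs total.
    """
    if sysfs_total <= 0:
        return False
    return abs(torch_total - sysfs_total) / sysfs_total > threshold


def make_mem_get_info(original, device_dir, threshold=0.10):
    """Wrap a mem_get_info-style callable that returns (free, total) in bytes."""

    def patched(*args, **kwargs):
        free, total = original(*args, **kwargs)
        try:
            vram_total, vram_used = read_sysfs_vram(device_dir)
        except (OSError, ValueError):
            return free, total  # sysfs unavailable: keep the original numbers
        if not should_patch(total, vram_total, threshold):
            return free, total  # totals roughly agree: leave the result alone
        return max(vram_total - vram_used, 0), vram_total

    return patched
```

In a real hook the wrapper would be installed with something like `torch.cuda.mem_get_info = make_mem_get_info(torch.cuda.mem_get_info, "/sys/class/drm/card0/device")`, where the card path is an assumption. A `.pth` file in site-packages can trigger this at interpreter startup because Python executes any line in it that begins with `import`.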