2e3fbb8c60
feat(strixhalo): full source build of vLLM for gfx1151 (v1.0.20)
...
Build and Push Images / determine-version (push) Successful in 7s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
- Build vLLM v0.15.1 from source against vendor torch 2.9.1
- Preserve AMD's vendor PyTorch from rocm/pytorch:rocm7.0.2 base
- use_existing_torch.py --prefix to strip torch from build-requires
- Compile C++/HIP extensions for gfx1100 (mapped from gfx1151)
- Install triton/flash-attn from wheels.vllm.ai/rocm with --no-deps
- Add torch vendor verification step to catch accidental overwrites
- Fix GPU_RESOURCE default to match cluster (gpu_strixhalo)
- Remove unsupported expandable_segments from PYTORCH_ALLOC_CONF
- AITER is gfx9-only; gfx11 uses TRITON_ATTN backend by default
2026-02-09 15:46:25 -05:00
ab2a7f486e
fix(strixhalo): switch base to ROCm 7.0.2 to fix libhsa segfault
...
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
ROCm 7.1 system libraries (libhsa-runtime64.so.1.18.70100) are
ABI-incompatible with the torch/vLLM ROCm 7.0 wheels from wheels.vllm.ai.
This caused SIGSEGV at 0x34 in libhsa-runtime64 on every GPU operation.
Switch to rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
which provides matching ROCm 7.0.2 system libraries while keeping
Ubuntu 24.04 (glibc 2.38) and Python 3.12.
2026-02-09 14:37:05 -05:00
3a33ed387f
fixing strixhalo builds.
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 12:49:39 -05:00
65de596212
big refactor.
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 12:17:12 -05:00
a20a5d2ccd
mo fixes.
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 49s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 1m25s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
2026-02-09 11:46:10 -05:00
b0c58b98a0
fix
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
2026-02-09 11:31:18 -05:00
5f2d167ba0
fixing build problem.
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
2026-02-09 11:12:34 -05:00
fcc9781d42
different rocm
...
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
;
;
2026-02-09 11:08:33 -05:00
c9cf143821
more fixes.
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 40s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 43s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
2026-02-09 10:56:51 -05:00
2c38cce20c
fix.
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
2026-02-09 10:43:56 -05:00
2e3e014b80
fixing nvidia and strixhalo
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
2026-02-09 10:24:32 -05:00
2a32dddd59
fixing coqui
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 18s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 21s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
2026-02-09 09:14:09 -05:00
6aad7ad38a
fix: update to python 3.12.
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 21s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 19s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 23s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-09 08:52:32 -05:00
64585dac7e
fixing numpy pin.
Build and Push Images / determine-version (push) Successful in 6s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 21s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 24s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 34s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
2026-02-08 21:39:11 -05:00
f297deca9d
fixing vllm.
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 39s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 42s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 20s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 23s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-08 16:53:16 -05:00
9042460736
fix(strixhalo): add re-entry guard to prevent offload-arch fork bomb
...
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 27s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 25s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
torch init calls offload-arch (a Python script), which re-enters the
.pth hook and triggers another import torch, creating an infinite fork
storm (1000+ processes). Set the _STRIXHALO_VRAM_FIX_ACTIVE env var
before importing torch so child processes skip the patch.
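A minimal sketch of the re-entry guard described above. Only the
variable name _STRIXHALO_VRAM_FIX_ACTIVE comes from this commit; the
helper name and the injectable environ argument are hypothetical and
exist to keep the example testable without torch.

```python
import os

GUARD_VAR = "_STRIXHALO_VRAM_FIX_ACTIVE"  # name taken from the commit message


def run_once_per_process_tree(patch, environ=os.environ):
    """Run patch() unless an ancestor process already set the guard.

    Hypothetical helper: the real hook may be structured differently.
    """
    if environ.get(GUARD_VAR):
        # A parent is already patching; this is a child such as offload-arch.
        return False
    # Set the guard BEFORE the expensive import so that any child process
    # spawned during the import inherits it and skips the patch.
    environ[GUARD_VAR] = "1"
    patch()
    return True
```

The ordering is the point: the guard must be set before the import that
spawns children, otherwise each child would start its own import.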
2026-02-07 08:47:06 -05:00
d1b6d78c66
fix(strixhalo): skip VRAM patch in low-memory init containers
...
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 24s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 27s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 22s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 24s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
KubeRay's auto-injected wait-gcs-ready init container has only a 256Mi
memory limit. The .pth hook was unconditionally importing torch+ROCm,
which requires >256Mi, causing an OOMKill.
The hook now checks the cgroup memory limit first: if it is under
512Mi, it skips the expensive torch import entirely. The VRAM patch is
only needed by the main Ray worker process, not by health-check init
containers.
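The cgroup check could look roughly like the sketch below. The 512Mi
threshold is from this commit; the two file paths are the standard
cgroup v2 and v1 locations, and the function names are hypothetical.
Which paths the actual hook reads is not stated here.

```python
from pathlib import Path

CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)
MIN_BYTES = 512 * 1024 * 1024  # 512Mi threshold from the commit message


def parse_limit(text):
    """Return the limit in bytes, or None when the cgroup is unlimited."""
    text = text.strip()
    if text == "max" or not text.isdigit():
        return None
    return int(text)


def read_memory_limit(paths=CGROUP_LIMIT_FILES):
    """Return the limit from the first readable cgroup file, if any."""
    for path in paths:
        try:
            return parse_limit(path.read_text())
        except OSError:
            continue
    return None


def should_skip_patch(limit_bytes, minimum=MIN_BYTES):
    """Skip the torch import when a limit exists and is below the minimum."""
    return limit_bytes is not None and limit_bytes < minimum
```

With a 256Mi limit, should_skip_patch returns True, so the init
container never pays for the torch import.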
2026-02-06 19:15:49 -05:00
e7642b86dd
feat(strixhalo): patch torch.cuda.mem_get_info for unified memory APU
...
Build and Push Images / determine-version (push) Successful in 4s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 25s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 28s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 23s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 26s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
On Strix Halo, PyTorch reports the GTT pool (128 GiB) as device memory
instead of the real VRAM (96 GiB from BIOS). vLLM uses mem_get_info()
to pre-allocate and refuses to start when free GTT (29 GiB) is less
than the requested amount.
The strixhalo_vram_fix.pth hook auto-patches mem_get_info on Python
startup to read the real VRAM total/used from /sys/class/drm sysfs.
It only activates when PyTorch's total differs from the sysfs VRAM
total by more than 10%.
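A sketch of the sysfs-backed replacement, kept free of torch so it can
run anywhere. mem_info_vram_total and mem_info_vram_used are the amdgpu
sysfs attribute names; the function names, the device_dir parameter,
and measuring the 10% difference relative to the sysfs total are
assumptions, not details taken from the hook itself.

```python
from pathlib import Path


def read_sysfs_vram(device_dir):
    """Read (total, used) bytes from amdgpu sysfs files under device_dir."""
    base = Path(device_dir)
    total = int((base / "mem_info_vram_total").read_text())
    used = int((base / "mem_info_vram_used").read_text())
    return total, used


def needs_patch(torch_total, sysfs_total, tolerance=0.10):
    """True when PyTorch's total differs from sysfs VRAM by more than 10%.

    Assumption: the difference is measured relative to the sysfs total.
    """
    return abs(torch_total - sysfs_total) > tolerance * sysfs_total


def make_mem_get_info(device_dir):
    """Build a replacement returning (free, total), the same tuple order
    as torch.cuda.mem_get_info."""
    def mem_get_info(device=None):
        total, used = read_sysfs_vram(device_dir)
        return max(total - used, 0), total
    return mem_get_info
```

A hook would then assign the result of make_mem_get_info(...) to
torch.cuda.mem_get_info only when needs_patch(...) is true.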
2026-02-06 16:29:46 -05:00
98c3ef284f
fix(ci): simplify workflow to matrix strategy for Gitea compat
...
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Failing after 23m46s
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Failing after 23m50s
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Failing after 31s
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Failing after 43s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
- Replace 4 separate build jobs with single matrix build job
- Eliminates complex dependency graph causing 'must contain one job
without dependencies' parse error in Gitea act_runner
- All if: conditions now use single-line strings (no multi-line |)
- workflow_dispatch image filter moved to step-level check
- Add stale buildx builder cleanup step before each build
- Simplify release/notify to depend on single 'build' job
2026-02-06 15:41:19 -05:00
3bc0b848de
fix(ci): add amdsmi-shim to paths filter
...
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Changes to the amdsmi-shim package should trigger image rebuilds.
2026-02-06 08:52:33 -05:00
300582a520
feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement
...
Build and Push Images / determine-version (push) Successful in 58s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against
glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04
(glibc 2.35). This caused vLLM to fail ROCm platform detection with
'No module named amdsmi' / GLIBC_2.38 not found errors.
Solution: Pure-Python amdsmi shim that reads GPU info from sysfs
(/sys/class/drm/*) instead of the native library. Provides the full
API surface used by both vLLM (platform detection, device info) and
PyTorch (device counting, memory/power/temp monitoring).
Tested in-container: vLLM detects RocmPlatform, PyTorch sees GPU
(Radeon 8060S, 128GB, HIP 7.3), DeviceConfig resolves to 'cuda'.
Changes:
- Add amdsmi-shim/ package with sysfs-backed implementation
- Update Dockerfile to install shim after vLLM/torch
- Add amdsmi-shim/ to .dockerignore explicit includes
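To illustrate the sysfs approach, here is a small sketch of device
discovery without the native library. It is not the shim's actual code
and does not reproduce the amdsmi API; list_amd_gpus is a hypothetical
name. It relies only on the PCI vendor file that the kernel exposes
under /sys/class/drm/cardN/device (0x1002 is AMD's vendor ID).

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID for AMD


def list_amd_gpus(drm_root="/sys/class/drm"):
    """Return the sysfs device directories of AMD GPUs under drm_root."""
    devices = []
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in card.name:
            # Skip connector nodes such as card0-DP-1.
            continue
        try:
            vendor = (card / "device" / "vendor").read_text().strip()
        except OSError:
            continue
        if vendor == AMD_VENDOR_ID:
            devices.append(card / "device")
    return devices
```

Device counting then reduces to len(list_amd_gpus()), and per-device
attributes can be read from files inside each returned directory.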
2026-02-06 08:28:07 -05:00
5f1873908f
overhaul image builds.
Build and Push Images / determine-version (push) Successful in 5s
Build and Push Images / build-nvidia (push) Failing after 21s
Build and Push Images / build-rdna2 (push) Failing after 21s
Build and Push Images / build-strixhalo (push) Failing after 12s
Build and Push Images / build-intel (push) Failing after 19s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-06 07:47:37 -05:00
38784f3a04
fix: use correct UID:GID 1000:100 for ray user
...
Build and Push Images / determine-version (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Ray official images use uid=1000(ray) gid=100(users).
Using numeric IDs for podman compatibility.
2026-02-05 17:32:27 -05:00
5768af76bf
fix: use fully-qualified image names for podman compatibility
...
Build and Push Images / determine-version (push) Successful in 27s
Build and Push Images / build-nvidia (push) Has started running
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Podman requires the docker.io/ prefix for Docker Hub images when
unqualified-search-registries is not configured.
2026-02-05 17:25:17 -05:00
70a3c3ad6d
feat: add podman support to Makefile
...
Auto-detects podman or docker, with override via CONTAINER_ENGINE.
Podman uses 'podman build', docker uses 'docker buildx build --load'.
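The detection logic lives in the Makefile; the sketch below restates it
in Python for illustration only. CONTAINER_ENGINE and the two build
commands come from this commit. Preferring podman over docker when both
are installed, the -t flag, and the function names are assumptions.

```python
import os
import shutil


def detect_engine(environ=os.environ, which=shutil.which):
    """Return the container engine, honouring the CONTAINER_ENGINE override."""
    override = environ.get("CONTAINER_ENGINE")
    if override:
        return override
    # Assumption: podman is preferred when both engines are installed.
    for candidate in ("podman", "docker"):
        if which(candidate):
            return candidate
    raise RuntimeError("neither podman nor docker found on PATH")


def build_command(engine, tag, context="."):
    """Return the argv for building an image with the given engine."""
    if engine == "podman":
        return ["podman", "build", "-t", tag, context]
    # docker goes through buildx and loads the result into the local store.
    return ["docker", "buildx", "build", "--load", "-t", tag, context]
```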
2026-02-05 17:23:18 -05:00
5606a9a626
fix: notify job and registry push issues
Build and Push Images / determine-version (push) Waiting to run
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
2026-02-05 06:04:09 -05:00
bc3c115b90
fix: Use internal HTTP endpoint with buildx config and direct auth
...
Build and Push Images / determine-version (push) Successful in 1m24s
Build and Push Images / build-rdna2 (push) Failing after 3h11m33s
Build and Push Images / build-nvidia (push) Failing after 3h11m35s
Build and Push Images / build-intel (push) Failing after 17m53s
Build and Push Images / build-strixhalo (push) Failing after 3h11m34s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
- Back to internal endpoint (avoids Cloudflare 100MB limit)
- buildkitd-config-inline: http=true, insecure=true for HTTP registry
- Create ~/.docker/config.json directly with base64 auth
- No docker login command (it defaults to HTTPS)
- Buildx reads config.json for push authentication
2026-02-04 18:08:28 -05:00
dd6c400581
fix: Use external HTTPS endpoint with valid cert for registry
...
Build and Push Images / determine-version (push) Successful in 54s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Simplify approach - use git.daviestechlabs.io external endpoint
which has valid Let's Encrypt cert. Much cleaner than fighting
with HTTP/HTTPS issues on internal endpoints.
- Remove buildkitd-config-inline (not needed for valid HTTPS)
- Remove manual config.json creation
- Use standard docker/login-action for Gitea registry
2026-02-04 18:01:58 -05:00
a77d5db274
fix: Create docker config.json directly for buildx auth
...
Build and Push Images / determine-version (push) Successful in 55s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Bypass docker login command which requires daemon configuration.
Instead, create ~/.docker/config.json directly with base64 auth.
Buildx uses this config for registry authentication during push.
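A sketch of writing the auth file directly. The "auths" layout with a
base64-encoded "user:password" string is the standard Docker client
config format; the function names and the 0600 file mode are choices
made for this example, and the workflow itself presumably does this in
shell.

```python
import base64
import json
from pathlib import Path


def build_docker_config(registry, username, password):
    """Return a config.json payload with basic auth for one registry."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"auths": {registry: {"auth": token}}}


def write_docker_config(config, docker_dir):
    """Write config.json under docker_dir (normally ~/.docker)."""
    docker_dir = Path(docker_dir)
    docker_dir.mkdir(parents=True, exist_ok=True)
    path = docker_dir / "config.json"
    path.write_text(json.dumps(config, indent=2))
    path.chmod(0o600)  # the file holds credentials; keep it private
    return path
```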
2026-02-04 17:53:02 -05:00
9e9a93b838
fix: Use internal HTTP endpoint for rootless DinD runner
...
Build and Push Images / determine-version (push) Successful in 1m30s
Build and Push Images / build-nvidia (push) Failing after 6m24s
Build and Push Images / build-strixhalo (push) Failing after 5m14s
Build and Push Images / build-rdna2 (push) Failing after 6m54s
Build and Push Images / build-intel (push) Failing after 5m59s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
- Switch from external HTTPS to internal HTTP (gitea-http.gitea.svc.cluster.local:3000)
- Remove sudo commands that don't work in rootless Docker-in-Docker
- Use direct docker login with --password-stdin for compatibility
- Add http=true to buildkitd config for HTTP registry
2026-02-04 15:27:53 -05:00
110d1eab55
fix: Configure Docker daemon for insecure registry before login
...
Build and Push Images / determine-version (push) Successful in 53s
Build and Push Images / build-nvidia (push) Failing after 7m2s
Build and Push Images / build-rdna2 (push) Failing after 7m6s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
The docker/login-action needs the registry marked as insecure in the
Docker daemon config, not just in buildkitd. This adds a step to
configure /etc/docker/daemon.json with insecure-registries before
attempting to login.
2026-02-04 15:18:06 -05:00
e299f6476e
fix: Use external registry URL for proper Bearer token auth
...
Build and Push Images / determine-version (push) Successful in 1m32s
Build and Push Images / build-nvidia (push) Failing after 6m47s
Build and Push Images / build-rdna2 (push) Failing after 7m8s
Build and Push Images / build-strixhalo (push) Failing after 6m35s
Build and Push Images / build-intel (push) Failing after 6m35s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
Gitea's container registry uses Bearer token auth with realm pointing
to external URL. Changed from internal K8s service URL to
registry.lab.daviestechlabs.io for proper auth flow.
Also removed insecure registry buildx config since using HTTPS now.
2026-02-04 08:13:35 -05:00
5cb79a0fe7
fix: Use docker/login-action for buildx registry authentication
...
Build and Push Images / determine-version (push) Successful in 57s
Build and Push Images / build-nvidia (push) Failing after 6m47s
Build and Push Images / build-rdna2 (push) Failing after 7m10s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
docker login doesn't properly propagate credentials to buildx builders.
docker/login-action handles this correctly and creates proper ~/.docker/config.json
2026-02-04 08:00:12 -05:00
338b668388
feat: Add semantic versioning based on commit message prefixes
...
Build and Push Images / determine-version (push) Successful in 55s
Build and Push Images / build-nvidia (push) Failing after 1h52m48s
Build and Push Images / build-rdna2 (push) Failing after 3h14m40s
Build and Push Images / build-strixhalo (push) Failing after 1h52m42s
Build and Push Images / build-intel (push) Failing after 3h14m39s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
- Added determine-version job that runs BEFORE builds
- Version bump based on commit message:
  - major: or BREAKING CHANGE → major bump
  - minor:, feat:, or feature: → minor bump
  - everything else → patch bump
- All build jobs now depend on determine-version
- Images tagged with calculated version (e.g. v1.2.3) + latest
- Release job creates git tag after successful builds
- Notify job includes version info in notifications
- PRs get tagged with pr-<number>
- Manual tag pushes use tag directly (no version recalculation)
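The bump rules above can be sketched as follows. This is a literal
reading of the listed prefixes, not the workflow's actual script: the
function names are hypothetical, and whether scoped prefixes such as
feat(scope): count as feat: is not stated here, so this sketch treats
them as patch bumps.

```python
def bump_kind(message):
    """Classify a commit message as a major, minor, or patch bump."""
    first_line = message.splitlines()[0] if message else ""
    if first_line.startswith("major:") or "BREAKING CHANGE" in message:
        return "major"
    if first_line.startswith(("minor:", "feat:", "feature:")):
        return "minor"
    return "patch"


def next_version(current, message):
    """Return the next vMAJOR.MINOR.PATCH tag for the given commit message."""
    major, minor, patch = (int(p) for p in current.lstrip("v").split("."))
    kind = bump_kind(message)
    if kind == "major":
        return f"v{major + 1}.0.0"
    if kind == "minor":
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"
```

For example, next_version("v1.2.3", "feat: add x") gives "v1.3.0",
while any message without a listed prefix gives "v1.2.4".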
2026-02-03 22:30:48 -05:00
0bb3d25df7
trigger: rebuild after clearing runner cache
2026-02-03 22:25:35 -05:00
40c544ba0a
fix: remove COPY ray-serve/ - now installed from PyPI
...
Build and Push Images / build-nvidia (push) Failing after 13s
Build and Push Images / build-strixhalo (push) Failing after 1m56s
Build and Push Images / build-rdna2 (push) Failing after 2m8s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
ray-serve-apps package is now installed from Gitea PyPI registry
at runtime by the RayService configuration, not bundled in image.
2026-02-03 22:23:05 -05:00
96921fe799
fix: workflow conditions for push events
...
Build and Push Images / build-nvidia (push) Failing after 15s
Build and Push Images / build-rdna2 (push) Failing after 17s
Build and Push Images / build-strixhalo (push) Failing after 15s
Build and Push Images / build-intel (push) Failing after 16s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
The if conditions were checking github.event.inputs.image == '' which
fails for push events where inputs is undefined. Changed logic to run
all builds unless this is a workflow_dispatch with a specific image
selected.
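The corrected condition, restated as a Python predicate for clarity.
The real check is an if: expression in the workflow; the function name
is hypothetical, and the last branch (the selected image itself still
builds) is an assumption not spelled out in this message.

```python
def should_build(event_name, selected_image, this_image):
    """Decide whether the build job for this_image should run."""
    if event_name != "workflow_dispatch":
        # push and other events have no inputs, so every image builds.
        return True
    if not selected_image:
        # Manual run without a specific image: build everything.
        return True
    # Assumption: a manual run with a selection builds only that image.
    return selected_image == this_image
```

Treating a missing input (None) and an empty string the same way avoids
the original bug, where comparing against '' failed on push events.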
2026-02-03 21:39:17 -05:00
7e7822f995
trigger: rebuild rdna2 image
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
2026-02-03 21:34:53 -05:00
aac9508c28
trigger: rebuild worker images after fix
2026-02-03 21:32:13 -05:00
cb7dad96c1
fix: PATH variable expansion in ROCm worker Dockerfiles
...
Build and Push Images / build-rdna2 (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Split ENV ROCM_HOME and ENV PATH into separate commands to fix variable
expansion issue. When ROCM_HOME and PATH were in the same ENV line,
${ROCM_HOME} expanded to empty string since it wasn't defined yet.
This was causing 'ray: command not found' in init containers.
2026-02-03 21:07:00 -05:00
a8943c79ad
refactor: remove ray-serve (moved to dedicated repo)
...
Implements ADR-0024: Ray Repository Structure
ray-serve is now a standalone PyPI package repo:
- https://git.daviestechlabs.io/billy/ray-serve
kuberay-images now contains only Docker images for Ray workers
2026-02-03 07:45:48 -05:00
796997cf06
adding intel image build fixes.
Build and Push Images / build-nvidia (push) Failing after 6m29s
Build and Push Images / build-strixhalo (push) Failing after 5m27s
Build and Push Images / build-intel (push) Failing after 4m6s
Build and Push Images / build-rdna2 (push) Failing after 2h19m57s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 2s
2026-02-02 21:16:48 -05:00
81388aed2c
ci: retry build with Docker Hub auth
2026-02-02 17:44:43 -05:00
8af9d04210
fix(ci): configure Docker buildx for insecure HTTP registry
Build and Push Images / build-nvidia (push) Failing after 6m6s
Build and Push Images / build-rdna2 (push) Failing after 6m31s
Build and Push Images / build-strixhalo (push) Failing after 5m35s
Build and Push Images / build-intel (push) Failing after 5m33s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
2026-02-02 17:21:39 -05:00
456f08ec81
fix: use internal K8s service URL for container registry
...
Build and Push Images / build-rdna2 (push) Failing after 8m19s
Build and Push Images / build-nvidia (push) Failing after 9m26s
Build and Push Images / build-strixhalo (push) Failing after 6m50s
Build and Push Images / build-intel (push) Failing after 7m14s
Build and Push Images / Release (push) Has been skipped
Build and Push Images / Notify (push) Successful in 1s
- Switch from external git.daviestechlabs.io to internal gitea-http.gitea.svc
- Avoids Cloudflare/Authentik routing since runner is in-cluster
- Add REGISTRY_HOST env var for login steps
2026-02-02 13:28:51 -05:00
3c788fe2b6
fix(strixhalo): upgrade pandas for numpy 2.x compatibility
...
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
Ray base image has pandas 1.5.3 compiled against numpy 1.x, but TheRock
PyTorch ROCm wheels require numpy 2.x. This causes:
ValueError: numpy.dtype size changed, may indicate binary incompatibility
Fix by installing pandas 2.x which is compatible with numpy 2.x.
2026-02-02 13:25:28 -05:00
4e813cea64
fix: use twine for PyPI upload with internal URL
...
Build and Publish ray-serve-apps / lint (push) Successful in 1m32s
Build and Publish ray-serve-apps / publish (push) Successful in 2m4s
Replaces curl-based upload with twine which handles the
PyPI upload protocol correctly. Uses TWINE_REPOSITORY_URL
env var to point to internal Gitea service.
2026-02-02 12:40:33 -05:00
18302cf640
chore: trigger ray-serve publish
Build and Publish ray-serve-apps / lint (push) Successful in 1m31s
Build and Publish ray-serve-apps / publish (push) Failing after 1m28s
2026-02-02 12:35:03 -05:00
45a89ffb2c
chore: trigger workflow to test secrets
2026-02-02 12:33:44 -05:00
7b4871f554
debug: check if secrets are being passed
Build and Publish ray-serve-apps / lint (push) Successful in 1m33s
Build and Publish ray-serve-apps / publish (push) Failing after 1m32s
2026-02-02 12:20:39 -05:00