feat(strixhalo): add amdsmi sysfs shim to bypass glibc 2.38 requirement
Some checks failed
Build and Push Images / determine-version (push) Successful in 58s
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / build-strixhalo (push) Has been cancelled
Build and Push Images / build-intel (push) Has been cancelled
Build and Push Images / build-nvidia (push) Has been cancelled
Build and Push Images / build-rdna2 (push) Has been cancelled
The native amdsmi from ROCm 7.1 requires libamd_smi.so linked against glibc 2.38 (Ubuntu 24.04), but the Ray base image is Ubuntu 22.04 (glibc 2.35). This caused vLLM to fail ROCm platform detection with 'No module named amdsmi' / GLIBC_2.38 not found errors.

Solution: a pure-Python amdsmi shim that reads GPU info from sysfs (/sys/class/drm/*) instead of the native library. It provides the full API surface used by both vLLM (platform detection, device info) and PyTorch (device counting, memory/power/temp monitoring).

Tested in-container: vLLM detects RocmPlatform, PyTorch sees the GPU (Radeon 8060S, 128GB, HIP 7.3), and DeviceConfig resolves to 'cuda'.

Changes:
- Add amdsmi-shim/ package with sysfs-backed implementation
- Update Dockerfile to install the shim after vLLM/torch
- Add amdsmi-shim/ to .dockerignore explicit includes
@@ -32,7 +32,7 @@ COPY --link --from=rocm-source /opt/rocm /opt/rocm
 # ROCm environment variables - split to ensure ROCM_HOME is set first
 ENV ROCM_HOME=/opt/rocm
 ENV PATH="/opt/rocm/bin:/opt/rocm/llvm/bin:/home/ray/anaconda3/bin:/home/ray/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
-    LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/lib64" \
+    LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/lib64:/home/ray/anaconda3/lib" \
     HSA_PATH="/opt/rocm/hsa" \
     HIP_PATH="/opt/rocm/hip" \
 # Strix Halo (gfx1151) specific settings
@@ -84,6 +84,15 @@ RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
 RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
     uv pip install --system 'pandas>=2.0.0,<3.0'
 
+# Install amdsmi sysfs shim LAST (required for vLLM ROCm platform detection).
+# The native amdsmi from ROCm 7.1 requires glibc 2.38 (Ubuntu 24.04),
+# but the Ray base image is Ubuntu 22.04 (glibc 2.35). This pure-Python
+# shim reads GPU info from /sys/class/drm/* instead of libamd_smi.so.
+# Must be installed after vLLM/torch to prevent PyPI amdsmi from overwriting it.
+COPY --chown=1000:100 amdsmi-shim /tmp/amdsmi-shim
+RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
+    uv pip install --system /tmp/amdsmi-shim && rm -rf /tmp/amdsmi-shim
+
 # Pre-download common models for faster cold starts (optional, increases image size)
 # RUN python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-large-en-v1.5')"
 
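The glibc mismatch that motivates the shim can be confirmed from inside either image without touching amdsmi at all; a quick standard-library check (the expected value in the comment assumes the Ubuntu 22.04 Ray base image described above):

```python
import platform

# Report the C library the running interpreter is linked against.
# On the Ubuntu 22.04 Ray base image this would be ('glibc', '2.35'),
# short of the 2.38 that the native libamd_smi.so requires.
libc, version = platform.libc_ver()
print(libc, version)
```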