v1.0.26: dynamic VRAM via GTT for 32GB carve-out
Some checks failed
Build and Push Images / build (Dockerfile.ray-worker-intel, intel) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-nvidia, nvidia) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-rdna2, rdna2) (push) Has been cancelled
Build and Push Images / build (Dockerfile.ray-worker-strixhalo, strixhalo) (push) Has been cancelled
Build and Push Images / Release (push) Has been cancelled
Build and Push Images / Notify (push) Has been cancelled
Build and Push Images / determine-version (push) Has been cancelled
strixhalo_vram_fix.py: compute effective VRAM as min(GTT_pool, physical_RAM) - 4GB OS reserve instead of raw sysfs VRAM. Prevents OOM when carve-out < model size and prevents kernel OOM when GTT > physical RAM.
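The formula in the commit message can be sketched as a small pure function. This is an illustration, not the contents of strixhalo_vram_fix.py: the function name, the reading of "4GB" as 4 GiB, and the floor at zero are all assumptions.

```python
GIB = 1024 ** 3  # assumption: "4GB" in the commit message means 4 GiB


def effective_vram_bytes(gtt_pool: int, physical_ram: int,
                         os_reserve: int = 4 * GIB) -> int:
    """Return min(gtt_pool, physical_ram) - os_reserve, in bytes.

    The result is floored at zero (an assumption) so that a very small
    pool never yields a negative budget.
    """
    usable = min(gtt_pool, physical_ram) - os_reserve
    return max(usable, 0)


if __name__ == "__main__":
    # Hypothetical machine: 128 GiB of RAM with a 120 GiB GTT pool.
    print(effective_vram_bytes(120 * GIB, 128 * GIB) // GIB)  # 116
```

Taking the minimum first is what covers the second failure mode in the message: if the GTT pool is configured larger than physical RAM, the budget is capped by RAM rather than by the pool.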
@@ -2,28 +2,29 @@
 # Used for: vLLM (Llama 3.1 70B)
 #
 # Build:
-# docker build -t registry.lab.daviestechlabs.io/daviestechlabs/ray-worker-strixhalo:v1.0.21 \
+# docker build -t registry.lab.daviestechlabs.io/daviestechlabs/ray-worker-strixhalo:v2.0.0 \
 # -f dockerfiles/Dockerfile.ray-worker-strixhalo .
 #
 # STRATEGY: Full source build of vLLM on AMD's vendor PyTorch image.
 #
-# The vendor image (rocm/pytorch ROCm 7.0.2 / Ubuntu 24.04 / Python 3.12)
-# ships torch 2.9.1 compiled by AMD CI against the exact ROCm libraries in
-# the image. Pre-built vLLM torch wheels (wheels.vllm.ai) carry a custom
-# torch 2.9.1+git8907517 that segfaults in libhsa-runtime64.so on gfx1151
-# during HSA queue creation. By keeping the vendor torch and compiling vLLM
-# from source we guarantee ABI compatibility across the entire stack.
+# The vendor image (rocm/pytorch ROCm 7.2 / Ubuntu 24.04 / Python 3.12)
+# ships torch 2.9.1+rocm7.2.0 compiled by AMD CI against the exact ROCm
+# libraries in the image. ROCm 7.2 includes the hsakmt VGPR count fix
+# for gfx1151 (TheRock #2991) — ROCm 7.0.x/7.1.x segfault during HSA
+# queue creation due to incorrect VGPR sizing. By keeping the vendor
+# torch and compiling vLLM from source we guarantee ABI compatibility
+# across the entire stack.
 #
-# gfx1151 is mapped to gfx1100 at runtime via HSA_OVERRIDE_GFX_VERSION=11.0.0,
-# so all HIP kernels are compiled for the gfx1100 target.
+# ROCm 7.2 supports gfx1151 natively — no HSA_OVERRIDE_GFX_VERSION needed.
+# HIP kernels are compiled directly for the gfx1151 target.
 #
 # Note: AITER is gfx9-only. On gfx11, vLLM defaults to TRITON_ATTN backend.
 
-FROM docker.io/rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
+FROM docker.io/rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
 
 # ── Build arguments ─────────────────────────────────────────────────────
 ARG VLLM_VERSION=v0.15.1
-ARG PYTORCH_ROCM_ARCH="gfx1100"
+ARG PYTORCH_ROCM_ARCH="gfx1151"
 ARG MAX_JOBS=16
 
 # ── OCI labels ──────────────────────────────────────────────────────────
@@ -32,7 +33,7 @@ LABEL org.opencontainers.image.description="Ray Serve worker for AMD Strix Halo
 LABEL org.opencontainers.image.vendor="DaviesTechLabs"
 LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/kuberay-images"
 LABEL org.opencontainers.image.licenses="MIT"
-LABEL gpu.target="amd-rocm-7.0.2-gfx1151"
+LABEL gpu.target="amd-rocm-7.2-gfx1151"
 LABEL ray.version="2.53.0"
 LABEL vllm.build="source"
 
@@ -52,17 +53,16 @@ ENV PATH="/opt/venv/bin:/opt/rocm/bin:/opt/rocm/llvm/bin:/home/ray/.local/bin:/u
 HIP_VISIBLE_DEVICES=0 \
 HSA_ENABLE_SDMA=0 \
 PYTORCH_ALLOC_CONF="max_split_size_mb:512" \
-HSA_OVERRIDE_GFX_VERSION="11.0.0" \
-ROCM_TARGET_LST="gfx1151,gfx1100"
+ROCM_TARGET_LST="gfx1151"
 
 # ── System setup ─────────────────────────────────────────────────────────
-# The vendor image already ships ALL needed packages:
-# cmake 4.0, hipcc 7.0.2, clang++ 20.0 (AMD ROCm LLVM), git,
-# libelf, libnuma, libdrm, libopenmpi3, and HIP dev headers/cmake configs.
+# The vendor image ships hipcc 7.2, clang++ (AMD ROCm LLVM), git,
+# libelf, libnuma, libdrm, libopenmpi3, and HIP dev headers/cmake configs.
+# cmake is NOT in the 7.2 image — installed via pip below.
 #
 # CRITICAL: Do NOT run apt-get upgrade or install ANY packages from apt.
 # Even installing ccache triggers a dependency cascade that pulls in
-# Ubuntu's hipcc 5.7.1 (which overwrites the vendor hipcc 7.0.2) and
+# Ubuntu's hipcc 5.7.1 (which overwrites the vendor hipcc 7.2) and
 # a broken /usr/bin/hipconfig.pl that makes cmake find_package(hip)
 # report version 0.0.0 → "Can't find CUDA or HIP installation."
 #
@@ -81,9 +81,9 @@ RUN (groupadd -g 100 -o users 2>/dev/null || true) \
 COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
 
 # ── Python build dependencies ──────────────────────────────────────────
-# CRITICAL: vLLM requires cmake<4. The vendor image ships cmake 4.0.0
-# which changed find_package(MODULE) behaviour and breaks FindHIP.cmake
-# (reports HIP version 0.0.0). Downgrade to 3.x per vLLM's rocm-build.txt.
+# CRITICAL: vLLM requires cmake<4. cmake 4.0+ changed find_package(MODULE)
+# behaviour and breaks FindHIP.cmake (reports HIP version 0.0.0).
+# The ROCm 7.2 image does not ship cmake, so we install 3.x here.
 RUN --mount=type=cache,target=/root/.cache/uv \
 uv pip install --python /opt/venv/bin/python3 \
 'cmake>=3.26.1,<4' \
@@ -234,10 +234,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 
 # ── Verify vendor torch survived ───────────────────────────────────────
 # Fail early if any install step accidentally replaced the vendor torch.
+# ROCm 7.2 vendor torch version: 2.9.1+rocm7.2.0.git7e1940d4
 RUN python3 -c "\
 import torch; \
 v = torch.__version__; \
 assert '+git' not in v, f'vLLM torch detected ({v}) — vendor torch was overwritten!'; \
+assert 'rocm7.2' in v, f'Expected ROCm 7.2 vendor torch, got {v}'; \
 print(f'torch {v} (vendor) OK')"
 
 # ── amdsmi sysfs shim ──────────────────────────────────────────────────
 
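The vendor-torch check in the last hunk can be exercised without a GPU or a torch install by moving the two assertions into a function that takes the version string. This is a minimal sketch: the helper name `check_vendor_torch` is hypothetical, and the two sample strings are the version identifiers quoted in the Dockerfile comments.

```python
def check_vendor_torch(version: str) -> str:
    """Apply the same two checks as the Dockerfile's inline script.

    Raises AssertionError if the string looks like a vLLM-built torch
    ('+git' local version) or does not name ROCm 7.2.
    """
    assert "+git" not in version, (
        f"vLLM torch detected ({version}): vendor torch was overwritten!"
    )
    assert "rocm7.2" in version, f"Expected ROCm 7.2 vendor torch, got {version}"
    return f"torch {version} (vendor) OK"


if __name__ == "__main__":
    # Vendor build string quoted in the Dockerfile comment.
    print(check_vendor_torch("2.9.1+rocm7.2.0.git7e1940d4"))
    # The vLLM wheel build named in the old comment trips the first assert.
    try:
        check_vendor_torch("2.9.1+git8907517")
    except AssertionError as exc:
        print(f"rejected: {exc}")
```

The vendor string contains `.git7e1940d4` but not `+git`, which is why the first assertion tests for the `+git` prefix rather than for `git` alone.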