# ADR-0014: Docker Build Best Practices

## Status

Accepted

## Date

2026-02-02

## Context

Our ML/AI platform relies heavily on containerized services, particularly GPU workers for KubeRay that include large dependencies (PyTorch, vLLM, ROCm, CUDA). These images can take 30+ minutes to build and exceed 10GB in size.

We need standardized practices to ensure:

1. **Fast rebuilds** - Avoid re-downloading dependencies on every build
2. **Reproducibility** - Consistent builds across different machines
3. **Security** - Non-root execution, minimal attack surface
4. **Observability** - Proper metadata for image management
5. **Consistency** - Same patterns across all Dockerfiles

## Decision

We adopt the following Docker build best practices across all repositories:

### 1. BuildKit Syntax and Features

```dockerfile
# syntax=docker/dockerfile:1.7
```

All Dockerfiles declare BuildKit Dockerfile syntax 1.7 or later on their first line. This covers the cache mounts and `COPY --link` used in the sections below.

### 2. Use uv for Python Package Installation

Replace pip with uv for dramatically faster installs (10-100x):

```dockerfile
# Install uv (pin a specific tag instead of latest for reproducible builds)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install packages, reusing uv's download cache across builds.
# Do not pass --no-cache here: it would bypass the cache mount.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system \
    'package>=1.0,<2.0'
```

Benefits:

- Parallel downloads and installs
- Better dependency resolution
- Consistent with ADR-0012 (uv for Python development)

### 3. Cache Mounts for Package Managers

```dockerfile
# APT cache mount
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
    package1 package2

# uv cache mount owned by the non-root user (here ray, 1000:1000)
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system 'package>=1.0'
```

Official Debian and Ubuntu base images ship `/etc/apt/apt.conf.d/docker-clean`, which deletes downloaded packages after each install. Remove that file before `apt-get install`, otherwise the APT cache mount retains nothing between builds.

### 4. OCI Image Specification Labels

All images include standard metadata:

```dockerfile
LABEL org.opencontainers.image.title="Service Name"
LABEL org.opencontainers.image.description="Service description"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"
LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/repo"
LABEL org.opencontainers.image.licenses="MIT"
```

### 5. Health Checks

All service images include a `HEALTHCHECK`. This example assumes `curl` is installed in the image:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
```

### 6. Non-Root Execution

Services run as unprivileged users:

```dockerfile
# or appuser, 1000:1000
USER ray
```

### 7. Version Pinning with Ranges

Dependencies specify a minimum version with an upper bound:

```dockerfile
RUN uv pip install --system \
    'transformers>=4.35.0,<5.0' \
    'torch>=2.0.0,<3.0'
```

### 8. Layer Optimization

- Combine related commands into single `RUN` layers
- Order instructions from least to most frequently changing
- Use multi-stage builds to reduce final image size
- Use `COPY --link` for multi-stage `COPY --from` layers to make them independent of prior layers, improving cache reuse when base images change:

```dockerfile
# --link makes this layer reusable even if the base image changes
COPY --link --from=rocm-source /opt/rocm /opt/rocm
```

### 9. Registry-Based BuildKit Cache

Use `type=registry` cache instead of `type=gha` (which only works on GitHub Actions). This stores build cache layers directly in the container registry with zstd compression:

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
```

Benefits:

- Works on any CI system (Gitea Actions, Jenkins, etc.)
- `mode=max` caches all layers, not just final image layers
- `compression=zstd` is faster than gzip with similar compression ratios
- Cache survives runner restarts (stored in registry, not ephemeral disk)

**Important:** `type=gha` does not work on self-hosted Gitea runners because it requires GitHub's cache API. Always use `type=registry` for self-hosted CI.

### 10. .dockerignore

All repos include a `.dockerignore`:

```
.git
.gitea
*.md
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/
```

### 11. Makefile Integration

Standard targets for building and linting:

```makefile
.PHONY: lint build

lint:
	hadolint Dockerfile

build:
	docker buildx build --platform linux/amd64 --load -t image:tag .
```

## Consequences

### Positive

- **10-100x faster pip operations** with uv and cache mounts
- **Consistent builds** via version pinning with bounded ranges
- **Better observability** through OCI labels
- **Improved security** with non-root execution
- **Faster CI/CD** through BuildKit caching

### Negative

- **Requires Docker BuildKit** - BuildKit is the default builder since Docker Engine 23.0; older versions must set `DOCKER_BUILDKIT=1` or use buildx
- **Cache invalidation complexity** - Cache mounts persist across builds and are not invalidated by Dockerfile changes
- **Learning curve** - Developers must understand BuildKit syntax

## Related ADRs

- [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay GPU backend
- [ADR-0012](0012-use-uv-for-python-development.md) - uv for Python development
- [ADR-0013](0013-gitea-actions-for-ci.md) - Gitea Actions CI/CD
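
## Example

The Decision section shows each practice in isolation and mentions multi-stage builds without an example. The following is a minimal sketch of how the pieces might fit together in one Dockerfile, not a reference implementation. The base image (`python:3.12-slim`), the `appuser` account, the `app/` directory, the `app.main:app` entry point, the label values, and the two packages are all hypothetical placeholders chosen for illustration. The health check uses Python's standard library rather than `curl`, since the slim base image does not include `curl`.

```dockerfile
# syntax=docker/dockerfile:1.7

# ---- Builder stage: resolve and install dependencies ----
# Hypothetical base image; substitute the one your service uses.
FROM python:3.12-slim AS builder

# uv binary from the official image (pin a specific tag in real builds)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# The cache mount lives on a different filesystem than the venv,
# so ask uv to copy files instead of hardlinking them.
ENV UV_LINK_MODE=copy

# Build a self-contained virtual environment, reusing uv's cache across builds.
# Package names and ranges are placeholders.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv venv /opt/venv && \
    uv pip install --python /opt/venv/bin/python \
        'fastapi>=0.110,<1.0' \
        'uvicorn>=0.29,<1.0'

# ---- Runtime stage: only what the service needs ----
FROM python:3.12-slim AS runtime

LABEL org.opencontainers.image.title="example-service"
LABEL org.opencontainers.image.description="Hypothetical example service"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"

# Unprivileged user (hypothetical name, uid/gid 1000)
RUN useradd --uid 1000 --create-home appuser

# --link keeps this layer reusable even if the runtime base image changes
COPY --link --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /app
# Application code changes most often, so it is copied last
COPY --chown=appuser:appuser app/ ./app/

USER appuser

# urlopen raises on connection errors and HTTP error statuses,
# so a failing /health endpoint yields a non-zero exit code.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Both stages use the same base image so that the interpreter path recorded in `/opt/venv` stays valid after the copy. The build tooling (uv and its cache) never reaches the runtime image, which is where the size reduction from the multi-stage split comes from.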