ADR-0014: Docker Build Best Practices

Status

Accepted

Date

2026-02-02

Context

Our ML/AI platform relies heavily on containerized services, particularly GPU workers for KubeRay that include large dependencies (PyTorch, vLLM, ROCm, CUDA). These images can take 30+ minutes to build and exceed 10GB in size. We need standardized practices to ensure:

  1. Fast rebuilds - Avoid re-downloading dependencies on every build
  2. Reproducibility - Consistent builds across different machines
  3. Security - Non-root execution, minimal attack surface
  4. Observability - Proper metadata for image management
  5. Consistency - Same patterns across all Dockerfiles

Decision

We adopt the following Docker build best practices across all repositories:

1. BuildKit Syntax and Features

# syntax=docker/dockerfile:1.7

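All Dockerfiles declare BuildKit Dockerfile syntax 1.7 or later. The directive must appear before any other comment or instruction in the file. This syntax version covers the cache mounts (RUN --mount=type=cache) and COPY --link used in the practices below.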

2. Use uv for Python Package Installation

Replace pip with uv, which its maintainers benchmark at 10-100x faster than pip:

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install packages with a cache mount (--no-cache is omitted: it would bypass the mounted cache)
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system \
        'package>=1.0,<2.0'

Benefits:

  • Parallel downloads and installs
  • Better dependency resolution
  • Consistent with ADR-0012 (uv for Python development)
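
As a sketch of how this pattern extends to a whole requirements file: the fragment below assumes a `python:3.12-slim` base image and a `requirements.txt` at the root of the build context, neither of which this ADR specifies. It bind-mounts the file instead of copying it into a layer, and sets `UV_LINK_MODE=copy` because the cache mount sits on a different filesystem from the install target, where uv cannot hard-link.

```dockerfile
# syntax=docker/dockerfile:1.7
# Assumption: base image and requirements.txt are illustrative, not mandated by this ADR.
FROM python:3.12-slim

COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# The cache mount is a separate filesystem, so copy files instead of hard-linking them.
ENV UV_LINK_MODE=copy

# requirements.txt is bind-mounted for this step only; the uv cache persists across builds.
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements.txt,target=requirements.txt \
    uv pip install --system -r requirements.txt
```

Because the requirements file is mounted and not copied, it adds no layer of its own. The install step still reruns when the file's contents change.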

3. Cache Mounts for Package Managers

# APT cache mount
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        package1 package2

# uv cache mount owned by the non-root user (uid/gid 1000) so uv can write to it
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system 'package>=1.0'
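
One caveat with the APT mounts: official Debian and Ubuntu base images ship an `/etc/apt/apt.conf.d/docker-clean` hook that deletes downloaded packages after each install, which leaves the cache mount empty. A minimal sketch of the workaround described in the Dockerfile reference, assuming a Debian-based image (`debian:bookworm-slim` and the `curl` package are placeholders):

```dockerfile
# syntax=docker/dockerfile:1.7
FROM debian:bookworm-slim

# Disable the hook that empties /var/cache/apt, and keep downloaded .deb files.
RUN rm -f /etc/apt/apt.conf.d/docker-clean; \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
        > /etc/apt/apt.conf.d/keep-cache

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends curl
```

With package lists and archives held in cache mounts, a trailing `rm -rf /var/lib/apt/lists/*` is unnecessary: the contents of those directories never land in the image layer.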

4. OCI Image Specification Labels

All images include standard metadata:

LABEL org.opencontainers.image.title="Service Name"
LABEL org.opencontainers.image.description="Service description"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"
LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/repo"
LABEL org.opencontainers.image.licenses="MIT"
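
The labels above are static. The OCI annotation set also defines `version`, `revision`, and `created` keys, which change per build. One way to fill them, assuming the CI job passes the hypothetical `VERSION`, `REVISION`, and `CREATED` build arguments:

```dockerfile
# Assumption: these build args are supplied by CI; the defaults only mark local builds.
# Place this fragment after FROM, inside the final stage.
ARG VERSION=dev
ARG REVISION=unknown
ARG CREATED=unknown

LABEL org.opencontainers.image.version="${VERSION}" \
      org.opencontainers.image.revision="${REVISION}" \
      org.opencontainers.image.created="${CREATED}"
```

Placing the fragment near the end of the final stage keeps the per-build values from invalidating the cache for earlier layers. A build could pass, for example, `--build-arg REVISION=$(git rev-parse HEAD)`.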

5. Health Checks

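All service images include a HEALTHCHECK. The example assumes curl is installed in the image and that the service exposes /health on port 8000: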

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
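
For Python service images that do not ship curl, a possible alternative uses the interpreter that is already present. This sketch reuses the port and `/health` path from the example above and assumes `python` is on the `PATH`:

```dockerfile
# urlopen raises on connection errors and on HTTP 4xx/5xx responses,
# so the process exits non-zero and the check is marked unhealthy.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"
```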

6. Non-Root Execution

Services run as unprivileged users:

# or appuser, 1000:1000
USER ray
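
The `ray` user in the example is expected to come from the base image. For images built from a generic base, the user has to be created first. A minimal sketch, assuming a Debian-based image and the hypothetical `appuser` name and `/app` directory:

```dockerfile
FROM python:3.12-slim

# Create an unprivileged user and group with fixed IDs.
RUN groupadd --gid 1000 appuser \
 && useradd --uid 1000 --gid 1000 --create-home --shell /usr/sbin/nologin appuser

WORKDIR /app
# Give the application files to the unprivileged user at copy time.
COPY --chown=1000:1000 . /app

# A numeric UID lets Kubernetes verify runAsNonRoot without resolving a user name.
USER 1000:1000
```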

7. Version Pinning with Ranges

Dependencies specify a minimum version and an upper bound:

RUN uv pip install --system \
    'transformers>=4.35.0,<5.0' \
    'torch>=2.0.0,<3.0'

8. Layer Optimization

  • Combine related commands into single RUN layers
  • Order instructions from least to most frequently changing
  • Use multi-stage builds to reduce final image size
  • Use COPY --link for multi-stage COPY --from layers to make them independent of prior layers, improving cache reuse when base images change:
# --link makes this layer reusable even if the base image changes
COPY --link --from=rocm-source /opt/rocm /opt/rocm
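
The multi-stage and ordering advice above can be combined as in the sketch below. The stage names `builder` and `runtime`, the `python:3.12-slim` base, and the `requirements.txt` and `src/` paths are all assumptions for illustration. Dependencies are installed into a virtual environment in the first stage, and only that directory is carried into the final image.

```dockerfile
# syntax=docker/dockerfile:1.7
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
ENV UV_LINK_MODE=copy

# Dependencies change less often than source code, so they are installed first.
RUN uv venv /opt/venv
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements.txt,target=requirements.txt \
    uv pip install --python /opt/venv/bin/python -r requirements.txt

FROM python:3.12-slim AS runtime
# Only the virtual environment is carried over; uv and its cache stay in the builder stage.
COPY --link --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Application source changes most often, so it is copied last.
COPY src/ /app/src/
```

Both stages use the same base image so that the interpreter path recorded in the virtual environment remains valid in the final stage.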

9. Registry-Based BuildKit Cache

Use type=registry cache instead of type=gha (which only works on GitHub Actions). This stores build cache layers directly in the container registry with zstd compression:

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd

Benefits:

  • Works on any CI system (Gitea Actions, Jenkins, etc.)
  • mode=max caches all layers, not just final image layers
  • compression=zstd is faster than gzip with similar compression ratios
  • Cache survives runner restarts (stored in registry, not ephemeral disk)

Important: type=gha is a no-op on self-hosted Gitea runners — it requires GitHub's cache API. Always use type=registry for self-hosted CI.

10. .dockerignore

All repos include a .dockerignore:

.git
.gitea
*.md
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/

11. Makefile Integration

Standard targets for building and linting:

.PHONY: lint build

lint:
	hadolint Dockerfile

build:
	docker buildx build --platform linux/amd64 --load -t image:tag .

Consequences

Positive

  • Faster Python package installs (up to 10-100x over pip) with uv and cache mounts
  • More consistent builds via bounded version ranges (plus lockfiles where a repo maintains them)
  • Better observability through OCI labels
  • Improved security with non-root execution
  • Faster CI/CD through BuildKit caching

Negative

  • Requires Docker BuildKit - Use docker buildx, or set DOCKER_BUILDKIT=1 on Docker Engine versions older than 23.0, where BuildKit is not the default builder
  • Cache invalidation complexity - Cache mounts persist across builds, so stale or corrupted cache contents carry over until the builder cache is pruned
  • Learning curve - Developers must understand BuildKit syntax