daviestechlabs/homelab-design

Billy D. dd277f6459 tuning up runner improvements.

2026-02-06 07:53:31 -05:00

ADR-0014: Docker Build Best Practices

Status

Accepted

Date

2026-02-02

Context

Our ML/AI platform relies heavily on containerized services, particularly GPU workers for KubeRay that include large dependencies (PyTorch, vLLM, ROCm, CUDA). These images can take 30+ minutes to build and exceed 10 GB in size. We need standardized practices to ensure:

Fast rebuilds - Avoid re-downloading dependencies on every build
Reproducibility - Consistent builds across different machines
Security - Non-root execution, minimal attack surface
Observability - Proper metadata for image management
Consistency - Same patterns across all Dockerfiles

Decision

We adopt the following Docker build best practices across all repositories:

1. BuildKit Syntax and Features

# syntax=docker/dockerfile:1.7

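All Dockerfiles begin with this directive, which must be the first line of the file. It selects Dockerfile frontend 1.7 or later, which covers the cache mounts and COPY --link used below.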

2. Use uv for Python Package Installation

Replace pip with uv for much faster installs (uv's published benchmarks claim 10-100x over pip):

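# Install uv (consider pinning a specific tag or digest instead of :latest for reproducibility)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Install packages with a cache mount.
# Do not pass --no-cache here: it makes uv bypass the mounted cache directory.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system \
        'package>=1.0,<2.0'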

Benefits:

Parallel downloads and installs
Better dependency resolution
Consistent with ADR-0012 (uv for Python development)
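As a fuller illustration of the pattern above, the following is a minimal sketch that installs from a requirements file. The base image `python:3.12-slim` and the file name `requirements.txt` are assumptions for the example, not something this ADR mandates. Setting `UV_LINK_MODE=copy` follows uv's Docker guidance, since the cache mount usually lives on a different filesystem from the install target and hardlinking would otherwise fall back with a warning.

```dockerfile
# syntax=docker/dockerfile:1.7
# Assumed base image; substitute the project's real base.
FROM python:3.12-slim

# Copy the uv binary from the official image (pin a tag in real builds).
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# The cache mount is on a separate filesystem, so copy instead of hardlinking.
ENV UV_LINK_MODE=copy

# Bind-mount the (hypothetical) requirements.txt so it never becomes a layer,
# and reuse downloaded wheels across builds through the cache mount.
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements.txt,target=/tmp/requirements.txt \
    uv pip install --system -r /tmp/requirements.txt
```

Because the requirements file is bind-mounted rather than copied, the layer is rebuilt only when the file's contents change.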

3. Cache Mounts for Package Managers

# APT cache mount
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        package1 package2

# uv/pip cache mount
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system 'package>=1.0'
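One caveat worth illustrating for the APT mount: Debian and Ubuntu base images typically ship an `/etc/apt/apt.conf.d/docker-clean` hook that deletes downloaded packages after each install, which leaves the cache mount empty. The sketch below, adapted from the pattern in Docker's own documentation, disables that hook first. The base image and package names are placeholders.

```dockerfile
# syntax=docker/dockerfile:1.7
# Assumed base image for illustration.
FROM ubuntu:22.04

# Stop the image's docker-clean hook from emptying the APT cache,
# and tell APT to keep downloaded .deb files.
RUN rm -f /etc/apt/apt.conf.d/docker-clean && \
    echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
        > /etc/apt/apt.conf.d/keep-cache

# With the hook removed, the cache mounts actually retain packages and
# index files between builds. Placeholder packages shown.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        curl ca-certificates
```

Note that with these mounts in place there is no need for a trailing `rm -rf /var/lib/apt/lists/*`, since the mounted directories are not written into the image layer.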

4. OCI Image Specification Labels

All images include standard metadata:

LABEL org.opencontainers.image.title="Service Name"
LABEL org.opencontainers.image.description="Service description"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"
LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/repo"
LABEL org.opencontainers.image.licenses="MIT"
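The static labels above can be complemented by per-build metadata. The sketch below uses three further keys defined by the OCI annotation spec (`version`, `revision`, `created`); the build argument names `VERSION`, `VCS_REF`, and `BUILD_DATE` are hypothetical and would be supplied by CI.

```dockerfile
# Assumed base image for illustration.
FROM python:3.12-slim

# ... application layers go here ...

# Declare build args last: their values change on every build, and any
# RUN instruction placed after them would lose its layer cache.
ARG VERSION=dev
ARG VCS_REF=unknown
ARG BUILD_DATE=unknown

LABEL org.opencontainers.image.version="${VERSION}" \
      org.opencontainers.image.revision="${VCS_REF}" \
      org.opencontainers.image.created="${BUILD_DATE}"
```

A build would pass the values with flags such as `--build-arg VCS_REF="$(git rev-parse HEAD)"`; the resulting labels can then be read back with `docker inspect`.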

5. Health Checks

All service images include HEALTHCHECK:

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
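The curl-based check assumes curl is installed in the final image, which slim bases often lack. For Python services, one possible alternative is to use the interpreter that is already present. This is a sketch only; the port and `/health` path are carried over from the example above, and the base image is an assumption.

```dockerfile
# Assumed base image for illustration.
FROM python:3.12-slim

# urlopen raises on connection errors and on HTTP status >= 400, so the
# interpreter exits non-zero and Docker marks the check as failed.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
```

This avoids adding curl purely for health checking, which also keeps the attack surface slightly smaller.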

6. Non-Root Execution

Services run as unprivileged users:

# Alternatives: appuser, or the numeric form 1000:1000
USER ray
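The USER instruction assumes the account already exists in the image (the Ray base images provide `ray`). For images built from a generic base, a sketch of creating the user might look like the following. The name `appuser`, the `/app` directory, and the Debian-based base image are assumptions for the example.

```dockerfile
# Assumed Debian-based image, so groupadd/useradd are available.
FROM python:3.12-slim

# Create an unprivileged user with fixed IDs and no login shell.
RUN groupadd --gid 1000 appuser && \
    useradd --uid 1000 --gid 1000 --create-home \
        --shell /usr/sbin/nologin appuser

WORKDIR /app

# Give the application files to the unprivileged user at copy time,
# avoiding a separate chown layer.
COPY --chown=1000:1000 . /app

# The numeric form lets Kubernetes verify runAsNonRoot without
# resolving the user name inside the image.
USER 1000:1000
```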

7. Version Pinning with Ranges

Dependencies specify a minimum version and an upper bound:

RUN uv pip install --system \
    'transformers>=4.35.0,<5.0' \
    'torch>=2.0.0,<3.0'

8. Layer Optimization

Combine related commands into single RUN layers
Order instructions from least to most frequently changing
Use multi-stage builds to reduce final image size
Use COPY --link for multi-stage COPY --from layers to make them independent of prior layers, improving cache reuse when base images change:

# --link makes this layer reusable even if the base image changes
COPY --link --from=rocm-source /opt/rocm /opt/rocm
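Taken together, the points above might be combined as in the following sketch. Everything named here is illustrative: the base image, `requirements.txt`, the `src/` layout, and the `src.main` entry point are assumptions, and a virtual environment is used (rather than `--system`) only so that the installed packages can be copied between stages as one directory.

```dockerfile
# syntax=docker/dockerfile:1.7

# --- Build stage: holds uv and the download cache, discarded afterwards ---
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
ENV UV_LINK_MODE=copy \
    VIRTUAL_ENV=/opt/venv

# Dependencies change rarely, so they are installed before any source code.
# Venv creation and install are related, so they share one RUN layer.
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements.txt,target=/tmp/requirements.txt \
    uv venv /opt/venv && \
    uv pip install -r /tmp/requirements.txt

# --- Runtime stage: same base, so the venv's interpreter path resolves ---
FROM python:3.12-slim AS runtime

# --link keeps this layer reusable even if the runtime base is rebuilt.
COPY --link --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"

# Source code changes most often, so it is copied last.
COPY --link src/ /app/src/
WORKDIR /app
USER 1000:1000
CMD ["python", "-m", "src.main"]
```

With this ordering, a source-only change rebuilds just the final COPY layer, while the dependency layer and its cache are left untouched.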

9. Registry-Based BuildKit Cache

Use type=registry cache instead of type=gha (which only works on GitHub Actions). This stores build cache layers directly in the container registry with zstd compression:

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd

Benefits:

Works on any CI system (Gitea Actions, Jenkins, etc.)
mode=max caches all layers, not just final image layers
compression=zstd is faster than gzip with similar compression ratios
Cache survives runner restarts (stored in registry, not ephemeral disk)

Important: type=gha is a no-op on self-hosted Gitea runners — it requires GitHub's cache API. Always use type=registry for self-hosted CI.

10. .dockerignore

All repos include a .dockerignore:

.git
.gitea
*.md
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/

11. Makefile Integration

Standard targets for building and linting:

.PHONY: lint build

lint:
	hadolint Dockerfile

build:
	docker buildx build --platform linux/amd64 --load -t image:tag .

Consequences

Positive

10-100x faster Python package installs with uv, plus cache mounts that avoid re-downloads
Consistent builds via version pinning (lockfiles, where a project uses them, tighten this further)
Better observability through OCI labels
Improved security with non-root execution
Faster CI/CD through BuildKit caching

Negative

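Requires Docker BuildKit - The default builder since Docker Engine 23.0; older engines must set DOCKER_BUILDKIT=1 or use buildx
Cache invalidation complexity - Cache mounts persist across builds and sit outside the layer cache, so a rebuild with --no-cache does not clear them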
Learning curve - Developers must understand BuildKit syntax

ADR-0011 - KubeRay GPU backend
ADR-0012 - uv for Python development
ADR-0013 - Gitea Actions CI/CD
