# ADR-0014: Docker Build Best Practices

## Status

Accepted

## Date

2026-02-02

## Context

Our ML/AI platform relies heavily on containerized services, particularly GPU workers
for KubeRay that include large dependencies (PyTorch, vLLM, ROCm, CUDA). These images
can take 30+ minutes to build and exceed 10GB in size. We need standardized practices
to ensure:

1. **Fast rebuilds** - Avoid re-downloading dependencies on every build
2. **Reproducibility** - Consistent builds across different machines
3. **Security** - Non-root execution, minimal attack surface
4. **Observability** - Proper metadata for image management
5. **Consistency** - Same patterns across all Dockerfiles

## Decision

We adopt the following Docker build best practices across all repositories:

### 1. BuildKit Syntax and Features

```dockerfile
# syntax=docker/dockerfile:1.7
```

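All Dockerfiles start with this directive as their first line. It selects Dockerfile frontend 1.7, which covers the features used below: cache mounts (`RUN --mount=type=cache`, available since 1.2) and `COPY --link` (available since 1.4).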

### 2. Use uv for Python Package Installation

Replace pip with uv for much faster installs (uv's maintainers report 10-100x over pip):

```dockerfile
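# Install uv. Pin a specific release instead of :latest so builds stay
# reproducible (0.5.11 is an example pin, not a mandated version).
COPY --from=ghcr.io/astral-sh/uv:0.5.11 /uv /usr/local/bin/uv

# Install packages with a cache mount. Do not pass --no-cache here:
# it disables uv's cache and makes the mount pointless.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system \
        'package>=1.0,<2.0'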
```

Benefits:
- Parallel downloads and installs
- Better dependency resolution
- Consistent with ADR-0012 (uv for Python development)
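
As a rough sketch of how this might look with a requirements file, the fragment below bind-mounts the file into the install step so it never becomes a layer of its own. The base image, the uv version pin, and the `requirements.txt` name are assumptions for illustration, not part of this decision.

```dockerfile
# syntax=docker/dockerfile:1.7
# Assumed base image; substitute the service's real base.
FROM python:3.12-slim

# Example pin; choose the uv release the team has validated.
COPY --from=ghcr.io/astral-sh/uv:0.5.11 /uv /usr/local/bin/uv

# The cache mount lives on a different filesystem than site-packages,
# so tell uv to copy files instead of attempting hardlinks.
ENV UV_LINK_MODE=copy

# requirements.txt is a hypothetical file in the build context.
# The bind mount exposes it for this step only.
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements.txt,target=/tmp/requirements.txt \
    uv pip install --system -r /tmp/requirements.txt
```

With this layout the install layer is rebuilt only when `requirements.txt` changes, and even then previously downloaded wheels are served from the cache mount.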

### 3. Cache Mounts for Package Managers

```dockerfile
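# APT cache mount. Debian/Ubuntu base images ship an apt hook (docker-clean)
# that deletes downloaded packages, so remove it or the cache stays empty.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && apt-get install -y --no-install-recommends \
        package1 package2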

# uv/pip cache mount
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system 'package>=1.0'
```

### 4. OCI Image Specification Labels

All images include standard metadata:

```dockerfile
LABEL org.opencontainers.image.title="Service Name"
LABEL org.opencontainers.image.description="Service description"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"
LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/repo"
LABEL org.opencontainers.image.licenses="MIT"
```
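
The labels above are static. The OCI annotation set also defines `version`, `revision`, and `created` keys, which change on every build. One possible way to fill them is through build arguments, sketched below. The argument names and the base image are hypothetical.

```dockerfile
# Assumed base image; substitute the service's real base.
FROM python:3.12-slim

# Hypothetical build arguments; the defaults apply when none are passed.
ARG VERSION=dev
ARG REVISION=unknown
ARG CREATED=unknown

# Per-build metadata using standard OCI annotation keys.
LABEL org.opencontainers.image.version="${VERSION}" \
      org.opencontainers.image.revision="${REVISION}" \
      org.opencontainers.image.created="${CREATED}"
```

A CI job could then pass, for example, `--build-arg REVISION=$(git rev-parse HEAD)` to `docker buildx build`.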

### 5. Health Checks

All service images define a `HEALTHCHECK`. The probe command must exist in the image, so this example assumes `curl` is installed:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
```
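
For Python images that do not ship `curl`, a similar check can be sketched with the standard library alone. The base image is an assumption; the port and path mirror the example above.

```dockerfile
# Assumed base image with a Python interpreter on PATH.
FROM python:3.12-slim

# urlopen raises on connection errors and on HTTP 4xx/5xx responses,
# so the interpreter exits non-zero and the check is marked as failed.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
```

The exec (JSON array) form avoids depending on a shell inside the image.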

### 6. Non-Root Execution

Services run as unprivileged users:

```dockerfile
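# Dockerfile comments must be on their own line; a trailing "# ..." would
# be parsed as part of the USER argument.
# Use ray, appuser, or a numeric 1000:1000.
USER ray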
```
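
When the base image does not already provide an unprivileged account, one can be created before switching to it. The sketch below assumes a Debian-based image and uses the `appuser` name and the 1000:1000 IDs mentioned above. The `/app` path is hypothetical.

```dockerfile
# Assumed Debian-based image that provides groupadd and useradd.
FROM python:3.12-slim

# Create a dedicated group and user with fixed IDs and no login shell.
RUN groupadd --gid 1000 appuser \
    && useradd --uid 1000 --gid 1000 --create-home --shell /usr/sbin/nologin appuser

# Hypothetical application directory, owned by the unprivileged user.
WORKDIR /app
COPY --chown=1000:1000 . /app

# A numeric UID:GID lets Kubernetes verify runAsNonRoot without
# resolving a user name.
USER 1000:1000
```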

### 7. Version Pinning with Ranges

Dependencies declare a minimum version and an upper bound, so builds pick up minor and patch releases but not a breaking major release:

```dockerfile
RUN uv pip install --system \
    'transformers>=4.35.0,<5.0' \
    'torch>=2.0.0,<3.0'
```

### 8. Layer Optimization

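- Combine related commands into a single `RUN` layer
- Order instructions from least to most frequently changing, so early layers stay cached when only application code changes
- Use multi-stage builds so that build tools and caches do not end up in the final image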
- Use `COPY --link` for multi-stage `COPY --from` layers to make them independent
  of prior layers, improving cache reuse when base images change:

```dockerfile
# --link makes this layer reusable even if the base image changes
COPY --link --from=rocm-source /opt/rocm /opt/rocm
```
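
The points above can be combined into a minimal multi-stage sketch. The base image, the uv version pin, `requirements.txt`, the `src/` directory, and the `/opt/venv` path are all assumptions made for illustration.

```dockerfile
# syntax=docker/dockerfile:1.7

# Build stage: holds uv and the dependency install.
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:0.5.11 /uv /usr/local/bin/uv
ENV UV_LINK_MODE=copy

# Dependencies change rarely, so they come before application code.
COPY requirements.txt /tmp/requirements.txt
RUN --mount=type=cache,target=/root/.cache/uv \
    uv venv /opt/venv && \
    uv pip install --python /opt/venv/bin/python -r /tmp/requirements.txt

# Runtime stage: same base image, without uv or build leftovers.
FROM python:3.12-slim AS runtime

# --link keeps this layer independent of the layers beneath it.
COPY --link --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Application code changes most often, so it is copied last.
WORKDIR /app
COPY src/ /app/src/
```

Both stages use the same base image so that the interpreter path recorded in the virtual environment stays valid after the copy.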

### 9. Registry-Based BuildKit Cache

Use `type=registry` cache instead of `type=gha` (which only works on GitHub Actions).
This stores build cache layers directly in the container registry with zstd compression:

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
```

Benefits:
- Works on any CI system (Gitea Actions, Jenkins, etc.)
- `mode=max` caches all layers, not just final image layers
- `compression=zstd` typically compresses and decompresses faster than gzip at similar compression ratios
- Cache survives runner restarts (stored in registry, not ephemeral disk)

**Important:** `type=gha` depends on GitHub's cache API, so it is effectively a no-op on
self-hosted Gitea runners. Always use `type=registry` for self-hosted CI.

### 10. .dockerignore

All repos include a `.dockerignore`:

```
.git
.gitea
*.md
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/
```

### 11. Makefile Integration

Standard targets for building and linting:

```makefile
.PHONY: lint build

# Recipe lines must be indented with a tab, not spaces
lint:
	hadolint Dockerfile

build:
	docker buildx build --platform linux/amd64 --load -t image:tag .
```

## Consequences

### Positive

- **Faster Python package installs** with uv and cache mounts (uv's maintainers report 10-100x over pip)
- **Consistent builds** via lockfiles and version pinning
- **Better observability** through OCI labels
- **Improved security** with non-root execution
- **Faster CI/CD** through BuildKit caching

### Negative

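- **Requires Docker BuildKit** - Default since Docker Engine 23.0; older engines must set `DOCKER_BUILDKIT=1` or use buildx
- **Cache invalidation complexity** - Cache mounts persist across builds and live outside the image layers, so a stale or corrupted cache must be cleared explicitly (for example with `docker builder prune`)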
- **Learning curve** - Developers must understand BuildKit syntax

## Related ADRs

- [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay GPU backend
- [ADR-0012](0012-use-uv-for-python-development.md) - uv for Python development
- [ADR-0013](0013-gitea-actions-for-ci.md) - Gitea Actions CI/CD