# ADR-0014: Docker Build Best Practices
## Status
Accepted
## Date
2026-02-02
## Context
Our ML/AI platform relies heavily on containerized services, particularly GPU workers
for KubeRay that include large dependencies (PyTorch, vLLM, ROCm, CUDA). These images
can take 30+ minutes to build and exceed 10GB in size. We need standardized practices
to ensure:
1. **Fast rebuilds** - Avoid re-downloading dependencies on every build
2. **Reproducibility** - Consistent builds across different machines
3. **Security** - Non-root execution, minimal attack surface
4. **Observability** - Proper metadata for image management
5. **Consistency** - Same patterns across all Dockerfiles
## Decision
We adopt the following Docker build best practices across all repositories:
### 1. BuildKit Syntax and Features
```dockerfile
# syntax=docker/dockerfile:1.7
```
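All Dockerfiles declare the BuildKit frontend `docker/dockerfile:1.7` or newer as their first line. The directive must precede any other comment or instruction. The cache mounts and `COPY --link` used in the sections below depend on the BuildKit frontend.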
### 2. Use uv for Python Package Installation
Replace pip with uv, which its maintainers benchmark at 10-100x faster than pip:
```dockerfile
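# Install uv (pin a specific tag instead of latest for reproducible builds)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Install packages with a cache mount. Do not pass --no-cache here:
# it disables the uv cache and leaves the mount unused.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system \
    'package>=1.0,<2.0'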
```
Benefits:
- Parallel downloads and installs
- Better dependency resolution
- Consistent with ADR-0012 (uv for Python development)
### 3. Cache Mounts for Package Managers
```dockerfile
# APT cache mount
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
    package1 package2
# uv/pip cache mount (uid/gid make the cache writable by the ray user)
RUN --mount=type=cache,target=/home/ray/.cache/uv,uid=1000,gid=1000 \
    uv pip install --system 'package>=1.0'
```
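One caveat with the APT mount: official Debian and Ubuntu base images ship `/etc/apt/apt.conf.d/docker-clean`, which deletes downloaded `.deb` files after every install and would leave the cache mount empty. The sketch below, which assumes an `ubuntu:22.04` base and example package names, removes that hook before using the mount:
```dockerfile
# syntax=docker/dockerfile:1.7
# Assumption: a Debian/Ubuntu base image; the tag is only an example.
FROM ubuntu:22.04

# Keep downloaded packages so the cache mount actually retains them.
RUN rm -f /etc/apt/apt.conf.d/docker-clean \
    && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
       > /etc/apt/apt.conf.d/keep-cache

# Both mounts are locked so parallel builds do not corrupt the APT state.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates curl
```
Because the package lists live in a cache mount rather than an image layer, the usual `rm -rf /var/lib/apt/lists/*` cleanup step is not needed.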
### 4. OCI Image Specification Labels
All images include standard metadata:
```dockerfile
LABEL org.opencontainers.image.title="Service Name"
LABEL org.opencontainers.image.description="Service description"
LABEL org.opencontainers.image.vendor="DaviesTechLabs"
LABEL org.opencontainers.image.source="https://git.daviestechlabs.io/daviestechlabs/repo"
LABEL org.opencontainers.image.licenses="MIT"
```
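The OCI annotation spec also defines `org.opencontainers.image.revision` and `org.opencontainers.image.created`, which change on every build and are usually passed as build arguments. A minimal sketch, assuming hypothetical argument names `VCS_REF` and `BUILD_DATE` and an example base image:
```dockerfile
# Assumption: any base image; python:3.12-slim is only an example.
FROM python:3.12-slim

# ... build steps ...

# Declare per-build arguments as late as possible: the cache is invalidated
# from the first instruction that uses a changed ARG value.
ARG VCS_REF=unknown
ARG BUILD_DATE=unknown
LABEL org.opencontainers.image.revision="${VCS_REF}" \
      org.opencontainers.image.created="${BUILD_DATE}"
```
The values can be supplied at build time, for example with `--build-arg VCS_REF=$(git rev-parse HEAD)` and `--build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)`.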
### 5. Health Checks
All service images include HEALTHCHECK:
```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
```
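The `curl` form above only works if `curl` is installed in the final image. For slim Python images that omit it, one alternative is to probe the endpoint with the standard library. This is a sketch that assumes `python` is on the `PATH` and that the service exposes the same `/health` endpoint on port 8000:
```dockerfile
# urlopen raises on connection errors and on HTTP 4xx/5xx responses,
# so the interpreter exits non-zero and the check is marked unhealthy.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
```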
### 6. Non-Root Execution
Services run as unprivileged users:
```dockerfile
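# Alternatives: appuser, or a numeric 1000:1000.
# Dockerfiles do not support trailing comments, so keep this on its own line.
USER ray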
```
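Base images that do not already provide an unprivileged user need one created before the `USER` instruction. A minimal sketch, assuming a Debian-based image and the hypothetical user name `appuser` with UID/GID 1000:
```dockerfile
# Assumption: Debian-based image that ships groupadd/useradd.
FROM python:3.12-slim

# Create a fixed-ID user without a login shell.
RUN groupadd --gid 1000 appuser \
    && useradd --uid 1000 --gid 1000 --create-home \
       --shell /usr/sbin/nologin appuser

WORKDIR /app
# Copy application files owned by the unprivileged user.
COPY --chown=1000:1000 . /app

# A numeric UID lets Kubernetes verify runAsNonRoot without resolving names.
USER 1000:1000
```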
### 7. Version Pinning with Ranges
Dependencies use minimum version with upper bound:
```dockerfile
RUN uv pip install --system \
    'transformers>=4.35.0,<5.0' \
    'torch>=2.0.0,<3.0'
```
### 8. Layer Optimization
- Combine related commands into single RUN layers
- Order from least to most frequently changing
- Use multi-stage builds to reduce final image size
- Use `COPY --link` for multi-stage `COPY --from` layers to make them independent
of prior layers, improving cache reuse when base images change:
```dockerfile
# --link makes this layer reusable even if the base image changes
COPY --link --from=rocm-source /opt/rocm /opt/rocm
```
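The points above can be combined in a multi-stage build. The following is a sketch under stated assumptions rather than a template from any of our repositories: the base image, the stage names, and the `/opt/venv` path are illustrative.
```dockerfile
# syntax=docker/dockerfile:1.7
# Build stage: has uv and the download cache, neither of which ships.
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# The cache mount is a separate filesystem, so copy instead of hardlinking.
ENV UV_LINK_MODE=copy
RUN --mount=type=cache,target=/root/.cache/uv \
    uv venv /opt/venv \
    && uv pip install --python /opt/venv/bin/python \
       'transformers>=4.35.0,<5.0'

# Runtime stage: same base image, so the venv's interpreter path stays valid.
FROM python:3.12-slim AS runtime
# --link keeps this layer reusable when the runtime base image is updated.
COPY --link --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```
Only the virtual environment crosses the stage boundary, so uv itself and any build-time tooling stay out of the final image.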
### 9. Registry-Based BuildKit Cache
Use `type=registry` cache instead of `type=gha` (which only works on GitHub Actions).
This stores build cache layers directly in the container registry with zstd compression:
```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
```
Benefits:
- Works on any CI system (Gitea Actions, Jenkins, etc.)
- `mode=max` caches all layers, not just final image layers
- `compression=zstd` is faster than gzip with similar compression ratios
- Cache survives runner restarts (stored in registry, not ephemeral disk)
**Important:** `type=gha` is a no-op on self-hosted Gitea runners because it requires
GitHub's cache API. Always use `type=registry` for self-hosted CI.
### 10. .dockerignore
All repos include a `.dockerignore`:
```
.git
.gitea
*.md
__pycache__/
*.pyc
.venv/
.mypy_cache/
.pytest_cache/
.ruff_cache/
```
### 11. Makefile Integration
Standard targets for building and linting:
```makefile
.PHONY: lint build

lint:
	hadolint Dockerfile

build:
	docker buildx build --platform linux/amd64 --load -t image:tag .
```
## Consequences
### Positive
- **10-100x faster pip operations** with uv cache mounts
- **Consistent builds** via bounded version ranges and shared Dockerfile patterns
- **Better observability** through OCI labels
- **Improved security** with non-root execution
- **Faster CI/CD** through BuildKit caching
### Negative
- **Requires Docker BuildKit** - The default builder since Docker Engine 23.0; older engines must set `DOCKER_BUILDKIT=1` or use buildx
- **Cache invalidation complexity** - Cache mounts persist across builds
- **Learning curve** - Developers must understand BuildKit syntax
## Related ADRs
- [ADR-0011](0011-kuberay-unified-gpu-backend.md) - KubeRay GPU backend
- [ADR-0012](0012-use-uv-for-python-development.md) - uv for Python development
- [ADR-0013](0013-gitea-actions-for-ci.md) - Gitea Actions CI/CD