From dd277f6459c3b6bb0000eafbff8ff65b66a96b86 Mon Sep 17 00:00:00 2001
From: "Billy D."
Date: Fri, 6 Feb 2026 07:53:31 -0500
Subject: [PATCH] tuning up runner improvements.

---
 decisions/0014-docker-build-best-practices.md | 33 ++++++++++-
 decisions/0031-gitea-cicd-strategy.md         | 55 +++++++++++++++++--
 2 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/decisions/0014-docker-build-best-practices.md b/decisions/0014-docker-build-best-practices.md
index cbd0dcf..b2cc324 100644
--- a/decisions/0014-docker-build-best-practices.md
+++ b/decisions/0014-docker-build-best-practices.md
@@ -110,8 +110,37 @@ RUN uv pip install --system \
 - Combine related commands into single RUN layers
 - Order from least to most frequently changing
 - Use multi-stage builds to reduce final image size
+- Use `COPY --link` for multi-stage `COPY --from` layers to make them independent
+  of prior layers, improving cache reuse when base images change:
 
-### 9. .dockerignore
+```dockerfile
+# --link makes this layer reusable even if the base image changes
+COPY --link --from=rocm-source /opt/rocm /opt/rocm
+```
+
+### 9. Registry-Based BuildKit Cache
+
+Use `type=registry` cache instead of `type=gha` (which only works on GitHub Actions).
+This stores build cache layers directly in the container registry with zstd compression:
+
+```yaml
+- name: Build and push
+  uses: docker/build-push-action@v5
+  with:
+    cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
+    cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
+```
+
+Benefits:
+- Works on any CI system (Gitea Actions, Jenkins, etc.)
+- `mode=max` caches all layers, not just final image layers
+- `compression=zstd` is faster than gzip with similar compression ratios
+- Cache survives runner restarts (stored in registry, not ephemeral disk)
+
+**Important:** `type=gha` is a no-op on self-hosted Gitea runners because it requires
+GitHub's cache API. Always use `type=registry` for self-hosted CI.
+
+### 10. .dockerignore
 
 All repos include a `.dockerignore`:
 
@@ -127,7 +156,7 @@ __pycache__/
 .ruff_cache/
 ```
 
-### 10. Makefile Integration
+### 11. Makefile Integration
 
 Standard targets for building and linting:
 
diff --git a/decisions/0031-gitea-cicd-strategy.md b/decisions/0031-gitea-cicd-strategy.md
index 4d5dec3..685a20c 100644
--- a/decisions/0031-gitea-cicd-strategy.md
+++ b/decisions/0031-gitea-cicd-strategy.md
@@ -286,13 +286,58 @@ on:
 
 See [kuberay-images/.gitea/workflows/build-push.yaml](https://git.daviestechlabs.io/daviestechlabs/kuberay-images/src/branch/main/.gitea/workflows/build-push.yaml) for complete example.
 
+## Build Performance Tuning
+
+GPU worker images are 20-30GB+ due to ROCm/CUDA/PyTorch layers. Several optimizations
+are in place to avoid multi-hour rebuild/push cycles on every change.
+
+### Registry-Based BuildKit Cache
+
+Use `type=registry` cache (not `type=gha`, which is a no-op on Gitea runners):
+
+```yaml
+cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
+cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
+```
+
+- `mode=max` caches all intermediate layers, not just the final image
+- `compression=zstd` is faster than gzip with comparable ratios
+- Cache is stored in the Gitea container registry alongside images
+- Only changed layers are rebuilt and pushed on subsequent builds
+
+### Docker Daemon Tuning
+
+The runner's DinD daemon.json is configured for parallel transfers:
+
+```json
+{
+  "max-concurrent-uploads": 10,
+  "max-concurrent-downloads": 10,
+  "features": {
+    "containerd-snapshotter": true
+  }
+}
+```
+
+Docker defaults to 3 concurrent downloads and 5 uploads, too few for many large layers.
+
+### Persistent DinD Layer Cache
+
+The runner mounts a 100Gi Longhorn PVC at `/home/rootless/.local/share/docker` to
+persist Docker's layer cache across pod restarts. Without this, every runner restart
+forces re-download of 10-20GB base images (ROCm, Ray, PyTorch).
+
+| Volume | Storage Class | Size | Purpose |
+|--------|---------------|------|---------|
+| `gitea-runner-data` | nfs-slow | 5Gi | Runner state, workspace |
+| `gitea-runner-docker-cache` | longhorn | 100Gi | Docker layer cache |
+
 ## Future Enhancements
 
-1. **Caching improvements** - Persistent layer cache across builds
-2. **Multi-arch builds** - ARM64 support for Raspberry Pi
-3. **Security scanning** - Trivy integration in CI
-4. **Signed images** - Cosign for image signatures
-5. **SLSA provenance** - Supply chain attestations
+1. **Multi-arch builds** - ARM64 support for Raspberry Pi
+2. **Security scanning** - Trivy integration in CI
+3. **Signed images** - Cosign for image signatures
+4. **SLSA provenance** - Supply chain attestations
 
 ## References
 