tuning up runner improvements.

2026-02-06 07:53:31 -05:00
parent 80fb911e22
commit dd277f6459
2 changed files with 81 additions and 7 deletions
--- a/decisions/0031-gitea-cicd-strategy.md
+++ b/decisions/0031-gitea-cicd-strategy.md
@@ -286,13 +286,58 @@ on:

 See [kuberay-images/.gitea/workflows/build-push.yaml](https://git.daviestechlabs.io/daviestechlabs/kuberay-images/src/branch/main/.gitea/workflows/build-push.yaml) for complete example.

+## Build Performance Tuning
+
+GPU worker images are 20-30GB+ due to ROCm/CUDA/PyTorch layers. Several optimizations
+are in place to avoid multi-hour rebuild/push cycles on every change.
+
+### Registry-Based BuildKit Cache
+
+Use `type=registry` cache (not `type=gha`, which is a no-op on Gitea runners):
+
+```yaml
+cache-from: type=registry,ref=${{ env.REGISTRY }}/image:buildcache
+cache-to: type=registry,ref=${{ env.REGISTRY }}/image:buildcache,mode=max,image-manifest=true,compression=zstd
+```
+
+- `mode=max` caches all intermediate layers, not just the final image
+- `compression=zstd` is faster than gzip with comparable ratios
+- Cache is stored in the Gitea container registry alongside images
+- On subsequent builds, only the first changed layer and the layers after it are rebuilt and pushed
+
+### Docker Daemon Tuning
+
+The runner's DinD `daemon.json` is configured for parallel transfers:
+
+```json
+{
+  "max-concurrent-uploads": 10,
+  "max-concurrent-downloads": 10,
+  "features": {
+    "containerd-snapshotter": true
+  }
+}
+```
+
+Docker's defaults are 3 concurrent downloads and 5 concurrent uploads, which is too few for images with many large layers.
+
+### Persistent DinD Layer Cache
+
+The runner mounts a 100Gi Longhorn PVC at `/home/rootless/.local/share/docker` to
+persist Docker's layer cache across pod restarts. Without this, every runner restart
+forces re-download of 10-20GB base images (ROCm, Ray, PyTorch).
+
+| Volume | Storage Class | Size | Purpose |
+|--------|---------------|------|---------|
+| `gitea-runner-data` | nfs-slow | 5Gi | Runner state, workspace |
+| `gitea-runner-docker-cache` | longhorn | 100Gi | Docker layer cache |
+
 ## Future Enhancements

-1. **Caching improvements** - Persistent layer cache across builds
-2. **Multi-arch builds** - ARM64 support for Raspberry Pi
-3. **Security scanning** - Trivy integration in CI
-4. **Signed images** - Cosign for image signatures
-5. **SLSA provenance** - Supply chain attestations
+1. **Multi-arch builds** - ARM64 support for Raspberry Pi
+2. **Security scanning** - Trivy integration in CI
+3. **Signed images** - Cosign for image signatures
+4. **SLSA provenance** - Supply chain attestations

 ## References