updating to match everything in my homelab.
@@ -52,8 +52,8 @@ WebAuthn provides the best security (phishing-resistant) and user experience (to
 | Application | WebAuthn Support | Current Status | Action Required |
 |-------------|------------------|----------------|-----------------|
-| Authentik | ✅ Native | ✅ Working | Configure enforcement policies |
-| Vaultwarden | ✅ Native | ⚠️ Partial | Enable in admin settings |
+| Authentik | ✅ Native | ⚠️ In Progress | Configure enforcement policies |
+| Vaultwarden | ✅ Native | ✅ Implemented | None - WebAuthn enrolled |
 
 ## Authentik Configuration
301 decisions/0031-gitea-cicd-strategy.md Normal file
@@ -0,0 +1,301 @@
# Gitea CI/CD Pipeline Strategy

* Status: accepted
* Date: 2026-02-04
* Deciders: Billy
* Technical Story: Establish CI/CD patterns for building and publishing container images via Gitea Actions

## Context and Problem Statement

The homelab uses Gitea as the Git hosting platform. Applications need automated CI/CD pipelines to build container images, run tests, and publish artifacts. Gitea Actions provides GitHub Actions-compatible workflow execution.

How do we configure CI/CD pipelines that work reliably with the homelab's self-hosted infrastructure, including a private container registry, rootless Docker-in-Docker runners, and internal services?

## Decision Drivers

* Self-hosted - no external CI/CD dependencies
* Container registry integration - push to Gitea's built-in registry
* Rootless security - runners don't require privileged containers
* Internal networking - leverage cluster service discovery
* Semantic versioning - automated version bumps based on commit messages

## Considered Options

1. **Gitea Actions with rootless DinD runners**
2. **External CI/CD (GitHub Actions, GitLab CI)**
3. **Self-hosted Jenkins/Drone**
4. **Tekton Pipelines**
## Decision Outcome

Chosen option: **Option 1 - Gitea Actions with rootless DinD runners**

Gitea Actions provides GitHub Actions compatibility, runs inside the cluster with access to internal services, and supports rootless Docker-in-Docker for secure container builds.

### Positive Consequences

* GitHub Actions syntax familiarity
* In-cluster access to internal services
* Built-in container registry integration
* No external dependencies
* Rootless execution for security

### Negative Consequences

* Some GitHub Actions may not work (org-specific actions)
* Rootless DinD has some limitations
* Self-hosted maintenance burden

## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Developer Push                                │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                Gitea Server                                 │
│                          (git.daviestechlabs.io)                            │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Actions Trigger                                                    │    │
│  │  • Push to main branch                                              │    │
│  │  • Pull request                                                     │    │
│  │  • Tag creation                                                     │    │
│  │  • workflow_dispatch                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            Gitea Actions Runner                             │
│                         (rootless Docker-in-Docker)                         │
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│   │  Checkout   │───▶│   Buildx    │───▶│    Push     │                     │
│   │             │    │   Build     │    │  Registry   │                     │
│   └─────────────┘    └─────────────┘    └──────┬──────┘                     │
│                                                │                            │
└────────────────────────────────────────────────┼────────────────────────────┘
                                                 │
                                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Gitea Container Registry                           │
│                  (gitea-http.gitea.svc.cluster.local:3000)                  │
│                                                                             │
│  Images:                                                                    │
│  • daviestechlabs/ray-worker-nvidia:v1.0.1                                  │
│  • daviestechlabs/ray-worker-rdna2:v1.0.1                                   │
│  • daviestechlabs/ray-worker-strixhalo:v1.0.1                               │
│  • daviestechlabs/ray-worker-intel:v1.0.1                                   │
│  • daviestechlabs/ntfy-discord:latest                                       │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Runner Configuration

### Rootless Docker-in-Docker

The runner uses rootless Docker for security:

```yaml
# Runner deployment uses rootless DinD
# No privileged containers required
# No sudo access in workflows
```
### Runner Registration

Runners must be registered with **project-scoped tokens**, not instance tokens:

1. Go to **Repository → Settings → Actions → Runners**
2. Create a new runner registration token at the project level
3. Use that token when registering the runner

**Common mistake:** Registering with an instance-level token causes jobs never to be picked up.
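For reference, a registration with a project-scoped token might look like the following sketch (the instance URL, runner name, and label are illustrative values, not taken from this document):

```bash
# Hypothetical invocation of Gitea's act_runner with a project-scoped token
act_runner register --no-interactive \
  --instance https://git.daviestechlabs.io \
  --token <project-scoped-token> \
  --name homelab-runner \
  --labels ubuntu-latest:docker://node:20
```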
## Registry Authentication

### Internal HTTP Endpoint

Use the internal cluster DNS name for registry access. This avoids:

- The Cloudflare tunnel's 100MB upload limit
- TLS certificate issues
- External network latency

```yaml
env:
  REGISTRY: gitea-http.gitea.svc.cluster.local:3000/daviestechlabs
  REGISTRY_HOST: gitea-http.gitea.svc.cluster.local:3000
```
### Buildx Configuration

Configure buildx to use HTTP for the internal registry:

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    buildkitd-config-inline: |
      [registry."gitea-http.gitea.svc.cluster.local:3000"]
        http = true
        insecure = true
```
### Credential Configuration

For rootless DinD, create the Docker config directly (no `docker login` - it defaults to HTTPS):

```yaml
- name: Configure Gitea Registry Auth
  if: github.event_name != 'pull_request'
  run: |
    AUTH=$(echo -n "${{ secrets.REGISTRY_USER }}:${{ secrets.REGISTRY_TOKEN }}" | base64 -w0)
    mkdir -p ~/.docker
    cat > ~/.docker/config.json << EOF
    {
      "auths": {
        "${{ env.REGISTRY_HOST }}": {
          "auth": "$AUTH"
        }
      }
    }
    EOF
```

**Important:** Buildx reads `~/.docker/config.json` for authentication during push. Do NOT use `docker login` for HTTP registries as it defaults to HTTPS.
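The `auth` entry written by the heredoc above is just the base64 encoding of `user:token`. A small Python sketch makes the shape of the file explicit (the username and token here are placeholder values, not real credentials):

```python
import base64
import json

def docker_auth_config(registry: str, user: str, token: str) -> dict:
    # Same structure the workflow step writes to ~/.docker/config.json
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    return {"auths": {registry: {"auth": auth}}}

cfg = docker_auth_config("gitea-http.gitea.svc.cluster.local:3000", "ci-bot", "s3cret")
print(json.dumps(cfg, indent=2))
```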
### Required Secrets

Configure in **Repository → Settings → Actions → Secrets**:

| Secret | Purpose |
|--------|---------|
| `REGISTRY_USER` | Gitea username with package write access |
| `REGISTRY_TOKEN` | Gitea access token with `write:package` scope |
| `DOCKERHUB_TOKEN` | (Optional) Docker Hub token for rate limit bypass |
## Semantic Versioning

### Commit Message Conventions

Version bumps are determined from commit message prefixes:

| Prefix | Bump Type | Example |
|--------|-----------|---------|
| `major:` or `BREAKING CHANGE` | Major (x.0.0) | `major: Remove deprecated API` |
| `minor:`, `feat:`, `feature:` | Minor (0.x.0) | `feat: Add new endpoint` |
| (anything else) | Patch (0.0.x) | `fix: Correct typo` |
### Version Calculation

```yaml
- name: Calculate semantic version
  id: version
  run: |
    LATEST=$(git describe --tags --abbrev=0 2>/dev/null || echo "v0.0.0")
    VERSION=${LATEST#v}
    IFS='.' read -r MAJOR MINOR PATCH <<< "$VERSION"

    MSG="${{ github.event.head_commit.message }}"
    if echo "$MSG" | grep -qiE "^major:|BREAKING CHANGE"; then
      MAJOR=$((MAJOR + 1)); MINOR=0; PATCH=0
    elif echo "$MSG" | grep -qiE "^(minor:|feat:|feature:)"; then
      MINOR=$((MINOR + 1)); PATCH=0
    else
      PATCH=$((PATCH + 1))
    fi

    echo "version=v${MAJOR}.${MINOR}.${PATCH}" >> $GITHUB_OUTPUT
```
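The bump logic in that shell step can be sketched in Python for clarity (a simplified mirror, not part of the pipeline; `next_version` is a hypothetical helper name):

```python
import re

def next_version(latest_tag: str, commit_msg: str) -> str:
    # Mirrors the shell step: major on "major:"/"BREAKING CHANGE",
    # minor on "minor:"/"feat:"/"feature:", patch otherwise.
    major, minor, patch = (int(p) for p in latest_tag.lstrip("v").split("."))
    if re.search(r"^major:|BREAKING CHANGE", commit_msg, re.IGNORECASE | re.MULTILINE):
        major, minor, patch = major + 1, 0, 0
    elif re.match(r"(minor:|feat:|feature:)", commit_msg, re.IGNORECASE):
        minor, patch = minor + 1, 0
    else:
        patch += 1
    return f"v{major}.{minor}.{patch}"
```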
### Automatic Tagging

After a successful build, create and push a git tag:

```yaml
- name: Create and push tag
  run: |
    # $VERSION and $BUMP are assumed to be exported from the version-calculation step
    git config user.name "gitea-actions[bot]"
    git config user.email "actions@git.daviestechlabs.io"
    git tag -a "$VERSION" -m "Release $VERSION ($BUMP)"
    git push origin "$VERSION"
```
## Notifications

### ntfy Integration

Send build status to ntfy for notifications:

```yaml
- name: Notify on success
  run: |
    curl -s \
      -H "Title: ✅ Images Built: ${{ gitea.repository }}" \
      -H "Priority: default" \
      -H "Tags: white_check_mark,docker" \
      -d "Version: ${{ needs.determine-version.outputs.version }}" \
      http://ntfy.observability.svc.cluster.local:80/gitea-ci
```
## Skip Patterns

### Commit Message Skip Flags

| Flag | Effect |
|------|--------|
| `[skip images]` | Skip all image builds |
| `[ray-serve only]` | Skip worker images |
| `[skip ci]` | Skip entire workflow |
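Gitea typically honors `[skip ci]` itself before a workflow starts; the custom flags can be implemented with job-level conditions along these lines (a sketch; the job name is illustrative):

```yaml
jobs:
  build-worker-images:
    # Custom flags are checked explicitly; `[skip ci]` is handled by Gitea
    if: >-
      !contains(github.event.head_commit.message, '[skip images]') &&
      !contains(github.event.head_commit.message, '[ray-serve only]')
```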
### Path-based Triggers

Only run on relevant file changes:

```yaml
on:
  push:
    paths:
      - 'dockerfiles/**'
      - '.gitea/workflows/build-push.yaml'
```
## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Jobs not picked up | Instance token instead of project token | Re-register with project-scoped token |
| 401 Unauthorized | Missing or wrong registry credentials | Check `REGISTRY_USER` and `REGISTRY_TOKEN` secrets |
| "http: server gave HTTP response to HTTPS client" | Using `docker login` with HTTP registry | Create `config.json` directly; don't use `docker login` |
| Cloudflare 100MB upload limit | Using external endpoint for large images | Use internal HTTP endpoint |
| TLS certificate error | Using HTTPS with self-signed cert | Use internal HTTP endpoint with buildkitd `http = true` |
| sudo not found | Rootless DinD has no sudo | Use user-space configuration methods |
| "must contain at least one job without dependencies" | All jobs have `needs` | Ensure at least one job has no `needs` clause |

### Debugging

1. Check runner logs in the Gitea Actions UI
2. Add debug output: `echo "::debug::Variable=$VAR"`
3. Set the `ACTIONS_STEP_DEBUG` secret to `true` for verbose step logging
## Workflow Template

See [kuberay-images/.gitea/workflows/build-push.yaml](https://git.daviestechlabs.io/daviestechlabs/kuberay-images/src/branch/main/.gitea/workflows/build-push.yaml) for a complete example.

## Future Enhancements

1. **Caching improvements** - Persistent layer cache across builds
2. **Multi-arch builds** - ARM64 support for Raspberry Pi
3. **Security scanning** - Trivy integration in CI
4. **Signed images** - Cosign for image signatures
5. **SLSA provenance** - Supply chain attestations

## References

* [Gitea Actions Documentation](https://docs.gitea.com/usage/actions/overview)
* [Docker Buildx Documentation](https://docs.docker.com/build/buildx/)
* [Semantic Versioning](https://semver.org/)
180 decisions/0032-velero-backup-strategy.md Normal file
@@ -0,0 +1,180 @@
# Velero Backup and Disaster Recovery Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish cluster backup and disaster recovery capabilities

## Context and Problem Statement

A homelab running critical workloads (AI/ML pipelines, databases, productivity apps) needs protection against data loss from hardware failures, misconfigurations, or disasters. Kubernetes resources and persistent data must be recoverable.

How do we implement backup and disaster recovery for the homelab cluster?

## Decision Drivers

* Full cluster state backup - resources, secrets, PVCs
* Application-consistent backups for databases
* S3-compatible storage for off-cluster backups
* Scheduled automated backups
* Selective restore capability
* GitOps compatibility

## Considered Options

1. **Velero with Node Agent (Kopia)**
2. **Kasten K10**
3. **Longhorn snapshots only**
4. **etcd snapshots + manual PVC backups**
## Decision Outcome

Chosen option: **Option 1 - Velero with Node Agent (Kopia)**

Velero provides comprehensive Kubernetes backup/restore with file-level PVC backups via the Node Agent (formerly Restic, now Kopia). Backups are stored on the external NAS via S3-compatible storage.

### Positive Consequences

* Full cluster state captured (deployments, secrets, configmaps)
* PVC data backed up via file-level snapshots
* S3 backend on NAS for off-cluster storage
* Scheduled daily backups with retention
* Selective namespace/label restore
* Active CNCF project with strong community

### Negative Consequences

* Node Agent runs as DaemonSet (14 pods on current cluster)
* File-level backup slower than volume snapshots
* Full cluster restore requires careful ordering
* Some CRDs may need special handling

## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                         Velero Server                         │
│                      (velero namespace)                       │
└────────────────────────────┬──────────────────────────────────┘
                             │
               ┌─────────────┼─────────────┐
               │             │             │
               ▼             ▼             ▼
         ┌───────────┐ ┌───────────┐ ┌───────────┐
         │   Node    │ │   Node    │ │   Node    │
         │   Agent   │ │   Agent   │ │   Agent   │
         │ (per node)│ │ (per node)│ │ (per node)│
         └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
               │             │             │
               └─────────────┼─────────────┘
                             │
                             ▼
              ┌─────────────────────────────┐
              │    BackupStorageLocation    │
              │  (S3 on NAS - candlekeep)   │
              │       /backups/velero       │
              └─────────────────────────────┘
```
## Configuration

### Schedule

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-node-lease
      - kube-public
    includedResources:
      - "*"
    defaultVolumesToFsBackup: true
    ttl: 720h  # 30 days retention
```
### Backup Storage Location

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: velero
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://candlekeep.lab.daviestechlabs.io:9000
```
## Backup Scope

### Included

| Category | Examples | Backup Method |
|----------|----------|---------------|
| Kubernetes resources | Deployments, Services, ConfigMaps | Velero native |
| Secrets | Vault-synced, SOPS-decrypted | Velero native |
| Persistent Volumes | Database data, user files | Node Agent (Kopia) |
| CRDs | CNPG Clusters, RayServices, HelmReleases | Velero native |

### Excluded

| Category | Reason |
|----------|--------|
| kube-system | Rebuilt from Talos config |
| flux-system | Rebuilt from Git (GitOps) |
| Node-local data | Ephemeral, not critical |
## Recovery Procedures

### Full Cluster Recovery

1. Bootstrap new Talos cluster
2. Install Velero with same BSL configuration
3. `velero restore create --from-backup nightly-cluster-backup-YYYYMMDD`
4. Re-bootstrap Flux for GitOps reconciliation

### Selective Namespace Recovery

```bash
velero restore create \
  --from-backup nightly-cluster-backup-20260205020000 \
  --include-namespaces ai-ml \
  --restore-volumes=true
```
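Restore progress can then be inspected with the standard Velero CLI (a sketch; the restore name shown is illustrative, since Velero generates it from the backup name and a timestamp):

```bash
velero restore get
velero restore describe nightly-cluster-backup-20260205020000-20260206 --details
velero restore logs nightly-cluster-backup-20260205020000-20260206
```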
### Database Recovery (CNPG)

For CNPG clusters, prefer CNPG's native point-in-time recovery (PITR):

```bash
# CNPG handles its own WAL archiving to S3
# Velero provides a secondary backup layer
```
## Monitoring

| Metric | Alert Threshold |
|--------|-----------------|
| `velero_backup_success_total` | No increase in 25h |
| `velero_backup_failure_total` | Any increase |
| Backup duration | > 4 hours |
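The thresholds above could be expressed as Prometheus alerting rules along these lines (a sketch; the group and alert names are illustrative, while the metric names are the ones listed in the table):

```yaml
groups:
  - name: velero
    rules:
      - alert: VeleroBackupFailed
        expr: increase(velero_backup_failure_total[1h]) > 0
      - alert: VeleroBackupStale
        # No successful backup counted in 25h: one missed nightly run plus slack
        expr: increase(velero_backup_success_total[25h]) == 0
```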
## Links

* [Velero Documentation](https://velero.io/docs/)
* [Node Agent (Kopia) Integration](https://velero.io/docs/main/file-system-backup/)
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy
267 decisions/0033-data-analytics-platform.md Normal file
@@ -0,0 +1,267 @@
# Data Analytics Platform Architecture

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Build a modern lakehouse architecture for HTTP analytics and ML feature engineering

## Context and Problem Statement

The homelab generates significant telemetry data from HTTP traffic (via Envoy Gateway), application logs, and ML inference metrics. This data is valuable for:

- Traffic pattern analysis
- Security anomaly detection
- ML feature engineering
- Cost optimization insights

How do we build a scalable analytics platform that supports both batch and real-time processing?

## Decision Drivers

* Modern lakehouse architecture (SQL + streaming)
* Real-time and batch processing capabilities
* Cost-effective on homelab hardware
* Integration with existing observability stack
* Support for ML feature pipelines
* Open table formats for interoperability

## Considered Options

1. **Lakehouse: Nessie + Spark + Flink + Trino + RisingWave**
2. **Traditional DWH: ClickHouse only**
3. **Cloud-native: Databricks/Snowflake (SaaS)**
4. **Minimal: PostgreSQL with TimescaleDB**
## Decision Outcome

Chosen option: **Option 1 - Modern Lakehouse Architecture**

A full lakehouse stack with Apache Iceberg tables (via Nessie catalog), Spark for batch ETL, Flink for streaming, Trino for interactive queries, and RisingWave for streaming SQL.

### Positive Consequences

* Unified batch and streaming on same data
* Git-like versioning of tables via Nessie
* Standard SQL across all engines
* Decoupled compute and storage
* Open formats prevent vendor lock-in
* ML feature engineering support

### Negative Consequences

* Complex multi-component architecture
* Higher resource requirements
* Steeper learning curve
* Multiple operators to maintain

## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                DATA SOURCES                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │  Envoy Logs  │  │ Application  │  │  Inference   │  │  Prometheus  │     │
│  │ (HTTPRoute)  │  │  Telemetry   │  │   Metrics    │  │   Metrics    │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘
          │                 │                 │                 │
          └─────────────────┼─────────────────┘                 │
                            │                                   │
                            ▼                                   │
                ┌───────────────────────┐                       │
                │    NATS JetStream     │◄──────────────────────┘
                │   (Event Streaming)   │
                └───────────┬───────────┘
                            │
           ┌────────────────┼─────────────────┐
           │                │                 │
           ▼                ▼                 ▼
  ┌─────────────────┐ ┌───────────┐ ┌───────────────────┐
  │  Apache Flink   │ │ RisingWave│ │   Apache Spark    │
  │ (Streaming ETL) │ │  (Stream  │ │    (Batch ETL)    │
  │                 │ │   SQL)    │ │                   │
  └────────┬────────┘ └─────┬─────┘ └─────────┬─────────┘
           │                │                 │
           └────────────────┼─────────────────┘
                            │  Write Iceberg Tables
                            ▼
                ┌───────────────────────┐
                │        Nessie         │
                │   (Iceberg Catalog)   │
                │  Git-like versioning  │
                └───────────┬───────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │      NFS Storage      │
                │(candlekeep:/lakehouse)│
                └───────────┬───────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │         Trino         │
                │  (Interactive Query)  │
                │ + Grafana Dashboards  │
                └───────────────────────┘
```
## Component Details

### Apache Nessie (Iceberg Catalog)

**Purpose:** Git-like version control for data tables

```yaml
# HelmRelease: nessie
# Version: 0.107.1
spec:
  versionStoreType: ROCKSDB  # Embedded storage
  catalog:
    iceberg:
      configDefaults:
        warehouse: s3://lakehouse/
```
**Features:**

- Branch/tag data versions
- Time travel queries
- Multi-table transactions
- Cross-engine compatibility
### Apache Spark (Batch Processing)

**Purpose:** Large-scale batch ETL and ML feature engineering

```yaml
# SparkApplication for HTTPRoute analytics
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  type: Python
  mode: cluster
  sparkConf:
    spark.sql.catalog.nessie: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.nessie.catalog-impl: org.apache.iceberg.nessie.NessieCatalog
    spark.sql.catalog.nessie.uri: http://nessie:19120/api/v1
```
**Use Cases:**

- Daily HTTPRoute log aggregation
- Feature engineering for ML
- Historical data compaction
### Apache Flink (Stream Processing)

**Purpose:** Real-time event processing

```yaml
# Flink Kubernetes Operator
# Version: 1.13.0
spec:
  job:
    jarURI: local:///opt/flink/jobs/httproute-analytics.jar
    parallelism: 2
```
**Use Cases:**

- Real-time traffic anomaly detection
- Streaming ETL to Iceberg
- Session windowing for user analytics
### RisingWave (Streaming SQL)

**Purpose:** Simplified streaming SQL for real-time dashboards

```sql
-- Materialized view for real-time traffic
CREATE MATERIALIZED VIEW traffic_5min AS
SELECT
    window_start,
    route_name,
    COUNT(*) AS request_count,
    AVG(response_time_ms) AS avg_latency
FROM TUMBLE(httproute_events, event_time, INTERVAL '5 MINUTES')
GROUP BY
    window_start,
    route_name;
```
**Use Cases:**

- Real-time Grafana dashboards
- Streaming aggregations
- Alerting triggers
### Trino (Interactive Query)

**Purpose:** Fast SQL queries across Iceberg tables

```yaml
# Trino coordinator + 2 workers
catalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=nessie
    iceberg.nessie.uri=http://nessie:19120/api/v1
```
**Use Cases:**

- Ad-hoc analytics queries
- Grafana data source for dashboards
- Cross-table JOINs
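As an illustration, an ad-hoc query over the daily aggregates might look like this (the `iceberg` catalog name matches the Trino configuration above, but the column names are assumptions about the table schema):

```sql
-- Top routes by request volume over the last 7 days (illustrative schema)
SELECT route_name, SUM(request_count) AS total_requests
FROM iceberg.analytics.httproute_daily_agg
WHERE day >= current_date - INTERVAL '7' DAY
GROUP BY route_name
ORDER BY total_requests DESC
LIMIT 10;
```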
## Data Flow: HTTPRoute Analytics

```
Envoy Gateway
      │
      ▼  (access logs via OTEL)
NATS JetStream
      │
      ├─► Flink Job (streaming)
      │         │
      │         ▼
      │   Iceberg Table: httproute_raw
      │
      └─► Spark Job (nightly batch)
                │
                ▼
          Iceberg Table: httproute_daily_agg
                │
                ▼
          Trino ─► Grafana Dashboard
```
## Storage Layout

```
candlekeep:/kubernetes/lakehouse/
├── warehouse/
│   └── analytics/
│       ├── httproute_raw/          # Raw events (partitioned by date)
│       ├── httproute_daily_agg/    # Daily aggregates
│       ├── inference_metrics/      # ML inference stats
│       └── feature_store/          # ML features
└── checkpoints/
    ├── flink/                      # Flink savepoints
    └── spark/                      # Spark checkpoints
```
## Resource Allocation

| Component | Replicas | CPU | Memory |
|-----------|----------|-----|--------|
| Nessie | 1 | 0.5 | 512Mi |
| Spark Operator | 1 | 0.2 | 256Mi |
| Flink Operator | 1 | 0.2 | 256Mi |
| Flink JobManager | 1 | 1 | 2Gi |
| Flink TaskManager | 2 | 2 | 4Gi |
| RisingWave | 1 | 2 | 4Gi |
| Trino Coordinator | 1 | 1 | 2Gi |
| Trino Worker | 2 | 2 | 4Gi |
## Links

* [Apache Iceberg](https://iceberg.apache.org/)
* [Project Nessie](https://projectnessie.org/)
* [Apache Flink](https://flink.apache.org/)
* [RisingWave](https://risingwave.com/)
* [Trino](https://trino.io/)
* Related: [ADR-0025](0025-observability-stack.md) - Observability Stack
206 decisions/0034-volcano-batch-scheduling.md Normal file
@@ -0,0 +1,206 @@
# Volcano Batch Scheduling Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Optimize scheduling for batch ML and analytics workloads

## Context and Problem Statement

The homelab runs diverse workloads including:

- AI/ML training jobs (batch, GPU-intensive)
- Spark/Flink analytics jobs (batch, CPU/memory-intensive)
- KubeRay cluster with multiple GPU workers
- Long-running inference services

The default Kubernetes scheduler (kube-scheduler) is optimized for microservices, not batch workloads. It lacks:

- Gang scheduling (all-or-nothing pod placement)
- Fair-share queuing across teams/projects
- Preemption policies for priority workloads
- Resource reservation for batch jobs

How do we optimize scheduling for batch and ML workloads?

## Decision Drivers

* Gang scheduling for distributed ML training
* Fair-share resource allocation
* Priority-based preemption
* Integration with Kubeflow and Spark
* GPU-aware scheduling
* Queue management for multi-tenant scenarios

## Considered Options

1. **Volcano Scheduler**
2. **Apache YuniKorn**
3. **Kubernetes default scheduler with Priority Classes**
4. **Kueue (Kubernetes Batch Workload Queueing)**
## Decision Outcome

Chosen option: **Option 1 - Volcano Scheduler**

Volcano is a CNCF project designed for batch, HPC, and ML workloads. It provides gang scheduling, queue management, and integrates natively with Spark, Flink, and ML frameworks.

### Positive Consequences

* Gang scheduling prevents partial deployments
* Queue-based fair-share resource management
* Native Spark and Flink integration
* Preemption for high-priority jobs
* CNCF project with active community
* Coexists with default scheduler

### Negative Consequences

* Additional scheduler components (admission, controller, scheduler)
* Learning curve for queue configuration
* Workloads must opt-in via scheduler name

## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                        Volcano System                         │
│                   (volcano-system namespace)                  │
│                                                               │
│ ┌─────────────────┐  ┌───────────────────┐  ┌───────────────┐ │
│ │    Admission    │  │    Controllers    │  │   Scheduler   │ │
│ │     Webhook     │  │  (Job lifecycle)  │  │  (Placement)  │ │
│ └─────────────────┘  └───────────────────┘  └───────────────┘ │
└───────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                            Queues                             │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │ ml-training   │  analytics  │  inference  │   default   │  │
│  │ weight: 40    │  weight: 30 │  weight: 20 │  weight: 10 │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                           Workloads                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │  Spark Jobs  │  │  Flink Jobs  │  │  ML Training (KFP)   │ │
│  │ (analytics)  │  │ (analytics)  │  │    (ml-training)     │ │
│  └──────────────┘  └──────────────┘  └──────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
```
## Configuration

### Queue Definition

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  weight: 40
  reclaimable: true
  guarantee:
    resource:
      cpu: "4"
      memory: "16Gi"
  capability:
    cpu: "32"
    memory: "128Gi"
    nvidia.com/gpu: "2"
```
### Spark Integration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: analytics-job
spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: analytics
    priorityClassName: normal
  driver:
    schedulerName: volcano
  executor:
    schedulerName: volcano
    instances: 4
```
### Gang Scheduling for ML Training

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 4  # Gang: all 4 pods or none
  queue: ml-training
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
```
## Queue Structure

| Queue | Weight | Use Case | Guarantee | Preemptible |
|-------|--------|----------|-----------|-------------|
| `ml-training` | 40 | Kubeflow jobs, RayJobs | 4 CPU, 16Gi | No |
| `analytics` | 30 | Spark/Flink batch jobs | 2 CPU, 8Gi | Yes |
| `inference` | 20 | Batch inference jobs | 2 CPU, 8Gi | No |
| `default` | 10 | Miscellaneous batch | None | Yes |

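The weights above translate into proportional shares of divisible cluster resources. A minimal sketch of the standard weighted fair-share arithmetic (an illustration of the concept, not Volcano's exact algorithm):

```python
# Each queue's deserved share is its weight over the sum of all weights.
QUEUE_WEIGHTS = {"ml-training": 40, "analytics": 30, "inference": 20, "default": 10}

def deserved_shares(weights: dict) -> dict:
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

shares = deserved_shares(QUEUE_WEIGHTS)
print(shares["ml-training"])  # 0.4 -- 40% of contended resources
```

Guarantees are honored first; the weighted split only applies to capacity beyond the guaranteed minimums, capped by each queue's `capability`.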
## Scheduler Selection

Workloads opt into Volcano by setting:

```yaml
spec:
  schedulerName: volcano
```

Long-running services (inference endpoints, databases) continue using the default scheduler for stability.

## Preemption Policy

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High priority ML training jobs"
```

## Monitoring

| Metric | Description |
|--------|-------------|
| `volcano_queue_allocated_*` | Resources currently allocated per queue |
| `volcano_queue_pending_*` | Pending resource requests per queue |
| `volcano_job_status` | Job lifecycle states |
| `volcano_scheduler_throughput` | Scheduling decisions per second |

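For dashboards, the queue metrics above can be turned into a saturation signal; a sketch (the `_milli_cpu` suffix and `queue_name` label are assumptions to verify against the actual metrics Volcano exposes):

```promql
# CPU demand the ml-training queue cannot yet satisfy
volcano_queue_pending_milli_cpu{queue_name="ml-training"}
```

A sustained non-zero value here while other queues sit below their guarantees is the signal that reclaim/preemption is not keeping up.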
## Links

* [Volcano Documentation](https://volcano.sh/docs/)
* [Gang Scheduling](https://volcano.sh/docs/gang_scheduling/)
* [Spark on Volcano](https://volcano.sh/docs/spark/)
* Related: [ADR-0009](0009-dual-workflow-engines.md) - Dual Workflow Engines
* Related: [ADR-0033](0033-data-analytics-platform.md) - Data Analytics Platform

195
decisions/0035-arm64-worker-strategy.md
Normal file
@@ -0,0 +1,195 @@

# ARM64 Raspberry Pi Worker Node Strategy

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Integrate Raspberry Pi nodes into the Kubernetes cluster

## Context and Problem Statement

The homelab cluster includes 5 Raspberry Pi 4/5 nodes (ARM64 architecture) alongside x86_64 servers. These low-power nodes provide:

- Additional compute capacity for lightweight workloads
- Geographic distribution within the home network
- A learning platform for multi-architecture Kubernetes

However, ARM64 nodes have constraints:

- No GPU acceleration
- Lower CPU/memory than the x86_64 servers
- Some container images lack ARM64 support
- Limited local storage

How do we integrate ARM64 nodes effectively while avoiding scheduling failures?

## Decision Drivers

* Maximize utilization of ARM64 compute
* Prevent ARM-incompatible workloads from scheduling
* Maintain cluster stability
* Support multi-arch container images
* Minimize operational overhead

## Considered Options

1. **Node labels + affinity for workload placement**
2. **Separate ARM64-only namespace**
3. **Taints to exclude from general scheduling**
4. **ARM64 nodes for specific workload types only**

## Decision Outcome

Chosen option: **Option 1 + Option 4 hybrid** - Use node labels with affinity rules, and designate ARM64 nodes for specific workload categories.

ARM64 nodes handle:

- Lightweight control plane components (where multi-arch images exist)
- Velero node-agent (backup DaemonSet)
- Node-level monitoring (Prometheus node-exporter)
- Future: Edge/IoT workloads

### Positive Consequences

* Clear workload segmentation
* No scheduling failures from arch mismatch
* Efficient use of low-power nodes
* Room for future ARM-specific workloads
* Cost-effective cluster expansion

### Negative Consequences

* Some nodes may be underutilized
* Must maintain multi-arch image awareness
* Additional scheduling complexity

## Cluster Composition

| Node | Architecture | Role | Instance Type |
|------|--------------|------|---------------|
| bruenor | amd64 | control-plane | - |
| catti | amd64 | control-plane | - |
| storm | amd64 | control-plane | - |
| khelben | amd64 | GPU worker (Strix Halo) | - |
| elminster | amd64 | GPU worker (NVIDIA) | - |
| drizzt | amd64 | GPU worker (RDNA2) | - |
| danilo | amd64 | GPU worker (Intel Arc) | - |
| regis | amd64 | worker | - |
| wulfgar | amd64 | worker | - |
| **durnan** | **arm64** | worker | raspberry-pi |
| **elaith** | **arm64** | worker | raspberry-pi |
| **jarlaxle** | **arm64** | worker | raspberry-pi |
| **mirt** | **arm64** | worker | raspberry-pi |
| **volo** | **arm64** | worker | raspberry-pi |

## Node Labels

```yaml
# Applied via Talos machine config or kubectl
labels:
  kubernetes.io/arch: arm64
  kubernetes.io/os: linux
  node.kubernetes.io/instance-type: raspberry-pi
  kubernetes.io/storage: none  # No Longhorn on Pis
```

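Conversely, a workload that should run only on the Pi nodes can select these labels directly; a minimal pod-spec fragment using a `nodeSelector` (illustrative, assuming the labels shown above are applied):

```yaml
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    node.kubernetes.io/instance-type: raspberry-pi
```

A `nodeSelector` is a hard requirement; use `preferredDuringSchedulingIgnoredDuringExecution` affinity instead if the workload should merely prefer the Pis.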
## Workload Placement

### DaemonSets (Run Everywhere)

These run on all nodes, including ARM64:

| DaemonSet | Namespace | Multi-arch |
|-----------|-----------|------------|
| velero-node-agent | velero | ✅ |
| cilium-agent | kube-system | ✅ |
| node-exporter | observability | ✅ |

### ARM64-Excluded Workloads

These explicitly exclude ARM64 via node affinity:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - amd64
```

| Workload Type | Reason for Exclusion |
|---------------|----------------------|
| GPU workloads | No GPU on Pis |
| Longhorn | Pis have no storage label |
| Heavy databases | Insufficient resources |
| Most HelmReleases | Image compatibility |

### ARM64-Compatible Light Workloads

Potential future workloads for ARM64 nodes:

| Workload | Use Case |
|----------|----------|
| MQTT broker | IoT message routing |
| Pi-hole | DNS ad blocking |
| Home Assistant | Home automation |
| Lightweight proxies | Traffic routing |

## Storage Exclusion

ARM64 nodes are excluded from Longhorn:

```yaml
# Longhorn Helm values
defaultSettings:
  systemManagedComponentsNodeSelector: "kubernetes.io/arch:amd64"
```

Node label:

```yaml
kubernetes.io/storage: none
```

## Resource Constraints

| Node Type | CPU | Memory | Typical Available |
|-----------|-----|--------|-------------------|
| Raspberry Pi 4 | 4 cores | 4-8GB | 3 cores, 3GB |
| Raspberry Pi 5 | 4 cores | 8GB | 3.5 cores, 6GB |

## Multi-Architecture Image Strategy

For workloads that should run on ARM64:

1. **Use multi-arch base images** (e.g., `alpine`, `debian`)
2. **Build with Docker buildx**:

   ```bash
   docker buildx build --platform linux/amd64,linux/arm64 -t myimage:latest .
   ```

3. **Verify arch support** before deployment

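Step 3 can be scripted against `docker manifest inspect` output; a minimal sketch that checks an OCI image index for a `linux/arm64` entry (the sample index below is illustrative, not a real image):

```python
import json

# Decide whether an image index advertises a given platform variant.
def supports_platform(index: dict, arch: str = "arm64", os_name: str = "linux") -> bool:
    return any(
        m.get("platform", {}).get("architecture") == arch
        and m.get("platform", {}).get("os") == os_name
        for m in index.get("manifests", [])
    )

# Illustrative shape of `docker manifest inspect` output for a multi-arch image.
sample = json.loads("""
{
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux"}}
  ]
}
""")
print(supports_platform(sample))  # True
```

Wiring this into CI before a HelmRelease bump catches arch mismatches before they become `ImagePullBackOff` on the Pis.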
## Monitoring ARM64 Nodes

```promql
# Available memory by node and architecture
sum by (node, label_kubernetes_io_arch) (
  node_memory_MemAvailable_bytes
  * on(node) group_left(label_kubernetes_io_arch)
  kube_node_labels{label_kubernetes_io_arch!=""}
)
```

## Future Considerations

- **Edge workloads**: ARM64 nodes are ideal for edge compute patterns
- **IoT integration**: MQTT, sensor data collection
- **Scale-out**: Add more Pis for lightweight workload capacity
- **ARM64 ML inference**: Some models support ARM (TensorFlow Lite)

## Links

* [Kubernetes Multi-Architecture](https://kubernetes.io/docs/concepts/containers/images/#multi-architecture-images)
* [Talos on Raspberry Pi](https://talos.dev/v1.12/talos-guides/install/single-board-computers/rpi_generic/)
* Related: [ADR-0002](0002-use-talos-linux.md) - Use Talos Linux
* Related: [ADR-0026](0026-storage-strategy.md) - Storage Strategy

256
decisions/0036-renovate-dependency-updates.md
Normal file
@@ -0,0 +1,256 @@

# Automated Dependency Updates with Renovate

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Automate dependency updates across all homelab repositories

## Context and Problem Statement

The homelab consists of 20+ repositories containing:

- Kubernetes manifests with container image references
- Helm chart versions
- Python/Go dependencies
- GitHub Actions / Gitea Actions workflow versions

Manually tracking and updating dependencies is:

- Time-consuming
- Error-prone
- Often neglected until security issues arise

How do we automate dependency updates while maintaining control over what gets updated?

## Decision Drivers

* Automated detection of outdated dependencies
* PR-based update workflow for review
* Support for Kubernetes manifests, Helm, Python, Go, Docker
* Self-hosted on existing infrastructure
* Configurable grouping and scheduling
* Security update prioritization

## Considered Options

1. **Renovate (self-hosted)**
2. **Dependabot (GitHub-native)**
3. **Manual updates with version scripts**
4. **Flux image automation**

## Decision Outcome

Chosen option: **Option 1 - Renovate (self-hosted)**

Renovate runs as a CronJob in the cluster, scanning all repositories in the Gitea organization and creating PRs for outdated dependencies. It supports more package managers than Dependabot and works with Gitea.

### Positive Consequences

* Comprehensive manager support (40+ package managers)
* Works with self-hosted Gitea
* Configurable grouping (batch minor updates)
* Auto-merge for patch/minor updates
* Dashboard for update overview
* Reusable preset configurations

### Negative Consequences

* Additional CronJob to maintain
* Configuration complexity
* API token management for Gitea access

## Architecture

```
┌───────────────────────────────────────────────────────────────────┐
│                        Renovate CronJob                           │
│                        (ci-cd namespace)                          │
│                                                                   │
│  Schedule: Every 8 hours (0 */8 * * *)                            │
│                                                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    Renovate Container                      │  │
│  │                                                            │  │
│  │  1. Fetch repositories from Gitea org                      │  │
│  │  2. Scan each repo for dependencies                        │  │
│  │  3. Compare versions with upstream registries              │  │
│  │  4. Create/update PRs for outdated deps                    │  │
│  │  5. Auto-merge approved patches                            │  │
│  └────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│                              Gitea                                │
│                                                                   │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐          │
│  │ homelab-k8s2  │  │ chat-handler  │  │ kuberay-images│          │
│  │               │  │               │  │               │          │
│  │ PR: Update    │  │ PR: Update    │  │ PR: Update    │          │
│  │ flux to 2.5.0 │  │ httpx to 0.28 │  │ ROCm to 6.4   │          │
│  └───────────────┘  └───────────────┘  └───────────────┘          │
└───────────────────────────────────────────────────────────────────┘
```

## Configuration

### CronJob

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: renovate
  namespace: ci-cd
spec:
  schedule: "0 */8 * * *"  # Every 8 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: renovate
              image: renovate/renovate:39
              env:
                - name: RENOVATE_PLATFORM
                  value: "gitea"
                - name: RENOVATE_ENDPOINT
                  value: "https://git.daviestechlabs.io/api/v1"
                - name: RENOVATE_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: renovate-github-token
                      key: token
                - name: RENOVATE_AUTODISCOVER
                  value: "true"
                - name: RENOVATE_AUTODISCOVER_FILTER
                  value: "daviestechlabs/*"
          restartPolicy: OnFailure
```

### Repository Config (renovate.json)

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": [
    "config:recommended",
    "group:allNonMajor",
    ":automergeMinor",
    ":automergePatch"
  ],
  "kubernetes": {
    "fileMatch": ["\\.ya?ml$"]
  },
  "packageRules": [
    {
      "matchManagers": ["helm-values", "helmv3"],
      "groupName": "helm charts"
    },
    {
      "matchPackagePatterns": ["^ghcr.io/"],
      "groupName": "GHCR images"
    },
    {
      "matchUpdateTypes": ["major"],
      "automerge": false,
      "labels": ["major-update"]
    }
  ],
  "schedule": ["before 6am on monday"]
}
```

## Supported Package Managers

| Manager | File Patterns | Examples |
|---------|---------------|----------|
| kubernetes | `*.yaml`, `*.yml` | Container images in Deployments |
| helm | `Chart.yaml`, `values.yaml` | Helm chart dependencies |
| helmv3 | HelmRelease CRDs | Flux HelmReleases |
| flux | Flux CRDs | GitRepository, OCIRepository |
| pip | `requirements.txt`, `pyproject.toml` | Python packages |
| gomod | `go.mod` | Go modules |
| dockerfile | `Dockerfile*` | Base images |
| github-actions | `.github/workflows/*.yml` | Action versions |
| gitea-actions | `.gitea/workflows/*.yml` | Action versions |

## Update Strategy

### Auto-merge Enabled

| Update Type | Auto-merge | Delay |
|-------------|------------|-------|
| Patch (x.x.1 → x.x.2) | ✅ Yes | Immediate |
| Minor (x.1.x → x.2.x) | ✅ Yes | 3 days stabilization |
| Major (1.x.x → 2.x.x) | ❌ No | Manual review |

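The table's classification is a plain semver comparison; a minimal sketch of the same rule (illustrative, not Renovate's internal logic):

```python
# Classify a version bump the way the auto-merge table treats it.
def update_type(current: str, new: str) -> str:
    cur = current.split(".")
    nxt = new.split(".")
    if cur[0] != nxt[0]:
        return "major"   # manual review
    if cur[1] != nxt[1]:
        return "minor"   # auto-merge after 3-day stabilization
    return "patch"       # auto-merge immediately

print(update_type("2.4.1", "2.4.2"))  # patch
print(update_type("2.4.1", "2.5.0"))  # minor
print(update_type("2.4.1", "3.0.0"))  # major
```

In Renovate itself this mapping is driven by `matchUpdateTypes` in `packageRules`, as in the repository config above.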
### Grouping Strategy

| Group | Contents | Frequency |
|-------|----------|-----------|
| `all-non-major` | All patch + minor updates | Weekly (Monday) |
| `helm-charts` | All Helm chart updates | Weekly |
| `container-images` | Docker image updates | Weekly |
| `security` | CVE fixes | Immediate |

## Security Updates

Renovate prioritizes security updates:

```json
{
  "vulnerabilityAlerts": {
    "enabled": true,
    "labels": ["security"]
  },
  "packageRules": [
    {
      "matchCategories": ["security"],
      "automerge": true,
      "schedule": ["at any time"],
      "prPriority": 10
    }
  ]
}
```

## Dashboard

Renovate creates a "Dependency Dashboard" issue in each repository:

```markdown
## Dependency Dashboard

### Open PRs
- [ ] Update httpx to 0.28.1 (#42)
- [x] Update pillow to 11.0.0 (#41) - merged

### Pending Approval
- [ ] Major: Update pydantic to v2 (#40)

### Rate Limited
- fastapi (waiting for next schedule window)
```

## Secrets

| Secret | Source | Purpose |
|--------|--------|---------|
| `renovate-github-token` | Vault | Gitea API access |
| `renovate-dockerhub` | Vault | Docker Hub rate limits |

## Monitoring

```promql
# Renovate job success rate
sum(kube_job_status_succeeded{job_name=~"renovate-.*"})
/
sum(kube_job_status_succeeded{job_name=~"renovate-.*"} + kube_job_status_failed{job_name=~"renovate-.*"})
```

## Links

* [Renovate Documentation](https://docs.renovatebot.com/)
* [Renovate Presets](https://docs.renovatebot.com/presets-default/)
* [Gitea Platform Support](https://docs.renovatebot.com/modules/platform/gitea/)
* Related: [ADR-0013](0013-gitea-actions-for-ci.md) - Gitea Actions for CI
* Related: [ADR-0031](0031-gitea-cicd-strategy.md) - Gitea CI/CD Strategy

187
decisions/0037-node-naming-conventions.md
Normal file
@@ -0,0 +1,187 @@

# Node Naming Conventions

* Status: accepted
* Date: 2026-02-05
* Deciders: Billy
* Technical Story: Establish memorable, role-based naming for cluster nodes

## Context and Problem Statement

The homelab cluster has grown to include:

- 14 Kubernetes nodes (control plane + workers)
- Multiple storage servers
- Development workstations

Generic names like `node-01` and `worker-gpu-1` are:

- Hard to remember
- Uninformative about node purpose
- Boring

How do we name nodes in a way that's memorable, fun, and indicates their role?

## Decision Drivers

* Names should indicate node role/capability
* Easy to remember and reference in conversation
* Consistent theme across the homelab
* Scalable as more nodes are added

## Decision Outcome

Chosen option: **Dungeons & Dragons character naming scheme**

All nodes are named after famous D&D characters from the Forgotten Realms, with character class mapping to node role.

## Naming Scheme

### Control Plane → Companions of the Hall

The control plane nodes are named after the legendary Companions of the Hall, Drizzt's closest allies.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `bruenor` | Bruenor Battlehammer | Intel N100 | Dwarf King of Mithral Hall |
| `catti` | Catti-brie | Intel N100 | Human ranger, Bruenor's adopted daughter |
| `storm` | Storm Silverhand | Intel N100 | Chosen of Mystra, Harper leader |

### Wizards → GPU Nodes (Spellcasters)

Wizards cast powerful spells, just as GPU nodes power AI/ML workloads.

| Node | Character | GPU | Notes |
|------|-----------|-----|-------|
| `khelben` | Khelben "Blackstaff" Arunsun | AMD Radeon 8060S 64GB | Primary AI inference, Strix Halo APU |
| `elminster` | Elminster Aumar | NVIDIA RTX 2070 8GB | CUDA workloads, Sage of Shadowdale |
| `drizzt` | Drizzt Do'Urden* | AMD Radeon 680M | ROCm backup node |
| `danilo` | Danilo Thann | Intel Arc A770 | Intel inference, bard/wizard multiclass |
| `regis` | Regis | NVIDIA GPU | Halfling with magical ruby, spellthief vibes |

*Drizzt is technically a ranger, but his magical scimitars and time in Menzoberranzan qualify him for the GPU tier.

### Rogues → ARM64 Edge Nodes

Rogues are nimble and work in the shadows, which suits lightweight edge compute on Raspberry Pi nodes.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `durnan` | Durnan | Raspberry Pi 4 8GB | Yawning Portal innkeeper, retired adventurer |
| `elaith` | Elaith Craulnober | Raspberry Pi 4 8GB | The Serpent, moon elf rogue |
| `jarlaxle` | Jarlaxle Baenre | Raspberry Pi 4 8GB | Drow mercenary leader |
| `mirt` | Mirt the Moneylender | Raspberry Pi 4 8GB | Harper agent, "Old Wolf" |
| `volo` | Volothamp Geddarm | Raspberry Pi 4 8GB | Famous author and traveler |

### Fighters → x86 CPU Workers

Fighters are the workhorses, handling general compute without magical (GPU) abilities.

| Node | Character | Hardware | Notes |
|------|-----------|----------|-------|
| `wulfgar` | Wulfgar | Intel x86_64 | Barbarian of Icewind Dale, Aegis-fang wielder |

### Infrastructure Nodes (Locations)

| Node | Character/Location | Role | Notes |
|------|-------------------|------|-------|
| `candlekeep` | Candlekeep | Primary NAS (Synology) | Library fortress, knowledge storage |
| `neverwinter` | Neverwinter | Fast NAS (TrueNAS Scale) | Jewel of the North, all-SSD, nfs-fast |
| `waterdeep` | Waterdeep | Mac Mini dev workstation | City of Splendors, primary city |

### Future Expansion

| Class | Role | Candidate Names |
|-------|------|-----------------|
| Clerics | Database/backup nodes | Cadderly, Dawnbringer |
| Fighters | High-CPU compute | Artemis Entreri, Obould |
| Druids | Monitoring/observability | Jaheira, Cernd |
| Bards | API gateways | Other Thann family members |
| Paladins | Security nodes | Ajantis, Keldorn |

## Architecture

```
┌───────────────────────────────────────────────────────────────────────────────┐
│                     Homelab Cluster (14 Kubernetes Nodes)                     │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │              👑 Control Plane (Companions of the Hall)               │    │
│  │                                                                      │    │
│  │    bruenor            catti              storm                       │    │
│  │    Intel N100         Intel N100         Intel N100                  │    │
│  │    "Dwarf King"       "Catti-brie"       "Silverhand"                │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     🧙 Wizards (GPU Spellcasters)                    │    │
│  │                                                                      │    │
│  │    khelben        elminster     drizzt        danilo       regis     │    │
│  │    Radeon 8060S   RTX 2070      Radeon 680M   Arc A770     NVIDIA    │    │
│  │    64GB unified   8GB VRAM      iGPU          16GB         GPU       │    │
│  │    "Blackstaff"   "Sage"        "Ranger"      "Bard"       "Ruby"    │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     🗡️ Rogues (ARM64 Edge Nodes)                     │    │
│  │                                                                      │    │
│  │    durnan        elaith       jarlaxle      mirt          volo       │    │
│  │    Pi 4 8GB      Pi 4 8GB     Pi 4 8GB      Pi 4 8GB      Pi 4 8GB   │    │
│  │    "Innkeeper"   "Serpent"    "Mercenary"   "Old Wolf"    "Author"   │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
│                                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐    │
│  │                     ⚔️ Fighters (x86 CPU Workers)                    │    │
│  │                                                                      │    │
│  │    wulfgar                                                           │    │
│  │    Intel x86_64                                                      │    │
│  │    "Barbarian of Icewind Dale"                                       │    │
│  └──────────────────────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────────────────────┐
│                   🏰 Locations (Off-Cluster Infrastructure)                   │
│                                                                               │
│    📚 candlekeep           ❄️ neverwinter             🏙️ waterdeep           │
│    Synology NAS            TrueNAS Scale (SSD)        Mac Mini                │
│    nfs-default             nfs-fast                   Dev workstation         │
│    High capacity           High speed                 Primary dev box         │
│    "Library Fortress"      "Jewel of the North"       "City of Splendors"     │
└───────────────────────────────────────────────────────────────────────────────┘
```

## Storage Mapping

| Location | Storage Class | Speed | Capacity | Use Case |
|----------|---------------|-------|----------|----------|
| Candlekeep | `nfs-default` | HDD | High | Backups, archives, media |
| Neverwinter | `nfs-fast` | SSD | Medium | Database WAL, hot data |
| Longhorn | `longhorn` | Local SSD | Distributed | Replicated app data |

## Node Labels

```yaml
# GPU Wizard nodes
node.kubernetes.io/instance-type: gpu-wizard
homelab.daviestechlabs.io/character-class: wizard
homelab.daviestechlabs.io/character-name: khelben

# ARM64 Rogue nodes
node.kubernetes.io/instance-type: raspberry-pi
homelab.daviestechlabs.io/character-class: rogue
homelab.daviestechlabs.io/character-name: jarlaxle
```

## DNS/Hostname Resolution

All nodes are resolvable via:

- Kubernetes DNS: `<node>.node.kubernetes.io`
- Local DNS: `<node>.lab.daviestechlabs.io`
- mDNS: `<node>.local`

## References

* [Forgotten Realms Wiki](https://forgottenrealms.fandom.com/)
* [Khelben Arunsun](https://forgottenrealms.fandom.com/wiki/Khelben_Arunsun)
* [Elminster](https://forgottenrealms.fandom.com/wiki/Elminster_Aumar)
* [Candlekeep](https://forgottenrealms.fandom.com/wiki/Candlekeep)
* [Neverwinter](https://forgottenrealms.fandom.com/wiki/Neverwinter)
* Related: [ADR-0035](0035-arm64-worker-strategy.md) - ARM64 Worker Strategy
* Related: [ADR-0011](0011-kuberay-unified-serving.md) - KubeRay Unified Serving