ADR-0064: waterdeep (Mac Mini M4 Pro) as Dedicated Coding Agent with Fine-Tuned Model

  • Status: proposed
  • Date: 2026-02-26
  • Deciders: Billy
  • Technical Story: Repurpose waterdeep as a dedicated local coding agent serving a fine-tuned code-completion model for OpenCode, Copilot Chat, and other AI coding tools, with a pipeline for continually tuning the model on the homelab codebase

Context and Problem Statement

waterdeep is a Mac Mini M4 Pro with 48 GB of unified memory (ADR-0059). Its current role as a 3D avatar creation workstation (ADR-0059) is being superseded by the automated ComfyUI pipeline (ADR-0063), which handles avatar generation on a personal desktop as an on-demand Ray worker. This frees waterdeep for a higher-value use case.

GitHub Copilot and cloud-hosted coding assistants work well for general code, but they have no knowledge of DaviesTechLabs-specific patterns: the handler-base module API, NATS protobuf message conventions, Kubeflow pipeline structure, Ray Serve deployment patterns, Flux/Kustomize layout, or the Go handler lifecycle used across chat-handler, voice-assistant, pipeline-bridge, stt-module, and tts-module. A model fine-tuned on the homelab codebase would produce completions that follow project conventions out of the box.

With 48 GB of unified memory and no other workloads, waterdeep can serve Qwen 2.5 Coder 32B Instruct at Q8_0 quantisation (~34 GB of weights) via MLX, with ample headroom left for KV cache and macOS overhead. It is the largest purpose-built coding model that fits at high quantisation on this hardware, and it consistently outperforms general-purpose 70B models quantised to Q4 on coding benchmarks.

How should we configure waterdeep as a dedicated coding agent and build a pipeline for fine-tuning the model on our codebase?

Decision Drivers

  • waterdeep's 48 GB unified memory is fully available — no competing workloads after ComfyUI pipeline takeover
  • Qwen 2.5 Coder 32B Instruct is the highest-quality open-source coding model that fits at Q8_0 (~34 GB weights + ~10 GB KV cache headroom)
  • MLX on Apple Silicon provides native Metal-accelerated inference with no framework overhead — purpose-built for M-series chips
  • OpenCode and VS Code Copilot Chat both support OpenAI-compatible API endpoints — a local server is a drop-in replacement
  • The homelab codebase has strong conventions (handler-base, protobuf messages, Kubeflow pipelines, Ray Serve apps, Flux GitOps) that a general model doesn't know
  • Existing training infrastructure (ADR-0058) provides Kubeflow Pipelines + MLflow + S3 data flow for fine-tuning orchestration
  • LoRA adapters are small (~50–200 MB) and can be merged into the base model or hot-swapped in mlx-lm-server
  • The cluster's CPU training capacity (126 cores, 378 GB RAM across 14 nodes) can prepare training datasets; waterdeep itself can run the LoRA fine-tune on its Metal GPU
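
The "~50–200 MB" adapter estimate can be sanity-checked from the LoRA parameter count. A rough sketch — the model shape (64 layers, hidden size 5120, KV width 1024) is taken from the public Qwen2.5-32B config and the rank-16/attention-projection choice mirrors the fine-tune settings later in this ADR; treat all of these dimensions as assumptions, not facts stated here:

```python
# Rough LoRA adapter size estimate for Qwen2.5-Coder-32B.
# Assumed model shape (public Qwen2.5-32B config, not from this ADR):
HIDDEN = 5120          # hidden size
LAYERS = 64            # transformer layers
KV_DIM = 8 * 128       # 8 KV heads x head_dim 128 = 1024

def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """A LoRA pair adds a (d_in x r) and an (r x d_out) matrix."""
    return rank * (d_in + d_out)

def adapter_size_mb(rank: int = 16, bytes_per_param: int = 2) -> float:
    """Size of a LoRA adapter over q/k/v/o projections, fp16 storage."""
    per_layer = (
        lora_params(rank, HIDDEN, HIDDEN)    # q_proj
        + lora_params(rank, HIDDEN, KV_DIM)  # k_proj
        + lora_params(rank, HIDDEN, KV_DIM)  # v_proj
        + lora_params(rank, HIDDEN, HIDDEN)  # o_proj
    )
    total = per_layer * LAYERS
    return total * bytes_per_param / 1e6

print(f"{adapter_size_mb():.0f} MB")  # → 67 MB at rank 16 across all layers
```

Rank 16 on every layer lands near the low end of the quoted range; tuning only a subset of layers (as the fine-tune commands below do) shrinks the adapter further, while higher ranks or more target modules push toward the upper end.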

Considered Options

  1. Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server on waterdeep — fine-tuned with LoRA on the homelab codebase using MLX
  2. Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp on waterdeep — larger general-purpose model at aggressive quantisation
  3. DeepSeek Coder V2 Lite 16B via MLX on waterdeep — smaller coding model, lower resource usage
  4. Keep using cloud Copilot only — no local model, no fine-tuning

Decision Outcome

Chosen option: Option 1 — Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server, because it is the best-in-class open-source coding model at a quantisation level that preserves near-full quality, fits comfortably within the 48 GB memory budget with room for KV cache, and MLX provides the optimal inference stack for Apple Silicon. Fine-tuning with LoRA on the homelab codebase will specialise the model to project conventions.

Positive Consequences

  • Purpose-built coding model — Qwen 2.5 Coder 32B tops open-source coding benchmarks (HumanEval, MBPP, BigCodeBench)
  • Q8_0 quantisation preserves >99% of full-precision quality — minimal degradation vs Q4
  • ~34 GB model weights + ~10 GB KV cache headroom = comfortable fit in 48 GB unified memory
  • MLX inference leverages Metal GPU for token generation — fast enough for interactive coding assistance
  • OpenAI-compatible API via mlx-lm-server — works with OpenCode, VS Code Copilot Chat (custom endpoint), Continue.dev, and any OpenAI SDK client
  • Fine-tuned LoRA adapter teaches project-specific patterns: handler-base API, NATS message conventions, Kubeflow pipeline structure, Flux layout
  • LoRA fine-tuning runs directly on waterdeep using mlx-lm — no cluster resources needed for training
  • Adapter files are small (~50–200 MB) — easy to version in Gitea and track in MLflow
  • Fully offline — no cloud dependency, no data leaves the network
  • Frees Copilot quota for non-coding tasks — local model handles bulk code completion

Negative Consequences

  • waterdeep is dedicated to this role — cannot simultaneously serve other workloads (Blender, etc.)
  • Model updates require manual download and conversion to MLX format
  • LoRA fine-tuning quality depends on training data curation — garbage in, garbage out
  • 32B model is slower than cloud Copilot for very long completions — acceptable for interactive use
  • Single point of failure — if waterdeep is down, fall back to cloud Copilot

Pros and Cons of the Options

Option 1: Qwen 2.5 Coder 32B Instruct (Q8_0) via MLX

  • Good, because purpose-built for code — trained on 5.5T tokens of code data
  • Good, because 32B at Q8_0 (~34 GB) fits in 48 GB with KV cache headroom
  • Good, because Q8_0 preserves near-full quality (vs Q4 which drops noticeably on coding tasks)
  • Good, because MLX is Apple's native framework — zero-copy unified memory, Metal GPU kernels
  • Good, because mlx-lm supports LoRA fine-tuning natively — train and serve on the same machine
  • Good, because OpenAI-compatible API (mlx-lm-server) — drop-in for any coding tool
  • Bad, because 32B generates ~15–25 tokens/sec on M4 Pro — adequate but not instant for long outputs
  • Bad, because MLX model format requires conversion from HuggingFace (one-time, scripted)

Option 2: Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp

  • Good, because 70B is a larger, more capable general model
  • Good, because llama.cpp is mature and well-supported on macOS
  • Bad, because Q4_K_M quantisation loses meaningful quality — especially on code tasks where precision matters
  • Bad, because ~42 GB weights leaves only ~6 GB for KV cache — tight, risks OOM on long contexts
  • Bad, because general-purpose model — not trained specifically for code, underperforms Qwen 2.5 Coder 32B on coding benchmarks despite being 2× larger
  • Bad, because slower token generation (~8–12 tok/s) due to larger model size
  • Bad, because llama.cpp doesn't natively support LoRA fine-tuning — need a separate training framework

Option 3: DeepSeek Coder V2 Lite 16B via MLX

  • Good, because smaller model — faster inference (~30–40 tok/s), lighter memory footprint
  • Good, because still a capable coding model
  • Bad, because significantly less capable than Qwen 2.5 Coder 32B on benchmarks
  • Bad, because leaves 30+ GB of unified memory unused — not maximising the hardware
  • Bad, because fewer parameters mean less capacity to absorb fine-tuning knowledge

Option 4: Cloud Copilot only

  • Good, because zero local infrastructure to maintain
  • Good, because always up-to-date with latest model improvements
  • Bad, because no knowledge of homelab-specific conventions — completions require heavy editing
  • Bad, because cloud latency for every completion
  • Bad, because data (code context) leaves the network
  • Bad, because wastes waterdeep's 48 GB of unified memory sitting idle

Architecture

Inference Server

┌──────────────────────────────────────────────────────────────────────────┐
│  waterdeep (Mac Mini M4 Pro · 48 GB unified · Metal GPU · dedicated)    │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  mlx-lm-server (launchd-managed)                                   │  │
│  │                                                                    │  │
│  │  Model: Qwen2.5-Coder-32B-Instruct (Q8_0, MLX format)             │  │
│  │  LoRA:  ~/.mlx-models/adapters/homelab-coder/latest/               │  │
│  │                                                                    │  │
│  │  Endpoint: http://waterdeep.lab.daviestechlabs.io:8080/v1          │  │
│  │  ├── /v1/completions         (code completion, FIM)                │  │
│  │  ├── /v1/chat/completions    (chat / instruct)                     │  │
│  │  └── /v1/models              (model listing)                       │  │
│  │                                                                    │  │
│  │  Memory: ~34 GB model + ~10 GB KV cache = ~44 GB                   │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌─────────────────────────┐  ┌──────────────────────────────────────┐  │
│  │  macOS overhead ~3 GB    │  │  Training (on-demand, same GPU)      │  │
│  │  (kernel, WindowServer,  │  │  mlx-lm LoRA fine-tune               │  │
│  │   mDNSResponder, etc.)   │  │  (server stopped during training)    │  │
│  └─────────────────────────┘  └──────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
         │
         │ HTTP :8080 (OpenAI-compatible API)
         │
    ┌────┴──────────────────────────────────────────────────────┐
    │                                                            │
    ▼                                                            ▼
┌─────────────────────────────┐    ┌─────────────────────────────────────┐
│  VS Code (any machine)      │    │  OpenCode (terminal, any machine)   │
│                              │    │                                     │
│  Copilot Chat / Continue.dev │    │  OPENCODE_MODEL_PROVIDER=openai     │
│  Custom endpoint →           │    │  OPENAI_API_BASE=                   │
│  waterdeep:8080/v1           │    │    http://waterdeep:8080/v1         │
└─────────────────────────────┘    └─────────────────────────────────────┘

Fine-Tuning Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Fine-Tuning Pipeline (Kubeflow)                      │
│                                                                             │
│  Trigger: weekly cron or manual (after significant codebase changes)        │
│                                                                             │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────────────┐    │
│  │ 1. Clone repos│    │ 2. Build training │    │ 3. Upload dataset to   │    │
│  │    from Gitea │───▶│    dataset        │───▶│    S3                  │    │
│  │    (all repos)│    │    (instruction   │    │    training-data/      │    │
│  │               │    │     pairs + FIM)  │    │    code-finetune/      │    │
│  └──────────────┘    └──────────────────┘    └──────────┬─────────────┘    │
│                                                          │                  │
│  ┌──────────────────────────────────────────────────────┐│                  │
│  │ 4. Trigger LoRA fine-tune on waterdeep               ││                  │
│  │    (SSH or webhook → mlx-lm lora on Metal GPU)       │◀                  │
│  │                                                      │                   │
│  │    Base: Qwen2.5-Coder-32B-Instruct (MLX Q8_0)      │                   │
│  │    Method: LoRA (r=16, alpha=32)                     │                   │
│  │    Data: instruction pairs + fill-in-middle samples  │                   │
│  │    Epochs: 3–5                                      │                   │
│  │    Output: adapter weights (~50–200 MB)              │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                    │
│  ┌──────────────────────▼───────────────────────────────┐                   │
│  │ 5. Evaluate adapter                                   │                   │
│  │    • HumanEval pass@1 (baseline vs fine-tuned)        │                   │
│  │    • Project-specific eval (handler-base patterns,    │                   │
│  │      Kubeflow pipeline templates, Flux manifests)     │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                    │
│  ┌──────────────────────▼───┐  ┌────────────────────────────────────────┐   │
│  │ 6. Push adapter to Gitea │  │ 7. Log metrics to MLflow               │   │
│  │    code-lora-adapters    │  │    experiment: waterdeep-coder-finetune │   │
│  │    repo (versioned)      │  │    metrics: eval_loss, humaneval,       │   │
│  └──────────────────────────┘  │             project_specific_score      │   │
│                                └────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ 8. Deploy adapter on waterdeep                                      │    │
│  │    • Pull latest adapter from Gitea                                 │    │
│  │    • Restart mlx-lm-server with --adapter-path pointing to new ver  │    │
│  │    • Smoke test: send test completion requests                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Training Data Preparation

The training dataset is built from all DaviesTechLabs repositories:

| Source | Format | Purpose |
|---|---|---|
| Go handlers (chat-handler, voice-assistant, etc.) | Instruction pairs | Teach handler-base API patterns, NATS message handling, protobuf encoding |
| Kubeflow pipelines (kubeflow/*.py) | Instruction pairs | Teach pipeline structure, KFP component patterns, S3 data flow |
| Ray Serve apps (ray-serve/) | Instruction pairs | Teach Ray Serve deployment, vLLM config, model serving patterns |
| Flux manifests (homelab-k8s2/) | Instruction pairs | Teach HelmRelease, Kustomization, namespace layout |
| Argo workflows (argo/*.yaml) | Instruction pairs | Teach WorkflowTemplate patterns, NATS triggers |
| ADRs (homelab-design/decisions/) | Instruction pairs | Teach architecture rationale and decision format |
| All source files | Fill-in-middle (FIM) | Teach code completion with project-specific context |

Instruction pair example (Go handler):

{
  "instruction": "Create a new NATS handler module that bridges to an external gRPC service, following the handler-base pattern used in chat-handler and voice-assistant.",
  "output": "package main\n\nimport (\n\t\"context\"\n\t\"os\"\n\t\"os/signal\"\n\t\"syscall\"\n\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/config\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/handler\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/health\"\n\t..."
}

Fill-in-middle example:

{
  "prefix": "func (h *Handler) HandleMessage(ctx context.Context, msg *messages.UserMessage) (*messages.AssistantMessage, error) {\n\t",
  "suffix": "\n\treturn response, nil\n}",
  "middle": "response, err := h.client.Complete(ctx, msg.Content)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"completion failed: %w\", err)\n\t}"
}
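
Generating FIM records like the one above is mechanical: pick a span inside a source file and emit the surrounding text as prefix and suffix. A minimal sketch — the splitting heuristic here (a random interior span) is illustrative only; a real pipeline would cut on syntactic boundaries such as statement or block edges:

```python
import random

def make_fim_sample(source: str, rng: random.Random) -> dict:
    """Split a source file into a prefix/middle/suffix FIM training record."""
    # Two distinct cut points strictly inside the file, so all parts are non-empty.
    a, b = sorted(rng.sample(range(1, len(source)), 2))
    return {"prefix": source[:a], "middle": source[a:b], "suffix": source[b:]}

code = 'func Add(a, b int) int {\n\treturn a + b\n}\n'
sample = make_fim_sample(code, random.Random(0))
# Reassembling the three parts must reproduce the original file exactly.
assert sample["prefix"] + sample["middle"] + sample["suffix"] == code
```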

Implementation Plan

1. Model Setup

# Install MLX and mlx-lm via uv (per ADR-0012)
uv tool install mlx-lm

# Download and convert Qwen 2.5 Coder 32B Instruct to MLX Q8_0 format
mlx_lm.convert \
  --hf-path Qwen/Qwen2.5-Coder-32B-Instruct \
  --mlx-path ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --quantize \
  --q-bits 8

# Verify model loads and generates
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --prompt "def fibonacci(n: int) -> int:"

2. Inference Server (launchd)

# Start the server manually first to verify
mlx_lm.server \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/latest \
  --host 0.0.0.0 \
  --port 8080

# Verify OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a Go handler using handler-base that processes NATS messages"}],
    "max_tokens": 512
  }'

launchd plist (~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.daviestechlabs.mlx-coder</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/.local/bin/mlx_lm.server</string>
    <string>--model</string>
    <string>/Users/billy/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8</string>
    <string>--adapter-path</string>
    <string>/Users/billy/.mlx-models/adapters/homelab-coder/latest</string>
    <string>--host</string>
    <string>0.0.0.0</string>
    <string>--port</string>
    <string>8080</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/billy/.mlx-models/logs/server.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/billy/.mlx-models/logs/server.err</string>
</dict>
</plist>

# Load the service
launchctl load ~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist

# Verify it's running
launchctl list | grep mlx-coder
curl http://waterdeep.lab.daviestechlabs.io:8080/v1/models

3. Client Configuration

OpenCode (~/.config/opencode/config.json on any dev machine):

{
  "provider": "openai",
  "model": "qwen2.5-coder-32b",
  "baseURL": "http://waterdeep.lab.daviestechlabs.io:8080/v1"
}

VS Code (settings.json — Continue.dev extension):

{
  "continue.models": [
    {
      "title": "waterdeep-coder",
      "provider": "openai",
      "model": "qwen2.5-coder-32b",
      "apiBase": "http://waterdeep.lab.daviestechlabs.io:8080/v1",
      "apiKey": "not-needed"
    }
  ]
}
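
Any other OpenAI SDK or plain HTTP client can talk to the same endpoint. A minimal stdlib sketch that constructs a chat-completion request against the proposed URL — actually sending it assumes the server is up, so the sketch only builds the request object:

```python
import json
import urllib.request

BASE_URL = "http://waterdeep.lab.daviestechlabs.io:8080/v1"

def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Construct an OpenAI-compatible chat completion request (not sent here)."""
    body = {
        "model": "qwen2.5-coder-32b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a Go NATS handler using handler-base")
# urllib.request.urlopen(req) would send it once the server is running
```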

4. Fine-Tuning on waterdeep (MLX LoRA)

# Prepare training data (run on cluster via Kubeflow, or locally)
# Output: train.jsonl and valid.jsonl in chat/instruction format

# Fine-tune with LoRA using mlx-lm
mlx_lm.lora \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --train \
  --data ~/.mlx-models/training-data/homelab-coder/ \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --lora-layers 16 \
  --batch-size 1 \
  --iters 1000 \
  --learning-rate 1e-5 \
  --val-batches 25 \
  --save-every 100

# Evaluate the adapter
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --prompt "Create a new Go NATS handler using handler-base that..."

# Update the 'latest' symlink
ln -sfn ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d) \
        ~/.mlx-models/adapters/homelab-coder/latest

# Restart the server to pick up new adapter
launchctl kickstart -k gui/$(id -u)/io.daviestechlabs.mlx-coder
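
The "evaluate the adapter" step above can grow into a project-specific eval: prompt the model and check the output for the conventions a handler must follow. A sketch with a stubbed `generate` function standing in for a call to the server — both the checks and the stub are illustrative, not an existing harness:

```python
# Minimal project-convention eval: does generated Go code follow handler-base?
CHECKS = {
    "uses handler-base import": "daviestechlabs/handler-base",
    "declares package main": "package main",
    "wraps errors with %w": "%w",
}

def score(generated: str) -> float:
    """Fraction of convention checks the generated code satisfies."""
    passed = sum(1 for needle in CHECKS.values() if needle in generated)
    return passed / len(CHECKS)

# Stub standing in for a completion request to the mlx-lm server.
def generate(prompt: str) -> str:
    return (
        'package main\n\n'
        'import "git.daviestechlabs.io/daviestechlabs/handler-base/handler"\n'
        '// ... fmt.Errorf("handle failed: %w", err)\n'
    )

print(score(generate("Create a NATS handler using handler-base")))  # → 1.0
```

Scores per prompt can be logged to MLflow alongside HumanEval to track whether each new adapter actually improves convention adherence.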

5. Training Data Pipeline (Kubeflow)

A new code_finetune_pipeline.py orchestrates dataset preparation on the cluster:

 code_finetune_pipeline.yaml
       │
       ├── 1. clone_repos           Clone all DaviesTechLabs repos from Gitea
       ├── 2. extract_patterns      Parse Go, Python, YAML files into instruction pairs
       ├── 3. generate_fim          Create fill-in-middle samples from source files
       ├── 4. deduplicate           Remove near-duplicate samples (MinHash)
       ├── 5. format_dataset        Convert to mlx-lm JSONL format (train + validation split)
       ├── 6. upload_to_s3          Push dataset to s3://training-data/code-finetune/{run_id}/
       └── 7. log_to_mlflow         Log dataset stats (num_samples, token_count, repo_coverage)
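
Step 4's near-duplicate filter can be prototyped with a tiny MinHash: shingle each sample, keep the minimum of k salted hashes, and compare how often two signatures agree. This sketch uses only the standard library (a production pipeline might use a library such as `datasketch`; that choice is an assumption, not part of this ADR):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a sample."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text: str, k: int = 64) -> list[int]:
    """MinHash signature: the minimum salted hash for each of k hash functions."""
    sig = []
    for salt in range(k):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def similarity(a: str, b: str) -> float:
    """Estimated Jaccard similarity from signature agreement."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

near_dup = similarity("func Handle(msg *nats.Msg) error {",
                      "func Handle(msg *nats.Msg) error  {")
distinct = similarity("func Handle(msg *nats.Msg) error {",
                      "apiVersion: helm.toolkit.fluxcd.io/v2")
assert near_dup > distinct  # near-duplicates agree on far more hash slots
```

Samples whose estimated similarity exceeds a threshold (e.g. 0.8) would be collapsed to one representative before the train/validation split.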

The actual LoRA fine-tune runs on waterdeep (not the cluster) because:

  • mlx-lm LoRA leverages the M4 Pro's Metal GPU — significantly faster than CPU training
  • The model is already loaded on waterdeep — no need to transfer 34 GB to/from the cluster
  • Training a 32B model with LoRA requires ~40 GB — only waterdeep and khelben have enough memory

6. Memory Budget

| Component | Memory |
|---|---|
| macOS + system services | ~3 GB |
| Qwen 2.5 Coder 32B (Q8_0 weights) | ~34 GB |
| KV cache (8192 context) | ~6 GB |
| mlx-lm-server overhead | ~1 GB |
| Total (inference) | ~44 GB |
| Headroom | ~4 GB |

During LoRA fine-tuning (server stopped):

| Component | Memory |
|---|---|
| macOS + system services | ~3 GB |
| Model weights (frozen, Q8_0) | ~34 GB |
| LoRA adapter gradients + optimizer | ~4 GB |
| Training batch + activations | ~5 GB |
| Total (training) | ~46 GB |
| Headroom | ~2 GB |

Both workloads fit within the 48 GB budget. Inference and training are mutually exclusive — the server is stopped during fine-tuning runs to reclaim KV cache memory for training.
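
The two budgets are simple sums and worth machine-checking against the 48 GB ceiling; a quick sketch using the figures from the tables above:

```python
# Memory budgets in GB, copied from the tables above.
inference = {
    "macOS + system services": 3,
    "Q8_0 weights": 34,
    "KV cache (8192 context)": 6,
    "mlx-lm-server overhead": 1,
}
training = {
    "macOS + system services": 3,
    "frozen Q8_0 weights": 34,
    "LoRA gradients + optimizer": 4,
    "training batch + activations": 5,
}

for name, budget in (("inference", inference), ("training", training)):
    total = sum(budget.values())
    print(f"{name}: {total} GB used, {48 - total} GB headroom")
# → inference: 44 GB used, 4 GB headroom
# → training: 46 GB used, 2 GB headroom
```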

Security Considerations

  • mlx-lm-server has no authentication — bind to LAN only; waterdeep's firewall blocks external access
  • No code leaves the network — all inference and training is local
  • Training data is sourced exclusively from Gitea (internal repos) — no external data contamination
  • Adapter weights are versioned in Gitea — auditable lineage from training data to deployed model
  • Consider adding a simple API key check via a reverse proxy (Caddy/nginx) if the LAN is not fully trusted

Future Considerations

  • DGX Spark (ADR-0058): If acquired, DGX Spark could fine-tune larger coding models (70B+) or run full fine-tunes instead of LoRA. waterdeep would remain the serving endpoint unless the DGX Spark also serves inference.
  • Adapter hot-swap: mlx-lm supports loading adapters at request time — could serve multiple fine-tuned adapters (e.g., Go-specific, Python-specific, YAML-specific) from a single base model
  • RAG augmentation: Combine the fine-tuned model with a RAG pipeline that retrieves relevant code snippets from Milvus (ADR-0008) for even better context-aware completions
  • Continuous fine-tuning: Trigger the pipeline automatically on Gitea push events via NATS — the model stays current with codebase changes
  • Evaluation suite: Build a project-specific eval set (handler-base patterns, pipeline templates, Flux manifests) to measure fine-tuning quality beyond generic benchmarks
  • Newer models: As new coding models are released (Qwen 3 Coder, DeepSeek Coder V3, etc.), re-evaluate which model maximises quality within the 48 GB budget

Links

  • Updates: ADR-0059 — waterdeep repurposed from 3D avatar workstation to dedicated coding agent
  • Related: ADR-0058 — Training strategy (distributed CPU + DGX Spark path)
  • Related: ADR-0047 — MLflow experiment tracking
  • Related: ADR-0054 — Kubeflow Pipeline CI/CD
  • Related: ADR-0012 — uv for Python development
  • Related: ADR-0037 — Node naming conventions (waterdeep)
  • Related: ADR-0060 — Internal PKI (TLS for waterdeep endpoint)
  • Qwen 2.5 Coder — Model card
  • MLX LM — Apple MLX language model framework
  • OpenCode — Terminal-based AI coding assistant
  • Continue.dev — VS Code AI coding extension with custom model support