ADR-0064: waterdeep (Mac Mini M4 Pro) as Dedicated Coding Agent with Fine-Tuned Model

  • Status: proposed
  • Date: 2026-02-26
  • Deciders: Billy
  • Technical Story: Repurpose waterdeep as a dedicated local coding agent serving a fine-tuned code-completion model for OpenCode, Copilot Chat, and other AI coding tools, with a pipeline for continually tuning the model on the homelab codebase

Context and Problem Statement

waterdeep is a Mac Mini M4 Pro with 48 GB of unified memory (ADR-0059). Its current role as a 3D avatar creation workstation (ADR-0059) is being superseded by the automated ComfyUI pipeline (ADR-0063), which handles avatar generation on a personal desktop as an on-demand Ray worker. This frees waterdeep for a higher-value use case.

GitHub Copilot and cloud-hosted coding assistants work well for general code, but they have no knowledge of DaviesTechLabs-specific patterns: the handler-base module API, NATS protobuf message conventions, Kubeflow pipeline structure, Ray Serve deployment patterns, Flux/Kustomize layout, or the Go handler lifecycle used across chat-handler, voice-assistant, pipeline-bridge, stt-module, and tts-module. A model fine-tuned on the homelab codebase would produce completions that follow project conventions out of the box.

With 48 GB of unified memory and no other workloads, waterdeep can serve Qwen 2.5 Coder 32B Instruct at Q8_0 quantisation (~34 GB of weights) via MLX, with ample headroom left for KV cache and macOS overhead. It is the largest purpose-built coding model that fits at high quantisation on this hardware, and it consistently outperforms general-purpose 70B models quantised to Q4 on coding benchmarks.

How should we configure waterdeep as a dedicated coding agent and build a pipeline for fine-tuning the model on our codebase?

Decision Drivers

  • waterdeep's 48 GB unified memory is fully available — no competing workloads after ComfyUI pipeline takeover
  • Qwen 2.5 Coder 32B Instruct is the highest-quality open-source coding model that fits at Q8_0 (~34 GB weights + ~10 GB KV cache headroom)
  • MLX on Apple Silicon provides native Metal-accelerated inference with no framework overhead — purpose-built for M-series chips
  • OpenCode and VS Code Copilot Chat both support OpenAI-compatible API endpoints — a local server is a drop-in replacement
  • The homelab codebase has strong conventions (handler-base, protobuf messages, Kubeflow pipelines, Ray Serve apps, Flux GitOps) that a general model doesn't know
  • Existing training infrastructure (ADR-0058) provides Kubeflow Pipelines + MLflow + S3 data flow for fine-tuning orchestration
  • LoRA adapters are small (~50–200 MB) and can be merged into the base model or hot-swapped in mlx-lm-server
  • The cluster's CPU training capacity (126 cores, 378 GB RAM across 14 nodes) can prepare training datasets; waterdeep itself can run the LoRA fine-tune on its Metal GPU
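
The "~50–200 MB" adapter estimate can be sanity-checked from the LoRA parameter count. A rough sketch — the model shape (64 layers, hidden size 5120, KV width 1024) is taken from the public Qwen2.5-32B config and the rank-16/attention-projection choice mirrors the fine-tune settings later in this ADR; treat all of these dimensions as assumptions, not facts stated here:

```python
# Rough LoRA adapter size estimate for Qwen2.5-Coder-32B.
# Assumed model shape (public Qwen2.5-32B config, not from this ADR):
HIDDEN = 5120          # hidden size
LAYERS = 64            # transformer layers
KV_DIM = 8 * 128       # 8 KV heads x head_dim 128 = 1024

def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """A LoRA pair adds a (d_in x r) and an (r x d_out) matrix."""
    return rank * (d_in + d_out)

def adapter_size_mb(rank: int = 16, bytes_per_param: int = 2) -> float:
    """Size of a LoRA adapter over q/k/v/o projections, fp16 storage."""
    per_layer = (
        lora_params(rank, HIDDEN, HIDDEN)    # q_proj
        + lora_params(rank, HIDDEN, KV_DIM)  # k_proj
        + lora_params(rank, HIDDEN, KV_DIM)  # v_proj
        + lora_params(rank, HIDDEN, HIDDEN)  # o_proj
    )
    total = per_layer * LAYERS
    return total * bytes_per_param / 1e6

print(f"{adapter_size_mb():.0f} MB")  # → 67 MB at rank 16 across all layers
```

Rank 16 on every layer lands near the low end of the quoted range; tuning only a subset of layers (as the fine-tune commands below do) shrinks the adapter further, while higher ranks or more target modules push toward the upper end.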

Considered Options

  1. Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server on waterdeep — fine-tuned with LoRA on the homelab codebase using MLX
  2. Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp on waterdeep — larger general-purpose model at aggressive quantisation
  3. DeepSeek Coder V2 Lite 16B via MLX on waterdeep — smaller coding model, lower resource usage
  4. Keep using cloud Copilot only — no local model, no fine-tuning

Decision Outcome

Chosen option: Option 1 — Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server, because it is the best-in-class open-source coding model at a quantisation level that preserves near-full quality, fits comfortably within the 48 GB memory budget with room for KV cache, and MLX provides the optimal inference stack for Apple Silicon. Fine-tuning with LoRA on the homelab codebase will specialise the model to project conventions.

Positive Consequences

  • Purpose-built coding model — Qwen 2.5 Coder 32B tops open-source coding benchmarks (HumanEval, MBPP, BigCodeBench)
  • Q8_0 quantisation preserves >99% of full-precision quality — minimal degradation vs Q4
  • ~34 GB model weights + ~10 GB KV cache headroom = comfortable fit in 48 GB unified memory
  • MLX inference leverages Metal GPU for token generation — fast enough for interactive coding assistance
  • OpenAI-compatible API via mlx-lm-server — works with OpenCode, VS Code Copilot Chat (custom endpoint), Continue.dev, and any OpenAI SDK client
  • Fine-tuned LoRA adapter teaches project-specific patterns: handler-base API, NATS message conventions, Kubeflow pipeline structure, Flux layout
  • LoRA fine-tuning runs directly on waterdeep using mlx-lm — no cluster resources needed for training
  • Adapter files are small (~50–200 MB) — easy to version in Gitea and track in MLflow
  • Fully offline — no cloud dependency, no data leaves the network
  • Frees Copilot quota for non-coding tasks — local model handles bulk code completion

Negative Consequences

  • waterdeep is dedicated to this role — cannot simultaneously serve other workloads (Blender, etc.)
  • Model updates require manual download and conversion to MLX format
  • LoRA fine-tuning quality depends on training data curation — garbage in, garbage out
  • 32B model is slower than cloud Copilot for very long completions — acceptable for interactive use
  • Single point of failure — if waterdeep is down, fall back to cloud Copilot

Pros and Cons of the Options

Option 1: Qwen 2.5 Coder 32B Instruct (Q8_0) via MLX

  • Good, because purpose-built for code — trained on 5.5T tokens of code data
  • Good, because 32B at Q8_0 (~34 GB) fits in 48 GB with KV cache headroom
  • Good, because Q8_0 preserves near-full quality (vs Q4 which drops noticeably on coding tasks)
  • Good, because MLX is Apple's native framework — zero-copy unified memory, Metal GPU kernels
  • Good, because mlx-lm supports LoRA fine-tuning natively — train and serve on the same machine
  • Good, because OpenAI-compatible API (mlx-lm-server) — drop-in for any coding tool
  • Bad, because 32B generates ~15–25 tokens/sec on M4 Pro — adequate but not instant for long outputs
  • Bad, because MLX model format requires conversion from HuggingFace (one-time, scripted)

Option 2: Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp

  • Good, because 70B is a larger, more capable general model
  • Good, because llama.cpp is mature and well-supported on macOS
  • Bad, because Q4_K_M quantisation loses meaningful quality — especially on code tasks where precision matters
  • Bad, because ~42 GB weights leaves only ~6 GB for KV cache — tight, risks OOM on long contexts
  • Bad, because general-purpose model — not trained specifically for code, underperforms Qwen 2.5 Coder 32B on coding benchmarks despite being 2× larger
  • Bad, because slower token generation (~8–12 tok/s) due to larger model size
  • Bad, because llama.cpp doesn't natively support LoRA fine-tuning — need a separate training framework

Option 3: DeepSeek Coder V2 Lite 16B via MLX

  • Good, because smaller model — faster inference (~30–40 tok/s), lighter memory footprint
  • Good, because still a capable coding model
  • Bad, because significantly less capable than Qwen 2.5 Coder 32B on benchmarks
  • Bad, because leaves 30+ GB of unified memory unused — not maximising the hardware
  • Bad, because fewer parameters mean less capacity to absorb fine-tuning knowledge

Option 4: Cloud Copilot only

  • Good, because zero local infrastructure to maintain
  • Good, because always up-to-date with latest model improvements
  • Bad, because no knowledge of homelab-specific conventions — completions require heavy editing
  • Bad, because cloud latency for every completion
  • Bad, because data (code context) leaves the network
  • Bad, because wastes waterdeep's 48 GB of unified memory sitting idle

Architecture

Inference Server

┌──────────────────────────────────────────────────────────────────────────┐
│  waterdeep (Mac Mini M4 Pro · 48 GB unified · Metal GPU · dedicated)    │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  mlx-lm-server (launchd-managed)                                   │  │
│  │                                                                    │  │
│  │  Model: Qwen2.5-Coder-32B-Instruct (Q8_0, MLX format)             │  │
│  │  LoRA:  ~/.mlx-models/adapters/homelab-coder/latest/               │  │
│  │                                                                    │  │
│  │  Endpoint: http://waterdeep.lab.daviestechlabs.io:8080/v1          │  │
│  │  ├── /v1/completions         (code completion, FIM)                │  │
│  │  ├── /v1/chat/completions    (chat / instruct)                     │  │
│  │  └── /v1/models              (model listing)                       │  │
│  │                                                                    │  │
│  │  Memory: ~34 GB model + ~10 GB KV cache = ~44 GB                   │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌─────────────────────────┐  ┌──────────────────────────────────────┐  │
│  │  macOS overhead ~3 GB    │  │  Training (on-demand, same GPU)      │  │
│  │  (kernel, WindowServer,  │  │  mlx-lm LoRA fine-tune               │  │
│  │   mDNSResponder, etc.)   │  │  (server stopped during training)    │  │
│  └─────────────────────────┘  └──────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
         │
         │ HTTP :8080 (OpenAI-compatible API)
         │
    ┌────┴──────────────────────────────────────────────────────┐
    │                                                            │
    ▼                                                            ▼
┌─────────────────────────────┐    ┌─────────────────────────────────────┐
│  VS Code (any machine)      │    │  OpenCode (terminal, any machine)   │
│                              │    │                                     │
│  Copilot Chat / Continue.dev │    │  OPENCODE_MODEL_PROVIDER=openai     │
│  Custom endpoint →           │    │  OPENAI_API_BASE=                   │
│  waterdeep:8080/v1           │    │    http://waterdeep:8080/v1         │
└─────────────────────────────┘    └─────────────────────────────────────┘

Fine-Tuning Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Fine-Tuning Pipeline (Kubeflow)                      │
│                                                                             │
│  Trigger: weekly cron or manual (after significant codebase changes)        │
│                                                                             │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────────────┐    │
│  │ 1. Clone repos│    │ 2. Build training │    │ 3. Upload dataset to   │    │
│  │    from Gitea │───▶│    dataset        │───▶│    S3                  │    │
│  │    (all repos)│    │    (instruction   │    │    training-data/      │    │
│  │               │    │     pairs + FIM)  │    │    code-finetune/      │    │
│  └──────────────┘    └──────────────────┘    └──────────┬─────────────┘    │
│                                                          │                  │
│  ┌──────────────────────────────────────────────────────┐│                  │
│  │ 4. Trigger LoRA fine-tune on waterdeep               ││                  │
│  │    (SSH or webhook → mlx-lm lora on Metal GPU)       │◀                  │
│  │                                                      │                   │
│  │    Base: Qwen2.5-Coder-32B-Instruct (MLX Q8_0)      │                   │
│  │    Method: LoRA (r=16, alpha=32)                     │                   │
│  │    Data: instruction pairs + fill-in-middle samples  │                   │
│  │    Epochs: 3–5                                      │                   │
│  │    Output: adapter weights (~50–200 MB)              │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                    │
│  ┌──────────────────────▼───────────────────────────────┐                   │
│  │ 5. Evaluate adapter                                   │                   │
│  │    • HumanEval pass@1 (baseline vs fine-tuned)        │                   │
│  │    • Project-specific eval (handler-base patterns,    │                   │
│  │      Kubeflow pipeline templates, Flux manifests)     │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                    │
│  ┌──────────────────────▼───┐  ┌────────────────────────────────────────┐   │
│  │ 6. Push adapter to Gitea │  │ 7. Log metrics to MLflow               │   │
│  │    code-lora-adapters    │  │    experiment: waterdeep-coder-finetune │   │
│  │    repo (versioned)      │  │    metrics: eval_loss, humaneval,       │   │
│  └──────────────────────────┘  │             project_specific_score      │   │
│                                └────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ 8. Deploy adapter on waterdeep                                      │    │
│  │    • Pull latest adapter from Gitea                                 │    │
│  │    • Restart mlx-lm-server with --adapter-path pointing to new ver  │    │
│  │    • Smoke test: send test completion requests                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Training Data Preparation

The training dataset is built from all DaviesTechLabs repositories:

| Source | Format | Purpose |
|---|---|---|
| Go handlers (chat-handler, voice-assistant, etc.) | Instruction pairs | Teach handler-base API patterns, NATS message handling, protobuf encoding |
| Kubeflow pipelines (kubeflow/*.py) | Instruction pairs | Teach pipeline structure, KFP component patterns, S3 data flow |
| Ray Serve apps (ray-serve/) | Instruction pairs | Teach Ray Serve deployment, vLLM config, model serving patterns |
| Flux manifests (homelab-k8s2/) | Instruction pairs | Teach HelmRelease, Kustomization, namespace layout |
| Argo workflows (argo/*.yaml) | Instruction pairs | Teach WorkflowTemplate patterns, NATS triggers |
| ADRs (homelab-design/decisions/) | Instruction pairs | Teach architecture rationale and decision format |
| All source files | Fill-in-middle (FIM) | Teach code completion with project-specific context |

Instruction pair example (Go handler):

{
  "instruction": "Create a new NATS handler module that bridges to an external gRPC service, following the handler-base pattern used in chat-handler and voice-assistant.",
  "output": "package main\n\nimport (\n\t\"context\"\n\t\"os\"\n\t\"os/signal\"\n\t\"syscall\"\n\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/config\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/handler\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/health\"\n\t..."
}

Fill-in-middle example:

{
  "prefix": "func (h *Handler) HandleMessage(ctx context.Context, msg *messages.UserMessage) (*messages.AssistantMessage, error) {\n\t",
  "suffix": "\n\treturn response, nil\n}",
  "middle": "response, err := h.client.Complete(ctx, msg.Content)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"completion failed: %w\", err)\n\t}"
}
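
Generating FIM records like the one above is mechanical: pick a span inside a source file and emit the surrounding text as prefix and suffix. A minimal sketch — the splitting heuristic here (a random interior span) is illustrative only; a real pipeline would cut on syntactic boundaries such as statement or block edges:

```python
import random

def make_fim_sample(source: str, rng: random.Random) -> dict:
    """Split a source file into a prefix/middle/suffix FIM training record."""
    # Two distinct cut points strictly inside the file, so all parts are non-empty.
    a, b = sorted(rng.sample(range(1, len(source)), 2))
    return {"prefix": source[:a], "middle": source[a:b], "suffix": source[b:]}

code = 'func Add(a, b int) int {\n\treturn a + b\n}\n'
sample = make_fim_sample(code, random.Random(0))
# Reassembling the three parts must reproduce the original file exactly.
assert sample["prefix"] + sample["middle"] + sample["suffix"] == code
```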

Implementation Plan

1. Model Setup

# Install MLX and mlx-lm via uv (per ADR-0012)
uv tool install mlx-lm

# Download and convert Qwen 2.5 Coder 32B Instruct to MLX Q8_0 format
mlx_lm.convert \
  --hf-path Qwen/Qwen2.5-Coder-32B-Instruct \
  --mlx-path ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --quantize \
  --q-bits 8

# Verify model loads and generates
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --prompt "def fibonacci(n: int) -> int:"

2. Inference Server (launchd)

# Start the server manually first to verify
mlx_lm.server \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/latest \
  --host 0.0.0.0 \
  --port 8080

# Verify OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a Go handler using handler-base that processes NATS messages"}],
    "max_tokens": 512
  }'

launchd plist (~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.daviestechlabs.mlx-coder</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/.local/bin/mlx_lm.server</string>
    <string>--model</string>
    <string>/Users/billy/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8</string>
    <string>--adapter-path</string>
    <string>/Users/billy/.mlx-models/adapters/homelab-coder/latest</string>
    <string>--host</string>
    <string>0.0.0.0</string>
    <string>--port</string>
    <string>8080</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/billy/.mlx-models/logs/server.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/billy/.mlx-models/logs/server.err</string>
</dict>
</plist>

# Load the service
launchctl load ~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist

# Verify it's running
launchctl list | grep mlx-coder
curl http://waterdeep.lab.daviestechlabs.io:8080/v1/models

3. Client Configuration

OpenCode (~/.config/opencode/config.json on any dev machine):

{
  "provider": "openai",
  "model": "qwen2.5-coder-32b",
  "baseURL": "http://waterdeep.lab.daviestechlabs.io:8080/v1"
}

VS Code (settings.json — Continue.dev extension):

{
  "continue.models": [
    {
      "title": "waterdeep-coder",
      "provider": "openai",
      "model": "qwen2.5-coder-32b",
      "apiBase": "http://waterdeep.lab.daviestechlabs.io:8080/v1",
      "apiKey": "not-needed"
    }
  ]
}
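
Any other OpenAI SDK or plain HTTP client can talk to the same endpoint. A minimal stdlib sketch that constructs a chat-completion request against the proposed URL — actually sending it assumes the server is up, so the sketch only builds the request object:

```python
import json
import urllib.request

BASE_URL = "http://waterdeep.lab.daviestechlabs.io:8080/v1"

def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Construct an OpenAI-compatible chat completion request (not sent here)."""
    body = {
        "model": "qwen2.5-coder-32b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a Go NATS handler using handler-base")
# urllib.request.urlopen(req) would send it once the server is running
```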

4. Fine-Tuning on waterdeep (MLX LoRA)

# Prepare training data (run on cluster via Kubeflow, or locally)
# Output: train.jsonl and valid.jsonl in chat/instruction format

# Fine-tune with LoRA using mlx-lm
mlx_lm.lora \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --train \
  --data ~/.mlx-models/training-data/homelab-coder/ \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --lora-layers 16 \
  --batch-size 1 \
  --iters 1000 \
  --learning-rate 1e-5 \
  --val-batches 25 \
  --save-every 100

# Evaluate the adapter
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --prompt "Create a new Go NATS handler using handler-base that..."

# Update the 'latest' symlink
ln -sfn ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d) \
        ~/.mlx-models/adapters/homelab-coder/latest

# Restart the server to pick up new adapter
launchctl kickstart -k gui/$(id -u)/io.daviestechlabs.mlx-coder
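
The "evaluate the adapter" step above can grow into a project-specific eval: prompt the model and check the output for the conventions a handler must follow. A sketch with a stubbed `generate` function standing in for a call to the server — both the checks and the stub are illustrative, not an existing harness:

```python
# Minimal project-convention eval: does generated Go code follow handler-base?
CHECKS = {
    "uses handler-base import": "daviestechlabs/handler-base",
    "declares package main": "package main",
    "wraps errors with %w": "%w",
}

def score(generated: str) -> float:
    """Fraction of convention checks the generated code satisfies."""
    passed = sum(1 for needle in CHECKS.values() if needle in generated)
    return passed / len(CHECKS)

# Stub standing in for a completion request to the mlx-lm server.
def generate(prompt: str) -> str:
    return (
        'package main\n\n'
        'import "git.daviestechlabs.io/daviestechlabs/handler-base/handler"\n'
        '// ... fmt.Errorf("handle failed: %w", err)\n'
    )

print(score(generate("Create a NATS handler using handler-base")))  # → 1.0
```

Scores per prompt can be logged to MLflow alongside HumanEval to track whether each new adapter actually improves convention adherence.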

5. Training Data Pipeline (Kubeflow)

A new code_finetune_pipeline.py orchestrates dataset preparation on the cluster:

 code_finetune_pipeline.yaml
       │
       ├── 1. clone_repos           Clone all DaviesTechLabs repos from Gitea
       ├── 2. extract_patterns      Parse Go, Python, YAML files into instruction pairs
       ├── 3. generate_fim          Create fill-in-middle samples from source files
       ├── 4. deduplicate           Remove near-duplicate samples (MinHash)
       ├── 5. format_dataset        Convert to mlx-lm JSONL format (train + validation split)
       ├── 6. upload_to_s3          Push dataset to s3://training-data/code-finetune/{run_id}/
       └── 7. log_to_mlflow         Log dataset stats (num_samples, token_count, repo_coverage)
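
Step 4's near-duplicate filter can be prototyped with a tiny MinHash: shingle each sample, keep the minimum of k salted hashes, and compare how often two signatures agree. This sketch uses only the standard library (a production pipeline might use a library such as `datasketch`; that choice is an assumption, not part of this ADR):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of a sample."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text: str, k: int = 64) -> list[int]:
    """MinHash signature: the minimum salted hash for each of k hash functions."""
    sig = []
    for salt in range(k):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def similarity(a: str, b: str) -> float:
    """Estimated Jaccard similarity from signature agreement."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

near_dup = similarity("func Handle(msg *nats.Msg) error {",
                      "func Handle(msg *nats.Msg) error  {")
distinct = similarity("func Handle(msg *nats.Msg) error {",
                      "apiVersion: helm.toolkit.fluxcd.io/v2")
assert near_dup > distinct  # near-duplicates agree on far more hash slots
```

Samples whose estimated similarity exceeds a threshold (e.g. 0.8) would be collapsed to one representative before the train/validation split.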

The actual LoRA fine-tune runs on waterdeep (not the cluster) because:

  • mlx-lm LoRA leverages the M4 Pro's Metal GPU — significantly faster than CPU training
  • The model is already loaded on waterdeep — no need to transfer 34 GB to/from the cluster
  • Training a 32B model with LoRA requires ~40 GB — only waterdeep and khelben have enough memory

6. Memory Budget

| Component | Memory |
|---|---|
| macOS + system services | ~3 GB |
| Qwen 2.5 Coder 32B (Q8_0 weights) | ~34 GB |
| KV cache (8192 context) | ~6 GB |
| mlx-lm-server overhead | ~1 GB |
| Total (inference) | ~44 GB |
| Headroom | ~4 GB |

During LoRA fine-tuning (server stopped):

| Component | Memory |
|---|---|
| macOS + system services | ~3 GB |
| Model weights (frozen, Q8_0) | ~34 GB |
| LoRA adapter gradients + optimizer | ~4 GB |
| Training batch + activations | ~5 GB |
| Total (training) | ~46 GB |
| Headroom | ~2 GB |

Both workloads fit within the 48 GB budget. Inference and training are mutually exclusive — the server is stopped during fine-tuning runs to reclaim KV cache memory for training.
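
The two budgets are simple sums and worth machine-checking against the 48 GB ceiling; a quick sketch using the figures from the tables above:

```python
# Memory budgets in GB, copied from the tables above.
inference = {
    "macOS + system services": 3,
    "Q8_0 weights": 34,
    "KV cache (8192 context)": 6,
    "mlx-lm-server overhead": 1,
}
training = {
    "macOS + system services": 3,
    "frozen Q8_0 weights": 34,
    "LoRA gradients + optimizer": 4,
    "training batch + activations": 5,
}

for name, budget in (("inference", inference), ("training", training)):
    total = sum(budget.values())
    print(f"{name}: {total} GB used, {48 - total} GB headroom")
# → inference: 44 GB used, 4 GB headroom
# → training: 46 GB used, 2 GB headroom
```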

Security Considerations

  • mlx-lm-server has no authentication — bind to LAN only; waterdeep's firewall blocks external access
  • No code leaves the network — all inference and training is local
  • Training data is sourced exclusively from Gitea (internal repos) — no external data contamination
  • Adapter weights are versioned in Gitea — auditable lineage from training data to deployed model
  • Consider adding a simple API key check via a reverse proxy (Caddy/nginx) if the LAN is not fully trusted

Future Considerations

  • DGX Spark (ADR-0058): If acquired, DGX Spark could fine-tune larger coding models (70B+) or run full fine-tunes instead of LoRA. waterdeep would remain the serving endpoint unless the DGX Spark also serves inference.
  • Adapter hot-swap: mlx-lm supports loading adapters at request time — could serve multiple fine-tuned adapters (e.g., Go-specific, Python-specific, YAML-specific) from a single base model
  • RAG augmentation: Combine the fine-tuned model with a RAG pipeline that retrieves relevant code snippets from Milvus (ADR-0008) for even better context-aware completions
  • Continuous fine-tuning: Trigger the pipeline automatically on Gitea push events via NATS — the model stays current with codebase changes
  • Evaluation suite: Build a project-specific eval set (handler-base patterns, pipeline templates, Flux manifests) to measure fine-tuning quality beyond generic benchmarks
  • Newer models: As new coding models are released (Qwen 3 Coder, DeepSeek Coder V3, etc.), re-evaluate which model maximises quality within the 48 GB budget

Links

  • Updates: ADR-0059 — waterdeep repurposed from 3D avatar workstation to dedicated coding agent
  • Related: ADR-0058 — Training strategy (distributed CPU + DGX Spark path)
  • Related: ADR-0047 — MLflow experiment tracking
  • Related: ADR-0054 — Kubeflow Pipeline CI/CD
  • Related: ADR-0012 — uv for Python development
  • Related: ADR-0037 — Node naming conventions (waterdeep)
  • Related: ADR-0060 — Internal PKI (TLS for waterdeep endpoint)
  • Qwen 2.5 Coder — Model card
  • MLX LM — Apple MLX language model framework
  • OpenCode — Terminal-based AI coding assistant
  • Continue.dev — VS Code AI coding extension with custom model support