From 6e574ffc4b45481a4194b12b8334366779ece608 Mon Sep 17 00:00:00 2001 From: "Billy D." Date: Thu, 26 Feb 2026 06:17:00 -0500 Subject: [PATCH] adding in proto for mr beaker. --- decisions/0064-waterdeep-coding-agent.md | 445 +++++++++++++++++++++++ 1 file changed, 445 insertions(+) create mode 100644 decisions/0064-waterdeep-coding-agent.md diff --git a/decisions/0064-waterdeep-coding-agent.md b/decisions/0064-waterdeep-coding-agent.md new file mode 100644 index 0000000..e586a6b --- /dev/null +++ b/decisions/0064-waterdeep-coding-agent.md @@ -0,0 +1,445 @@ +# waterdeep (Mac Mini M4 Pro) as Dedicated Coding Agent with Fine-Tuned Model + +* Status: proposed +* Date: 2026-02-26 +* Deciders: Billy +* Technical Story: Repurpose waterdeep as a dedicated local coding agent serving a fine-tuned code-completion model for OpenCode, Copilot Chat, and other AI coding tools, with a pipeline for continually tuning the model on the homelab codebase + +## Context and Problem Statement + +**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory ([ADR-0059](0059-mac-mini-ray-worker.md)). Its current role as a 3D avatar creation workstation ([ADR-0059](0059-mac-mini-ray-worker.md)) is being superseded by the automated ComfyUI pipeline ([ADR-0063](0063-comfyui-3d-avatar-pipeline.md)), which handles avatar generation on a personal desktop as an on-demand Ray worker. This frees waterdeep for a higher-value use case. + +GitHub Copilot and cloud-hosted coding assistants work well for general code, but they have no knowledge of DaviesTechLabs-specific patterns: the handler-base module API, NATS protobuf message conventions, Kubeflow pipeline structure, Ray Serve deployment patterns, Flux/Kustomize layout, or the Go handler lifecycle used across chat-handler, voice-assistant, pipeline-bridge, stt-module, and tts-module. A model fine-tuned on the homelab codebase would produce completions that follow project conventions out of the box. 

With 48 GB of unified memory and no other workloads, waterdeep can serve **Qwen 2.5 Coder 32B Instruct** at Q8_0 quantisation (~34 GB) via MLX, with ample headroom left for the KV cache and macOS overhead, so the machine stays responsive while serving. This is the largest purpose-built coding model that fits at high quantisation on this hardware, and it consistently outperforms general-purpose 70B models at Q4 on coding benchmarks.

How should we configure waterdeep as a dedicated coding agent and build a pipeline for fine-tuning the model on our codebase?

## Decision Drivers

* waterdeep's 48 GB unified memory is fully available — no competing workloads after ComfyUI pipeline takeover
* Qwen 2.5 Coder 32B Instruct is the highest-quality open-source coding model that fits at Q8_0 (~34 GB weights + ~10 GB KV cache headroom)
* MLX on Apple Silicon provides native Metal-accelerated inference with no framework overhead — purpose-built for M-series chips
* OpenCode and VS Code Copilot Chat both support OpenAI-compatible API endpoints — a local server is a drop-in replacement
* The homelab codebase has strong conventions (handler-base, protobuf messages, Kubeflow pipelines, Ray Serve apps, Flux GitOps) that a general model doesn't know
* Existing training infrastructure ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) provides Kubeflow Pipelines + MLflow + S3 data flow for fine-tuning orchestration
* LoRA adapters are small (~50–200 MB) and can be merged into the base model or hot-swapped in mlx-lm-server
* The cluster's CPU training capacity (126 cores, 378 GB RAM across 14 nodes) can prepare training datasets; waterdeep itself can run the LoRA fine-tune on its Metal GPU

## Considered Options

1. **Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server on waterdeep** — fine-tuned with LoRA on the homelab codebase using MLX
2. 
**Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp on waterdeep** — larger general-purpose model at aggressive quantisation +3. **DeepSeek Coder V2 Lite 16B via MLX on waterdeep** — smaller coding model, lower resource usage +4. **Keep using cloud Copilot only** — no local model, no fine-tuning + +## Decision Outcome + +Chosen option: **Option 1 — Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server**, because it is the best-in-class open-source coding model at a quantisation level that preserves near-full quality, fits comfortably within the 48 GB memory budget with room for KV cache, and MLX provides the optimal inference stack for Apple Silicon. Fine-tuning with LoRA on the homelab codebase will specialise the model to project conventions. + +### Positive Consequences + +* Purpose-built coding model — Qwen 2.5 Coder 32B tops open-source coding benchmarks (HumanEval, MBPP, BigCodeBench) +* Q8_0 quantisation preserves >99% of full-precision quality — minimal degradation vs Q4 +* ~34 GB model weights + ~10 GB KV cache headroom = comfortable fit in 48 GB unified memory +* MLX inference leverages Metal GPU for token generation — fast enough for interactive coding assistance +* OpenAI-compatible API via mlx-lm-server — works with OpenCode, VS Code Copilot Chat (custom endpoint), Continue.dev, and any OpenAI SDK client +* Fine-tuned LoRA adapter teaches project-specific patterns: handler-base API, NATS message conventions, Kubeflow pipeline structure, Flux layout +* LoRA fine-tuning runs directly on waterdeep using mlx-lm — no cluster resources needed for training +* Adapter files are small (~50–200 MB) — easy to version in Gitea and track in MLflow +* Fully offline — no cloud dependency, no data leaves the network +* Frees Copilot quota for non-coding tasks — local model handles bulk code completion + +### Negative Consequences + +* waterdeep is dedicated to this role — cannot simultaneously serve other workloads (Blender, etc.) 
+* Model updates require manual download and conversion to MLX format +* LoRA fine-tuning quality depends on training data curation — garbage in, garbage out +* 32B model is slower than cloud Copilot for very long completions — acceptable for interactive use +* Single point of failure — if waterdeep is down, fall back to cloud Copilot + +## Pros and Cons of the Options + +### Option 1: Qwen 2.5 Coder 32B Instruct (Q8_0) via MLX + +* Good, because purpose-built for code — trained on 5.5T tokens of code data +* Good, because 32B at Q8_0 (~34 GB) fits in 48 GB with KV cache headroom +* Good, because Q8_0 preserves near-full quality (vs Q4 which drops noticeably on coding tasks) +* Good, because MLX is Apple's native framework — zero-copy unified memory, Metal GPU kernels +* Good, because mlx-lm supports LoRA fine-tuning natively — train and serve on the same machine +* Good, because OpenAI-compatible API (mlx-lm-server) — drop-in for any coding tool +* Bad, because 32B generates ~15–25 tokens/sec on M4 Pro — adequate but not instant for long outputs +* Bad, because MLX model format requires conversion from HuggingFace (one-time, scripted) + +### Option 2: Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp + +* Good, because 70B is a larger, more capable general model +* Good, because llama.cpp is mature and well-supported on macOS +* Bad, because Q4_K_M quantisation loses meaningful quality — especially on code tasks where precision matters +* Bad, because ~42 GB weights leaves only ~6 GB for KV cache — tight, risks OOM on long contexts +* Bad, because general-purpose model — not trained specifically for code, underperforms Qwen 2.5 Coder 32B on coding benchmarks despite being 2× larger +* Bad, because slower token generation (~8–12 tok/s) due to larger model size +* Bad, because llama.cpp doesn't natively support LoRA fine-tuning — need a separate training framework + +### Option 3: DeepSeek Coder V2 Lite 16B via MLX + +* Good, because smaller model — faster inference 
(~30–40 tok/s), lighter memory footprint +* Good, because still a capable coding model +* Bad, because significantly less capable than Qwen 2.5 Coder 32B on benchmarks +* Bad, because leaves 30+ GB of unified memory unused — not maximising the hardware +* Bad, because fewer parameters mean less capacity to absorb fine-tuning knowledge + +### Option 4: Cloud Copilot only + +* Good, because zero local infrastructure to maintain +* Good, because always up-to-date with latest model improvements +* Bad, because no knowledge of homelab-specific conventions — completions require heavy editing +* Bad, because cloud latency for every completion +* Bad, because data (code context) leaves the network +* Bad, because wastes waterdeep's 48 GB of unified memory sitting idle + +## Architecture + +### Inference Server + +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ waterdeep (Mac Mini M4 Pro · 48 GB unified · Metal GPU · dedicated) │ +│ │ +│ ┌────────────────────────────────────────────────────────────────────┐ │ +│ │ mlx-lm-server (launchd-managed) │ │ +│ │ │ │ +│ │ Model: Qwen2.5-Coder-32B-Instruct (Q8_0, MLX format) │ │ +│ │ LoRA: ~/.mlx-models/adapters/homelab-coder/latest/ │ │ +│ │ │ │ +│ │ Endpoint: http://waterdeep.lab.daviestechlabs.io:8080/v1 │ │ +│ │ ├── /v1/completions (code completion, FIM) │ │ +│ │ ├── /v1/chat/completions (chat / instruct) │ │ +│ │ └── /v1/models (model listing) │ │ +│ │ │ │ +│ │ Memory: ~34 GB model + ~10 GB KV cache = ~44 GB │ │ +│ └────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────┐ ┌──────────────────────────────────────┐ │ +│ │ macOS overhead ~3 GB │ │ Training (on-demand, same GPU) │ │ +│ │ (kernel, WindowServer, │ │ mlx-lm LoRA fine-tune │ │ +│ │ mDNSResponder, etc.) 
│ │ (server stopped during training) │ │ +│ └─────────────────────────┘ └──────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────────────┘ + │ + │ HTTP :8080 (OpenAI-compatible API) + │ + ┌────┴──────────────────────────────────────────────────────┐ + │ │ + ▼ ▼ +┌─────────────────────────────┐ ┌─────────────────────────────────────┐ +│ VS Code (any machine) │ │ OpenCode (terminal, any machine) │ +│ │ │ │ +│ Copilot Chat / Continue.dev │ │ OPENCODE_MODEL_PROVIDER=openai │ +│ Custom endpoint → │ │ OPENAI_API_BASE= │ +│ waterdeep:8080/v1 │ │ http://waterdeep:8080/v1 │ +└─────────────────────────────┘ └─────────────────────────────────────┘ +``` + +### Fine-Tuning Pipeline + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Fine-Tuning Pipeline (Kubeflow) │ +│ │ +│ Trigger: weekly cron or manual (after significant codebase changes) │ +│ │ +│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────────────┐ │ +│ │ 1. Clone repos│ │ 2. Build training │ │ 3. Upload dataset to │ │ +│ │ from Gitea │───▶│ dataset │───▶│ S3 │ │ +│ │ (all repos)│ │ (instruction │ │ training-data/ │ │ +│ │ │ │ pairs + FIM) │ │ code-finetune/ │ │ +│ └──────────────┘ └──────────────────┘ └──────────┬─────────────┘ │ +│ │ │ +│ ┌──────────────────────────────────────────────────────┐│ │ +│ │ 4. Trigger LoRA fine-tune on waterdeep ││ │ +│ │ (SSH or webhook → mlx-lm lora on Metal GPU) │◀ │ +│ │ │ │ +│ │ Base: Qwen2.5-Coder-32B-Instruct (MLX Q8_0) │ │ +│ │ Method: LoRA (r=16, alpha=32) │ │ +│ │ Data: instruction pairs + fill-in-middle samples │ │ +│ │ Epochs: 3–5 │ │ +│ │ Output: adapter weights (~50–200 MB) │ │ +│ └──────────────────────┬───────────────────────────────┘ │ +│ │ │ +│ ┌──────────────────────▼───────────────────────────────┐ │ +│ │ 5. 
Evaluate adapter │ │ +│ │ • HumanEval pass@1 (baseline vs fine-tuned) │ │ +│ │ • Project-specific eval (handler-base patterns, │ │ +│ │ Kubeflow pipeline templates, Flux manifests) │ │ +│ └──────────────────────┬───────────────────────────────┘ │ +│ │ │ +│ ┌──────────────────────▼───┐ ┌────────────────────────────────────────┐ │ +│ │ 6. Push adapter to Gitea │ │ 7. Log metrics to MLflow │ │ +│ │ code-lora-adapters │ │ experiment: waterdeep-coder-finetune │ │ +│ │ repo (versioned) │ │ metrics: eval_loss, humaneval, │ │ +│ └──────────────────────────┘ │ project_specific_score │ │ +│ └────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ 8. Deploy adapter on waterdeep │ │ +│ │ • Pull latest adapter from Gitea │ │ +│ │ • Restart mlx-lm-server with --adapter-path pointing to new ver │ │ +│ │ • Smoke test: send test completion requests │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Training Data Preparation + +The training dataset is built from all DaviesTechLabs repositories: + +| Source | Format | Purpose | +|--------|--------|---------| +| Go handlers (chat-handler, voice-assistant, etc.) 
| Instruction pairs | Teach handler-base API patterns, NATS message handling, protobuf encoding | +| Kubeflow pipelines (kubeflow/*.py) | Instruction pairs | Teach pipeline structure, KFP component patterns, S3 data flow | +| Ray Serve apps (ray-serve/) | Instruction pairs | Teach Ray Serve deployment, vLLM config, model serving patterns | +| Flux manifests (homelab-k8s2/) | Instruction pairs | Teach HelmRelease, Kustomization, namespace layout | +| Argo workflows (argo/*.yaml) | Instruction pairs | Teach WorkflowTemplate patterns, NATS triggers | +| ADRs (homelab-design/decisions/) | Instruction pairs | Teach architecture rationale and decision format | +| All source files | Fill-in-middle (FIM) | Teach code completion with project-specific context | + +**Instruction pair example (Go handler):** + +```json +{ + "instruction": "Create a new NATS handler module that bridges to an external gRPC service, following the handler-base pattern used in chat-handler and voice-assistant.", + "output": "package main\n\nimport (\n\t\"context\"\n\t\"os\"\n\t\"os/signal\"\n\t\"syscall\"\n\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/config\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/handler\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/health\"\n\t..." +} +``` + +**Fill-in-middle example:** + +```json +{ + "prefix": "func (h *Handler) HandleMessage(ctx context.Context, msg *messages.UserMessage) (*messages.AssistantMessage, error) {\n\t", + "suffix": "\n\treturn response, nil\n}", + "middle": "response, err := h.client.Complete(ctx, msg.Content)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"completion failed: %w\", err)\n\t}" +} +``` + +## Implementation Plan + +### 1. 
Model Setup

```bash
# Install MLX and mlx-lm via uv (per ADR-0012)
uv tool install mlx-lm

# Download and convert Qwen 2.5 Coder 32B Instruct to MLX Q8_0 format
mlx_lm.convert \
  --hf-path Qwen/Qwen2.5-Coder-32B-Instruct \
  --mlx-path ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --quantize \
  --q-bits 8

# Verify model loads and generates
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --prompt "def fibonacci(n: int) -> int:"
```

### 2. Inference Server (launchd)

```bash
# Start the server manually first to verify
mlx_lm.server \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/latest \
  --host 0.0.0.0 \
  --port 8080

# Verify OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a Go handler using handler-base that processes NATS messages"}],
    "max_tokens": 512
  }'
```

**launchd plist** (`~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.daviestechlabs.mlx-coder</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/.local/bin/mlx_lm.server</string>
    <string>--model</string>
    <string>/Users/billy/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8</string>
    <string>--adapter-path</string>
    <string>/Users/billy/.mlx-models/adapters/homelab-coder/latest</string>
    <string>--host</string>
    <string>0.0.0.0</string>
    <string>--port</string>
    <string>8080</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/billy/.mlx-models/logs/server.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/billy/.mlx-models/logs/server.err</string>
</dict>
</plist>
```

```bash
# Load the service
launchctl load ~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist

# Verify it's running
launchctl list | grep mlx-coder
curl http://waterdeep.lab.daviestechlabs.io:8080/v1/models
```

### 3. 
Client Configuration + +**OpenCode** (`~/.config/opencode/config.json` on any dev machine): + +```json +{ + "provider": "openai", + "model": "qwen2.5-coder-32b", + "baseURL": "http://waterdeep.lab.daviestechlabs.io:8080/v1" +} +``` + +**VS Code** (settings.json — Continue.dev extension): + +```json +{ + "continue.models": [ + { + "title": "waterdeep-coder", + "provider": "openai", + "model": "qwen2.5-coder-32b", + "apiBase": "http://waterdeep.lab.daviestechlabs.io:8080/v1", + "apiKey": "not-needed" + } + ] +} +``` + +### 4. Fine-Tuning on waterdeep (MLX LoRA) + +```bash +# Prepare training data (run on cluster via Kubeflow, or locally) +# Output: train.jsonl and valid.jsonl in chat/instruction format + +# Fine-tune with LoRA using mlx-lm +mlx_lm.lora \ + --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \ + --train \ + --data ~/.mlx-models/training-data/homelab-coder/ \ + --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \ + --lora-layers 16 \ + --batch-size 1 \ + --iters 1000 \ + --learning-rate 1e-5 \ + --val-batches 25 \ + --save-every 100 + +# Evaluate the adapter +mlx_lm.generate \ + --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \ + --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \ + --prompt "Create a new Go NATS handler using handler-base that..." + +# Update the 'latest' symlink +ln -sfn ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d) \ + ~/.mlx-models/adapters/homelab-coder/latest + +# Restart the server to pick up new adapter +launchctl kickstart -k gui/$(id -u)/io.daviestechlabs.mlx-coder +``` + +### 5. Training Data Pipeline (Kubeflow) + +A new `code_finetune_pipeline.py` orchestrates dataset preparation on the cluster: + +``` + code_finetune_pipeline.yaml + │ + ├── 1. clone_repos Clone all DaviesTechLabs repos from Gitea + ├── 2. extract_patterns Parse Go, Python, YAML files into instruction pairs + ├── 3. generate_fim Create fill-in-middle samples from source files + ├── 4. 
deduplicate Remove near-duplicate samples (MinHash) + ├── 5. format_dataset Convert to mlx-lm JSONL format (train + validation split) + ├── 6. upload_to_s3 Push dataset to s3://training-data/code-finetune/{run_id}/ + └── 7. log_to_mlflow Log dataset stats (num_samples, token_count, repo_coverage) +``` + +The actual LoRA fine-tune runs on waterdeep (not the cluster) because: +- mlx-lm LoRA leverages the M4 Pro's Metal GPU — significantly faster than CPU training +- The model is already loaded on waterdeep — no need to transfer 34 GB to/from the cluster +- Training a 32B model with LoRA requires ~40 GB — only waterdeep and khelben have enough memory + +### 6. Memory Budget + +| Component | Memory | +|-----------|--------| +| macOS + system services | ~3 GB | +| Qwen 2.5 Coder 32B (Q8_0 weights) | ~34 GB | +| KV cache (8192 context) | ~6 GB | +| mlx-lm-server overhead | ~1 GB | +| **Total (inference)** | **~44 GB** | +| **Headroom** | **~4 GB** | + +During LoRA fine-tuning (server stopped): + +| Component | Memory | +|-----------|--------| +| macOS + system services | ~3 GB | +| Model weights (frozen, Q8_0) | ~34 GB | +| LoRA adapter gradients + optimizer | ~4 GB | +| Training batch + activations | ~5 GB | +| **Total (training)** | **~46 GB** | +| **Headroom** | **~2 GB** | + +Both workloads fit within the 48 GB budget. Inference and training are mutually exclusive — the server is stopped during fine-tuning runs to reclaim KV cache memory for training. 
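The stop → train → swap → restart cycle described above can be wrapped in one small script. This is a sketch, not a tested tool: it reuses the paths, launchd label, and `mlx_lm.lora` flags from the sections above, while the `finetune_and_redeploy` function name and the `DRY_RUN` preview flag are illustrative inventions.

```shell
# Sketch of the stop-train-swap-restart cycle. Paths and the launchd
# label come from the sections above; DRY_RUN=1 (hypothetical) prints
# each step instead of executing it.
LABEL=io.daviestechlabs.mlx-coder
PLIST="$HOME/Library/LaunchAgents/$LABEL.plist"
MODEL="$HOME/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8"
ADAPTERS="$HOME/.mlx-models/adapters/homelab-coder"

run() {
  # Execute a step, or echo it when DRY_RUN=1
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "DRY: $*"; else "$@"; fi
}

finetune_and_redeploy() {
  run_dir="$ADAPTERS/$(date +%Y%m%d)"
  run launchctl unload "$PLIST"                      # free memory for training
  run mlx_lm.lora --model "$MODEL" --train \
      --data "$HOME/.mlx-models/training-data/homelab-coder/" \
      --adapter-path "$run_dir/" \
      --lora-layers 16 --batch-size 1 --iters 1000   # flags from section 4
  run ln -sfn "$run_dir" "$ADAPTERS/latest"          # promote new adapter
  run launchctl load "$PLIST"                        # back to serving
}

# Preview the sequence without touching the service:
DRY_RUN=1 finetune_and_redeploy
```

Invoking it without `DRY_RUN=1` performs a real cycle; keeping the dry-run preview makes it safe for the Kubeflow SSH step to verify the command sequence before a live run.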
+ +## Security Considerations + +* mlx-lm-server has no authentication — bind to LAN only; waterdeep's firewall blocks external access +* No code leaves the network — all inference and training is local +* Training data is sourced exclusively from Gitea (internal repos) — no external data contamination +* Adapter weights are versioned in Gitea — auditable lineage from training data to deployed model +* Consider adding a simple API key check via a reverse proxy (Caddy/nginx) if the LAN is not fully trusted + +## Future Considerations + +* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): If acquired, DGX Spark could fine-tune larger coding models (70B+) or run full fine-tunes instead of LoRA. waterdeep would remain the serving endpoint unless the DGX Spark also serves inference. +* **Adapter hot-swap**: mlx-lm supports loading adapters at request time — could serve multiple fine-tuned adapters (e.g., Go-specific, Python-specific, YAML-specific) from a single base model +* **RAG augmentation**: Combine the fine-tuned model with a RAG pipeline that retrieves relevant code snippets from Milvus ([ADR-0008](0008-use-milvus-for-vectors.md)) for even better context-aware completions +* **Continuous fine-tuning**: Trigger the pipeline automatically on Gitea push events via NATS — the model stays current with codebase changes +* **Evaluation suite**: Build a project-specific eval set (handler-base patterns, pipeline templates, Flux manifests) to measure fine-tuning quality beyond generic benchmarks +* **Newer models**: As new coding models are released (Qwen 3 Coder, DeepSeek Coder V3, etc.), re-evaluate which model maximises quality within the 48 GB budget + +## Links + +* Updates: [ADR-0059](0059-mac-mini-ray-worker.md) — waterdeep repurposed from 3D avatar workstation to dedicated coding agent +* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) — Training strategy (distributed CPU + DGX Spark path) +* Related: 
[ADR-0047](0047-mlflow-experiment-tracking.md) — MLflow experiment tracking +* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD +* Related: [ADR-0012](0012-use-uv-for-python-development.md) — uv for Python development +* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions (waterdeep) +* Related: [ADR-0060](0060-internal-pki-vault.md) — Internal PKI (TLS for waterdeep endpoint) +* [Qwen 2.5 Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) — Model card +* [MLX LM](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm) — Apple MLX language model framework +* [OpenCode](https://opencode.ai) — Terminal-based AI coding assistant +* [Continue.dev](https://continue.dev) — VS Code AI coding extension with custom model support