Compare commits: `fbc2ef2b3f...main` (16 commits)

Commit SHAs: 6097be571c, 6e574ffc4b, b64102d853, 4affddf9b4, e28382a765, 100ba21eba, f19fa3e969, 50b14b2a75, 32e370401f, 654b7ae774, 9fe12e0cff, defbd5b2f9, 555b70b9d9, f650b4bd22, cbd892c7c9, e57d998d9a
@@ -22,13 +22,13 @@ You are working on a **homelab Kubernetes cluster** running:

| Repo | Purpose |
|------|---------|
-| `handler-base` | Shared Python library for NATS handlers |
-| `chat-handler` | Text chat with RAG pipeline |
-| `voice-assistant` | Voice pipeline (STT → RAG → LLM → TTS) |
+| `handler-base` | Shared Go module for NATS handlers (protobuf, health, OTel, clients) |
+| `chat-handler` | Text chat with RAG pipeline (Go) |
+| `voice-assistant` | Voice pipeline: STT → RAG → LLM → TTS (Go) |
| `kuberay-images` | GPU-specific Ray worker Docker images |
-| `pipeline-bridge` | Bridge between pipelines and services |
-| `stt-module` | Speech-to-text service |
-| `tts-module` | Text-to-speech service |
+| `pipeline-bridge` | Bridge between pipelines and services (Go) |
+| `stt-module` | Speech-to-text service (Go) |
+| `tts-module` | Text-to-speech service (Go) |
| `ray-serve` | Ray Serve inference services |
| `argo` | Argo Workflows (training, batch inference) |
| `kubeflow` | Kubeflow Pipeline definitions |
@@ -48,7 +48,7 @@ You are working on a **homelab Kubernetes cluster** running:

┌─────────────────────────────────────────────────────────────────┐
│ NATS MESSAGE BUS │
│ Subjects: ai.chat.*, ai.voice.*, ai.pipeline.* │
-│ Format: MessagePack (binary) │
+│ Format: Protocol Buffers (binary, see ADR-0061) │
└───────────────────────────┬─────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
@@ -93,19 +93,23 @@ talos/

### AI/ML Services (Gitea daviestechlabs org)

```
-handler-base/            # Shared handler library
-├── handler_base/        # Core classes
-│   ├── handler.py       # Base Handler class
-│   ├── nats_client.py   # NATS wrapper
-│   └── clients/         # Service clients (STT, TTS, LLM, etc.)
+handler-base/            # Shared Go module (NATS, health, OTel, protobuf)
+├── clients/             # HTTP clients (LLM, STT, TTS, embeddings, reranker)
+├── config/              # Env-based configuration (struct tags)
+├── gen/messagespb/      # Generated protobuf stubs
+├── handler/             # Typed NATS message handler
+├── health/              # HTTP health + readiness server
+└── natsutil/            # NATS publish/request with protobuf

-chat-handler/            # RAG chat service
-├── chat_handler_v2.py   # Handler-base version
-└── Dockerfile.v2
+chat-handler/            # RAG chat service (Go)
+├── main.go
+├── main_test.go
+└── Dockerfile

-voice-assistant/         # Voice pipeline service
-├── voice_assistant_v2.py    # Handler-base version
-└── pipelines/voice_pipeline.py
+voice-assistant/         # Voice pipeline service (Go)
+├── main.go
+├── main_test.go
+└── Dockerfile

argo/                    # Argo WorkflowTemplates
├── batch-inference.yaml
```
@@ -127,8 +131,23 @@ kuberay-images/          # GPU worker images

## 🔌 Service Endpoints (Internal)

+```go
+// Copy-paste ready for Go handler services
+const (
+	NATSUrl       = "nats://nats.ai-ml.svc.cluster.local:4222"
+	VLLMUrl       = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
+	WhisperUrl    = "http://whisper-predictor.ai-ml.svc.cluster.local"
+	TTSUrl        = "http://tts-predictor.ai-ml.svc.cluster.local"
+	EmbeddingsUrl = "http://embeddings-predictor.ai-ml.svc.cluster.local"
+	RerankerUrl   = "http://reranker-predictor.ai-ml.svc.cluster.local"
+	MilvusHost    = "milvus.ai-ml.svc.cluster.local"
+	MilvusPort    = 19530
+	ValkeyUrl     = "redis://valkey.ai-ml.svc.cluster.local:6379"
+)
+```

```python
-# Copy-paste ready for Python code
+# For Python services (Ray Serve, Kubeflow pipelines, Gradio UIs)
NATS_URL = "nats://nats.ai-ml.svc.cluster.local:4222"
VLLM_URL = "http://llm-draft.ai-ml.svc.cluster.local:8000/v1"
WHISPER_URL = "http://whisper-predictor.ai-ml.svc.cluster.local"
```
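Both constant blocks above are plain literals; in practice the same defaults can be made overridable from the environment so code runs identically in and out of the cluster. A minimal Python sketch (the `*_URL` environment-variable names and the `service_url` helper are illustrative, not taken from the repos):

```python
import os

# Hypothetical helper: prefer an environment override, fall back to the
# in-cluster default. The env-var naming scheme is an assumption.
def service_url(name: str, default: str) -> str:
    return os.environ.get(f"{name}_URL", default)

vllm_url = service_url("VLLM", "http://llm-draft.ai-ml.svc.cluster.local:8000/v1")
```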
@@ -175,7 +194,7 @@ f"ai.pipeline.status.{request_id}"  # Status updates

### Add a New NATS Handler

-1. Create handler repo or add to existing (use `handler-base` library)
+1. Create Go handler repo using `handler-base` module (see [ADR-0061](decisions/0061-go-handler-refactor.md))
2. Add K8s Deployment in `homelab-k8s2/kubernetes/apps/ai-ml/`
3. Push to main → Flux deploys automatically
@@ -44,7 +44,7 @@ The homelab is a production-grade Kubernetes cluster running on bare-metal hardware

│ │ • AI_PIPELINE (24h, file) - Workflow triggers │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
-│ Message Format: MessagePack (binary, not JSON) │
+│ Message Format: Protocol Buffers (binary, see ADR-0061) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┼─────────────────────────┐
@@ -312,11 +312,12 @@ Applications ──► OpenTelemetry SDK ──► Jaeger/Tempo ──► Grafana

|----------|-----------|-----|
| Talos Linux | Immutable, API-driven, secure | [ADR-0002](decisions/0002-use-talos-linux.md) |
| NATS over Kafka | Simpler ops, sufficient throughput | [ADR-0003](decisions/0003-use-nats-for-messaging.md) |
-| MessagePack over JSON | Binary efficiency for audio | [ADR-0004](decisions/0004-use-messagepack-for-nats.md) |
+| Protocol Buffers over MessagePack | Type-safe, schema-driven, Go-native | [ADR-0061](decisions/0061-go-handler-refactor.md) |
| Multi-GPU heterogeneous | Cost optimization, workload matching | [ADR-0005](decisions/0005-multi-gpu-strategy.md) |
| GitOps with Flux | Declarative, auditable, secure | [ADR-0006](decisions/0006-gitops-with-flux.md) |
| KServe for inference | Standardized API, autoscaling | [ADR-0007](decisions/0007-use-kserve-for-inference.md) |
| KubeRay unified backend | Fractional GPU, single endpoint | [ADR-0011](decisions/0011-kuberay-unified-gpu-backend.md) |
+| Go handler refactor | Slim images, type-safe protobuf for non-ML services | [ADR-0061](decisions/0061-go-handler-refactor.md) |
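The "binary efficiency for audio" rationale behind both binary-format decisions is easy to sanity-check: JSON cannot carry raw bytes, so audio must be base64-encoded, which inflates every payload by roughly a third before any envelope overhead. A short illustration (the field name and chunk size are made up):

```python
import base64
import json
import os

audio = os.urandom(30_000)  # stand-in for a ~30 kB audio chunk
# A JSON envelope must base64-encode the bytes (4 output chars per 3 input bytes)
envelope = json.dumps({"audio": base64.b64encode(audio).decode()}).encode()
ratio = len(envelope) / len(audio)
print(f"JSON envelope is {ratio:.2f}x the raw payload size")
```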
## Related Documents
@@ -28,27 +28,29 @@ kubernetes/

### AI/ML Repos (git.daviestechlabs.io/daviestechlabs)

```
-handler-base/            # Shared library for all handlers
-├── handler_base/
-│   ├── handler.py       # Base Handler class
-│   ├── nats_client.py   # NATS wrapper
-│   ├── config.py        # Pydantic Settings
-│   ├── health.py        # K8s probes
-│   ├── telemetry.py     # OpenTelemetry
-│   └── clients/         # Service clients
-├── tests/
-└── pyproject.toml
+handler-base/            # Shared Go module for all NATS handlers
+├── clients/             # HTTP clients (LLM, STT, TTS, embeddings, reranker)
+├── config/              # Env-based configuration (struct tags)
+├── gen/messagespb/      # Generated protobuf stubs
+├── handler/             # Typed NATS message handler with OTel + health wiring
+├── health/              # HTTP health + readiness server
+├── messages/            # Type aliases from generated protobuf stubs
+├── natsutil/            # NATS publish/request with protobuf encoding
+├── proto/messages/v1/   # .proto schema source
+├── go.mod
+└── buf.yaml             # buf protobuf toolchain config

-chat-handler/            # Text chat service
-voice-assistant/         # Voice pipeline service
-pipeline-bridge/         # Workflow engine bridge
-├── {name}.py            # Handler implementation (uses handler-base)
-├── pyproject.toml       # PEP 621 project metadata (see ADR-0012)
-├── uv.lock              # Deterministic lock file
-├── tests/
-│   ├── conftest.py
-│   └── test_{name}.py
-└── Dockerfile
+chat-handler/            # Text chat service (Go)
+voice-assistant/         # Voice pipeline service (Go)
+pipeline-bridge/         # Workflow engine bridge (Go)
+stt-module/              # Speech-to-text bridge (Go)
+tts-module/              # Text-to-speech bridge (Go)
+├── main.go              # Service entry point
+├── main_test.go         # Unit tests
+├── e2e_test.go          # End-to-end tests
+├── go.mod               # Go module (depends on handler-base)
+├── Dockerfile           # Distroless container (~20 MB)
+└── renovate.json        # Dependency update config

argo/                    # Argo WorkflowTemplates
├── {workflow-name}.yaml
```
@@ -138,7 +140,20 @@ tts_task = synthesize_speech(text=llm_task.output)  # noqa: F841

### Project Structure

+```go
+// Go handler services use handler-base shared module
+import (
+	"git.daviestechlabs.io/daviestechlabs/handler-base/clients"
+	"git.daviestechlabs.io/daviestechlabs/handler-base/config"
+	"git.daviestechlabs.io/daviestechlabs/handler-base/handler"
+	"git.daviestechlabs.io/daviestechlabs/handler-base/health"
+	"git.daviestechlabs.io/daviestechlabs/handler-base/messages"
+	"git.daviestechlabs.io/daviestechlabs/handler-base/natsutil"
+)
+```

```python
+# Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs
# Use async/await for I/O
async def handle_message(msg: Msg) -> None:
    ...

@@ -149,10 +164,6 @@ class ChatRequest:

    user_id: str
    message: str
    enable_rag: bool = True

-# Use msgpack for NATS messages
-import msgpack
-data = msgpack.packb({"key": "value"})
```
### Naming
@@ -200,31 +211,36 @@ except Exception as e:

### NATS Message Handling

-```python
-import nats
-import msgpack
-
-async def message_handler(msg: Msg) -> None:
-    try:
-        # Decode MessagePack
-        data = msgpack.unpackb(msg.data, raw=False)
-
-        # Process
-        result = await process(data)
-
-        # Reply if request-reply pattern
-        if msg.reply:
-            await msg.respond(msgpack.packb(result))
-
-        # Acknowledge for JetStream
-        await msg.ack()
-
-    except Exception as e:
-        logger.error(f"Handler error: {e}")
-        # NAK for retry (JetStream)
-        await msg.nak()
-```
+All NATS handler services use Go with Protocol Buffers encoding (see [ADR-0061](decisions/0061-go-handler-refactor.md)):
+
+```go
+// Go NATS handler (production pattern)
+func (h *Handler) handleMessage(msg *nats.Msg) {
+	var req messages.ChatRequest
+	if err := proto.Unmarshal(msg.Data, &req); err != nil {
+		h.logger.Error("failed to unmarshal", "error", err)
+		return
+	}
+
+	// Process
+	result, err := h.process(ctx, &req)
+	if err != nil {
+		h.logger.Error("handler error", "error", err)
+		msg.Nak()
+		return
+	}
+
+	// Reply if request-reply pattern
+	if msg.Reply != "" {
+		data, _ := proto.Marshal(result)
+		msg.Respond(data)
+	}
+	msg.Ack()
+}
+```
+
+> **Python NATS** is still used in Ray Serve `runtime_env` and Kubeflow pipeline components where needed, but all dedicated NATS handler services are Go.
---
## Kubernetes Manifest Conventions
@@ -499,8 +515,9 @@ Each application should have a README with:

| Use `latest` image tags | Pin to specific versions |
| Skip health checks | Always define liveness/readiness |
| Ignore resource limits | Set appropriate requests/limits |
-| Use JSON for NATS messages | Use MessagePack (binary) |
-| Synchronous I/O in handlers | Use async/await |
+| Use JSON for NATS messages | Use Protocol Buffers (see ADR-0061) |
+| Write handler services in Python | Use Go with handler-base module (ADR-0061) |
+| Synchronous I/O in handlers | Use goroutines / async patterns |
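The "goroutines / async patterns" guidance can be sketched with stdlib `asyncio`; the handler and service names below are hypothetical, but the shape, fanning out slow I/O calls concurrently instead of awaiting them one at a time, is the point:

```python
import asyncio

async def call_service(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an HTTP or NATS round trip
    return f"{name}:ok"

async def handle() -> list[str]:
    # Concurrent fan-out instead of sequential awaits
    return list(await asyncio.gather(call_service("stt"), call_service("llm")))

results = asyncio.run(handle())
print(results)
```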
---
README.md
@@ -8,7 +8,7 @@

[](LICENSE)

<!-- ADR-BADGES-START -->
<!-- ADR-BADGES-END -->

## 📖 Quick Navigation
@@ -94,7 +94,7 @@ homelab-design/

| 0001 | [Record Architecture Decisions](decisions/0001-record-architecture-decisions.md) | ✅ accepted | 2025-11-30 |
| 0002 | [Use Talos Linux for Kubernetes Nodes](decisions/0002-use-talos-linux.md) | ✅ accepted | 2025-11-30 |
| 0003 | [Use NATS for AI/ML Messaging](decisions/0003-use-nats-for-messaging.md) | ✅ accepted | 2025-12-01 |
-| 0004 | [Use MessagePack for NATS Messages](decisions/0004-use-messagepack-for-nats.md) | ✅ accepted | 2025-12-01 |
+| 0004 | [Use MessagePack for NATS Messages](decisions/0004-use-messagepack-for-nats.md) | ♻️ superseded by [ADR-0061](0061-go-handler-refactor.md) (Protocol Buffers) | 2025-12-01 |
| 0005 | [Multi-GPU Heterogeneous Strategy](decisions/0005-multi-gpu-strategy.md) | ✅ accepted | 2025-12-01 |
| 0006 | [GitOps with Flux CD](decisions/0006-gitops-with-flux.md) | ✅ accepted | 2025-11-30 |
| 0007 | [Use KServe for ML Model Serving](decisions/0007-use-kserve-for-inference.md) | ♻️ superseded by [ADR-0011](0011-kuberay-unified-gpu-backend.md) | 2025-12-15 (Updated: 2026-02-02) |

@@ -109,7 +109,7 @@ homelab-design/

| 0016 | [Affine Email Verification Strategy for Authentik OIDC](decisions/0016-affine-email-verification-strategy.md) | ✅ accepted | 2026-02-04 |
| 0017 | [Secrets Management Strategy](decisions/0017-secrets-management-strategy.md) | ✅ accepted | 2026-02-04 |
| 0018 | [Security Policy Enforcement](decisions/0018-security-policy-enforcement.md) | ✅ accepted | 2026-02-04 |
-| 0019 | [Python Module Deployment Strategy](decisions/0019-handler-deployment-strategy.md) | ✅ accepted | 2026-02-02 |
+| 0019 | [Python Module Deployment Strategy](decisions/0019-handler-deployment-strategy.md) | ♻️ superseded by [ADR-0061](0061-go-handler-refactor.md) | 2026-02-02 |
| 0020 | [Internal Registry URLs for CI/CD](decisions/0020-internal-registry-for-cicd.md) | ✅ accepted | 2026-02-02 |
| 0021 | [Notification Architecture](decisions/0021-notification-architecture.md) | ✅ accepted | 2026-02-04 |
| 0022 | [ntfy-Discord Bridge Service](decisions/0022-ntfy-discord-bridge.md) | ✅ accepted | 2026-02-04 |
@@ -149,8 +149,12 @@ homelab-design/

| 0056 | [Custom Trained Voice Support in TTS Module](decisions/0056-custom-voice-support-tts-module.md) | ✅ accepted | 2026-02-13 |
| 0057 | [Per-Repository Renovate Configurations](decisions/0057-renovate-per-repo-configs.md) | ✅ accepted | 2026-02-13 |
| 0058 | [Training Strategy – Distributed CPU Now, DGX Spark Later](decisions/0058-training-strategy-cpu-dgx-spark.md) | ✅ accepted | 2026-02-14 |
-| 0059 | [Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker](decisions/0059-mac-mini-ray-worker.md) | 📝 proposed | 2026-02-16 |
+| 0059 | [Mac Mini M4 Pro (waterdeep) as Local AI Agent for 3D Avatar Creation](decisions/0059-mac-mini-ray-worker.md) | ✅ accepted | 2026-02-16 |
| 0060 | [Internal PKI with Vault and cert-manager](decisions/0060-internal-pki-vault.md) | ✅ accepted | 2026-02-16 |
+| 0061 | [Refactor NATS Handler Services from Python to Go](decisions/0061-go-handler-refactor.md) | ✅ accepted | 2026-02-19 |
+| 0062 | [BlenderMCP for 3D Avatar Creation via Kasm Workstation](decisions/0062-blender-mcp-3d-avatar-workflow.md) | ♻️ superseded by [ADR-0063](0063-comfyui-3d-avatar-pipeline.md) | 2026-02-21 |
+| 0063 | [ComfyUI Image-to-3D Avatar Pipeline with TRELLIS + UniRig](decisions/0063-comfyui-3d-avatar-pipeline.md) | 📝 proposed | 2026-02-24 |
+| 0064 | [waterdeep (Mac Mini M4 Pro) as Dedicated Coding Agent with Fine-Tuned Model](decisions/0064-waterdeep-coding-agent.md) | 📝 proposed | 2026-02-26 |
<!-- ADR-TABLE-END -->

## 🔗 Related Repositories
@@ -188,4 +192,4 @@ The former monolithic `llm-workflows` repo has been archived and decomposed into

---

-*Last updated: 2026-02-17*
+*Last updated: 2026-02-26*
@@ -117,9 +117,14 @@ All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation

| Application | Language | Framework | Purpose |
|-------------|----------|-----------|---------|
-| Companions | Go | net/http + HTMX | AI chat interface |
-| Voice WebApp | Python | Gradio | Voice assistant UI |
-| Various handlers | Python | asyncio + nats.py | NATS event handlers |
+| Companions | Go | net/http + HTMX | AI chat interface (SSR) |
+| Chat Handler | Go | handler-base | RAG + LLM text pipeline |
+| Voice Assistant | Go | handler-base | STT → RAG → LLM → TTS pipeline |
+| Pipeline Bridge | Go | handler-base | Kubeflow/Argo workflow triggers |
+| STT Module | Go | handler-base | Speech-to-text bridge |
+| TTS Module | Go | handler-base | Text-to-speech bridge |
+| Voice WebApp | Python | Gradio | Voice assistant UI (dev/testing) |
+| Ray Serve | Python | Ray Serve | GPU inference endpoints |

### Frontend
@@ -242,27 +247,41 @@ All AI inference runs on a unified Ray Serve endpoint with fractional GPU allocation

---

-## Python Dependencies (handler-base)
+## Go Dependencies (handler-base)

-Core library for all NATS handlers: [handler-base](https://git.daviestechlabs.io/daviestechlabs/handler-base)
+Shared Go module for all NATS handler services: [handler-base](https://git.daviestechlabs.io/daviestechlabs/handler-base)

+```go
+// go.mod (handler-base v1.0.0)
+require (
+	github.com/nats-io/nats.go          // NATS client
+	google.golang.org/protobuf          // Protocol Buffers encoding
+	github.com/zitadel/oidc/v3          // OIDC client
+	go.opentelemetry.io/otel            // OpenTelemetry traces + metrics
+	github.com/milvus-io/milvus-sdk-go  // Milvus vector search
+)
+```
+
+See [ADR-0061](decisions/0061-go-handler-refactor.md) for the full refactoring rationale.
+
+## Python Dependencies (ML/AI only)
+
+Python is retained for ML inference, pipeline orchestration, and dev tools:

```toml
-# Core
-nats-py>=2.7.0             # NATS client
-msgpack>=1.0.0             # Binary serialization
-httpx>=0.27.0              # HTTP client
+# ray-serve (GPU inference)
+ray[serve]>=2.53.0
+vllm>=0.8.0
+faster-whisper>=1.0.0
+TTS>=0.22.0
+sentence-transformers>=3.0.0

-# ML/AI
-pymilvus>=2.4.0            # Milvus client
-openai>=1.0.0              # vLLM OpenAI API
+# kubeflow (pipeline definitions)
+kfp>=2.12.1

-# Observability
-opentelemetry-api>=1.20.0
-opentelemetry-sdk>=1.20.0
-mlflow>=2.10.0             # Experiment tracking

-# Kubeflow (kubeflow repo)
-kfp>=2.12.1                # Pipeline SDK
+# mlflow (experiment tracking)
+mlflow>=3.7.0
+pymilvus>=2.4.0
```
---
@@ -1,6 +1,6 @@

# Use MessagePack for NATS Messages

-* Status: accepted
+* Status: superseded by [ADR-0061](0061-go-handler-refactor.md) (Protocol Buffers)
* Date: 2025-12-01
* Deciders: Billy Davies
* Technical Story: Selecting serialization format for NATS messages
@@ -1,10 +1,12 @@

# Python Module Deployment Strategy

-* Status: accepted
+* Status: superseded by [ADR-0061](0061-go-handler-refactor.md)
* Date: 2026-02-02
* Deciders: Billy
* Technical Story: Define how Python handler modules are packaged and deployed to Kubernetes

+> **Note (2026-02-23):** This ADR described deploying Python handlers as Ray Serve applications inside the Ray cluster. [ADR-0061](0061-go-handler-refactor.md) supersedes this approach — all five handler services (chat-handler, voice-assistant, pipeline-bridge, tts-module, stt-module) have been rewritten in Go and now deploy as standalone Kubernetes Deployments with distroless container images (~20 MB each). The Ray cluster is exclusively used for GPU inference workloads. The handler-base shared library is now a Go module published at `git.daviestechlabs.io/daviestechlabs/handler-base` using Protocol Buffers for NATS message encoding.
+
## Context

We have Python modules for AI/ML workflows that need to run on our unified GPU cluster:
@@ -14,7 +14,7 @@ How do we build a performant, maintainable frontend that integrates with the NATS

## Decision Drivers

* Real-time streaming for chat and voice (WebSocket required)
-* Direct integration with NATS JetStream (binary MessagePack protocol)
+* Direct integration with NATS JetStream (Protocol Buffers encoding, see [ADR-0061](0061-go-handler-refactor.md))
* Minimal client-side JavaScript (~20KB gzipped target)
* No frontend build step (no webpack/vite/node required)
* 3D avatar rendering for immersive experience
@@ -39,8 +39,9 @@ Chosen option: **Option 1 - Go + HTMX + Alpine.js + Three.js**, because it provides

* No npm, no webpack, no build step — assets served directly
* Server-side rendering via Go templates
* WebSocket handled natively in Go (gorilla/websocket)
-* NATS integration with MessagePack in the same binary
+* NATS integration with Protocol Buffers in the same binary
* Distroless container image for minimal attack surface
+* Type-safe NATS messages via handler-base shared Go module (protobuf stubs)
### Negative Consequences
@@ -58,8 +59,9 @@ Chosen option: **Option 1 - Go + HTMX + Alpine.js + Three.js**, because it provides

| Client state | Alpine.js 3 | Lightweight reactive UI for local state |
| 3D Avatars | Three.js + VRM | 3D character rendering with lip-sync |
| Styling | Tailwind CSS 4 + DaisyUI | Utility-first CSS with component library |
-| Messaging | NATS JetStream | Real-time pub/sub with MessagePack encoding |
+| Messaging | NATS JetStream | Real-time pub/sub with Protocol Buffers encoding |
| Auth | golang-jwt/jwt/v5 | JWT token handling for OAuth flows |
+| Shared lib | handler-base (Go module) | NATS client, protobuf messages, health, OTel, HTTP clients |
| Database | PostgreSQL (lib/pq) + SQLite | Persistent + local session storage |
| Observability | OpenTelemetry SDK | Traces, metrics via OTLP gRPC |
@@ -88,7 +90,7 @@ Chosen option: **Option 1 - Go + HTMX + Alpine.js + Three.js**, because it provides

│ ┌─────────┴─────────┐ │
│ │ NATS Client │ │
│ │ (JetStream + │ │
-│ │ MessagePack) │ │
+│ │ Protobuf) │ │
│ └─────────┬─────────┘ │
└────────────────────────┼────────────────────────────────────────┘
│
@@ -130,8 +132,9 @@ Chosen option: **Option 1 - Go + HTMX + Alpine.js + Three.js**, because it provides

## Links

* Related to [ADR-0003](0003-use-nats-for-messaging.md) (NATS messaging)
-* Related to [ADR-0004](0004-use-messagepack-for-nats.md) (MessagePack encoding)
+* Related to [ADR-0004](0004-use-messagepack-for-nats.md) (MessagePack encoding — superseded by Protocol Buffers, see [ADR-0061](0061-go-handler-refactor.md))
* Related to [ADR-0011](0011-kuberay-unified-gpu-backend.md) (Ray Serve backend)
* Related to [ADR-0028](0028-authentik-sso-strategy.md) (OAuth/OIDC)
+* Related to [ADR-0061](0061-go-handler-refactor.md) (Go handler refactor — handler-base shared module, protobuf wire format)
* [HTMX Documentation](https://htmx.org/docs/)
* [VRM Specification](https://vrm.dev/en/)
@@ -1,338 +1,287 @@

-# Add Mac Mini M4 Pro (waterdeep) to Ray Cluster as External Worker
+# Mac Mini M4 Pro (waterdeep) as Local AI Agent for 3D Avatar Creation

-* Status: proposed
+* Status: accepted
* Date: 2026-02-16
+* Updated: 2026-02-23
* Deciders: Billy
-* Technical Story: Expand Ray cluster with Apple Silicon compute for inference and training
+* Technical Story: Use waterdeep as a dedicated local AI workstation for BlenderMCP-driven 3D avatar creation, replacing the previously proposed Ray worker role

## Context and Problem Statement

-The homelab Ray cluster currently runs entirely within Kubernetes, with GPU workers pinned to specific nodes:
-
-| Node | GPU | Memory | Workload |
-|------|-----|--------|----------|
-| khelben | Strix Halo (ROCm) | 128 GB unified | vLLM 70B (0.95 GPU) |
-| elminster | RTX 2070 (CUDA) | 8 GB VRAM | Whisper (0.5) + TTS (0.5) |
-| drizzt | Radeon 680M (ROCm) | 12 GB VRAM | Embeddings (0.8) |
-| danilo | Intel Arc (i915) | ~6 GB shared | Reranker (0.8) |
+**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see [ADR-0037](0037-node-naming-conventions.md)). The original proposal was to add it to the Ray cluster as an external inference/training worker, but:
+
+- All Ray inference slots are already allocated and stable — adding a 5th GPU class (MPS) increases complexity without filling a gap
+- vLLM's MPS backend remains experimental — not production-ready for serving
+- The real unmet need is **3D avatar creation** for companions-frontend ([ADR-0063](0063-comfyui-3d-avatar-pipeline.md))

-All GPUs are fully allocated to inference (see [ADR-0005](0005-multi-gpu-strategy.md), [ADR-0011](0011-kuberay-unified-gpu-backend.md)). Training is currently CPU-only and distributed across cluster nodes via Ray Train ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)).
+[ADR-0063](0063-comfyui-3d-avatar-pipeline.md) describes an automated ComfyUI + TRELLIS + UniRig pipeline for image-to-VRM avatar generation, running on a personal desktop as an on-demand Ray worker. This supersedes the manual BlenderMCP Kasm workflow from [ADR-0062](0062-blender-mcp-3d-avatar-workflow.md). waterdeep retains its role as an interactive Blender workstation for manual refinement of auto-generated models.

-**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory that currently serves as a development workstation (see [ADR-0037](0037-node-naming-conventions.md)). Its Apple Silicon GPU (MPS backend) and unified memory architecture make it a strong candidate for both inference and training workloads — but macOS cannot run Talos Linux or easily join the Kubernetes cluster as a native node.
+waterdeep's M4 Pro has a 16-core GPU with hardware-accelerated Metal rendering and 48 GB of unified memory shared between CPU and GPU. Running Blender natively on waterdeep with BlenderMCP gives a dramatically better 3D creation experience than Kasm.

-How do we integrate waterdeep's compute into the Ray cluster without disrupting the existing Kubernetes-managed infrastructure?
+How should we use waterdeep to maximise the 3D avatar creation pipeline for companions-frontend?

## Decision Drivers

-* 48 GB unified memory is sufficient for medium-large models (e.g., 7B–30B at Q4/Q8 quantisation)
-* Apple Silicon MPS backend is supported by PyTorch and vLLM (experimental)
-* macOS cannot run Talos Linux — must integrate without Kubernetes
-* Ray natively supports heterogeneous clusters with external workers
-* Must not impact existing inference serving stability
-* Training workloads ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) would benefit from a GPU-accelerated worker
-* ARM64 architecture requires compatible Python packages and model formats
+* Blender on Kasm is CPU-rendered inside DinD — no Metal/Vulkan/CUDA GPU access, poor viewport performance
+* waterdeep has a 16-core Apple GPU with Metal support — Blender's Metal backend enables real-time viewport rendering, Cycles GPU rendering, and smooth sculpting
+* 48 GB unified memory means Blender, VS Code, and the MCP server can all run simultaneously without swapping
+* VS Code with Copilot agent mode and BlenderMCP server are installed on waterdeep — VS Code drives Blender via localhost:9876 with zero-latency socket communication
+* Exported VRM models must reach gravenhollow for production serving ([ADR-0063](0063-comfyui-3d-avatar-pipeline.md))
+* **rclone** chosen for asset promotion to gravenhollow's RustFS S3 endpoint — simpler than NFS mounts on macOS, consistent with existing Kasm rclone patterns, and avoids autofs/NFS fstab complexity
+* The automated ComfyUI pipeline from [ADR-0063](0063-comfyui-3d-avatar-pipeline.md) handles most avatar generation; waterdeep serves as the manual refinement station
+* The Ray cluster GPU fleet is fully allocated and stable — adding MPS complexity is not justified
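The original proposal's "7B–30B at Q4/Q8" sizing from the drivers above can be checked with back-of-envelope arithmetic: quantised weights take roughly bits/8 bytes per parameter, ignoring KV cache and runtime overhead. A rough sketch, not a capacity guarantee:

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB, excluding KV cache and overhead."""
    # params_billions * 1e9 params * (bits/8) bytes, divided by 1e9 bytes/GB
    return params_billions * bits_per_weight / 8

for params, bits in [(7, 4), (30, 4), (30, 8)]:
    print(f"{params}B @ Q{bits}: ~{quantized_weight_gb(params, bits):.1f} GB")
```

Even a 30B model at Q8 sits around 30 GB of weights, comfortably inside 48 GB of unified memory before cache and OS overhead.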
## Considered Options
|
||||
|
||||
1. **External Ray worker on macOS** — run a Ray worker process natively on waterdeep that connects to the cluster Ray head over the network
|
||||
2. **Linux VM on Mac** — run UTM/Parallels VM with Linux, join as a Kubernetes node
|
||||
3. **K3s agent on macOS** — run K3s directly on macOS via Docker Desktop
|
||||
1. **Local AI agent on waterdeep** — Blender + BlenderMCP + VS Code natively on macOS, promoting assets to gravenhollow via rclone (S3)
|
||||
2. **External Ray worker on macOS** (original proposal) — join the Ray cluster for inference and training
|
||||
3. **Keep Kasm-only workflow** — rely entirely on the browser-based Kasm Blender workstation from ADR-0062
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: **Option 1 — External Ray worker on macOS**, because Ray natively supports heterogeneous workers joining over the network. This avoids the complexity of running Kubernetes on macOS, lets waterdeep remain a development workstation, and leverages Apple Silicon MPS acceleration transparently through PyTorch.
|
||||
Chosen option: **Option 1 — Local AI agent on waterdeep**, because the Mac Mini's Metal GPU makes it dramatically better for 3D work than CPU-rendered Kasm, the Ray cluster doesn't need another worker, and the local workflow eliminates network latency between VS Code, the MCP server, and Blender.

### Positive Consequences

* Metal GPU acceleration — real-time Eevee viewport, GPU-accelerated Cycles rendering, smooth 60 fps sculpting
* Zero-latency MCP — the BlenderMCP socket (localhost:9876) has no network hop, so commands execute instantly
* 48 GB unified memory — large Blender scenes and multiple VRM models open simultaneously, no swap pressure
* VS Code + Copilot agent mode + BlenderMCP server installed natively — a single editor drives both code and Blender commands
* rclone for asset promotion — consistent with the Kasm rclone patterns, avoiding macOS NFS/autofs complexity
* waterdeep remains a dev workstation — avatar creation is a creative dev workflow, not a server workload
* Kasm Blender remains available as a browser-based fallback for remote/mobile access
* Simpler than the Ray worker approach — no cluster integration, no GCS port exposure, no experimental MPS backend

### Negative Consequences

* Blender, VS Code, and add-ons must be installed and maintained locally on waterdeep via Homebrew
* Assets created locally need an explicit `rclone copy` to promote them to gravenhollow (vs Kasm's automatic rclone to Quobyte S3)
* waterdeep is a single machine — no redundancy for the 3D creation workflow
* Not managed by Kubernetes or GitOps — relies on Homebrew-managed tooling
## Pros and Cons of the Options

### Option 1: Local AI agent on waterdeep

* Good, because Metal GPU acceleration makes Blender usable for real 3D work (sculpting, rendering, material preview)
* Good, because the localhost MCP socket eliminates all network latency
* Good, because 48 GB unified memory supports complex scenes without swapping
* Good, because there are no experimental backends (MPS/vLLM) — it uses Blender's mature Metal renderer
* Good, because waterdeep stays a dev workstation, aligning with its named role
* Bad, because it is local-only — no browser-based remote access (use Kasm for that)
* Bad, because it requires manual tool installation (Blender, VRM add-on, BlenderMCP, VS Code)
* Bad, because asset promotion to gravenhollow requires an explicit rclone command

### Option 2: External Ray worker on macOS (original proposal)

* Good, because it adds GPU compute to the Ray cluster
* Good, because training jobs gain MPS acceleration
* Bad, because the vLLM MPS backend is experimental — not production-ready
* Bad, because it adds a 5th GPU class (MPS) to an already complex fleet
* Bad, because Ray GCS port exposure adds security surface
* Bad, because it doesn't address the actual unmet need (3D avatar creation)
* Bad, because waterdeep becomes a server, degrading its dev-workstation role

### Option 3: Kasm-only workflow

* Good, because it is browser-based — usable from any device
* Good, because no local installation is required
* Bad, because CPU-rendered Blender inside DinD gives poor viewport performance
* Bad, because of network latency between VS Code and the Blender socket
* Bad, because memory inside the Kasm container is limited
* Bad, because there is no GPU acceleration for rendering or sculpting
## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│  waterdeep (Mac Mini M4 Pro · 48 GB unified · Metal GPU)                │
│                                                                         │
│   ┌──────────────────────────────────────────────────────┐              │
│   │  VS Code + GitHub Copilot (agent mode)               │              │
│   │                                                      │              │
│   │  BlenderMCP Server (uvx blender-mcp)                 │              │
│   │  DISABLE_TELEMETRY=true                              │              │
│   │        │                                             │              │
│   │        │ TCP localhost:9876 (zero latency)           │              │
│   │        ▼                                             │              │
│   └────────┬─────────────────────────────────────────────┘              │
│            │                                                            │
│   ┌────────▼─────────────────────────────────────────────┐              │
│   │  Blender 4.x (native macOS)                          │              │
│   │                                                      │              │
│   │  Renderer: Metal (Eevee real-time + Cycles GPU)      │              │
│   │  Add-ons:                                            │              │
│   │   • BlenderMCP (addon.py) — socket server :9876      │              │
│   │   • VRM Add-on for Blender — import/export VRM       │              │
│   │                                                      │              │
│   │  Working files: ~/blender-avatars/                   │              │
│   │   ├── projects/  (.blend source files)               │              │
│   │   ├── exports/   (.vrm exported models)              │              │
│   │   └── textures/  (shared texture library)            │              │
│   └────────┬─────────────────────────────────────────────┘              │
│            │                                                            │
│            │ rclone (S3 asset promotion)                                │
│            │ gravenhollow RustFS :30292                                 │
└────────────┼────────────────────────────────────────────────────────────┘
             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  gravenhollow.lab.daviestechlabs.io                                     │
│  (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)                       │
│                                                                         │
│  NFS: /mnt/gravenhollow/kubernetes/avatar-models/                       │
│   ├── Seed-san.vrm     (default model)                                  │
│   ├── Companion-A.vrm  (promoted from waterdeep)                        │
│   └── animations/      (shared animation clips)                         │
│                                                                         │
│  S3 (RustFS): avatar-models bucket                                      │
│  (same data, served via Cloudflare Tunnel for remote users)             │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
              ┌────────────┴───────────────┐
              │                            │
      NFS (nfs-fast PVC)           Cloudflare Tunnel
              │                 (assets.daviestechlabs.io)
              ▼                            │
┌──────────────────────────┐               ▼
│  companions-frontend     │   ┌──────────────────────────┐
│  (Kubernetes pod)        │   │ Remote users (CDN-cached │
│  LAN users               │   │ via Cloudflare edge)     │
└──────────────────────────┘   └──────────────────────────┘
```

## Implementation Plan

### 1. Install Blender and Add-ons

```bash
# Install Blender via Homebrew
brew install --cask blender

# Download BlenderMCP add-on
curl -LO https://raw.githubusercontent.com/ahujasid/blender-mcp/main/addon.py

# Install in Blender:
#   Edit > Preferences > Add-ons > Install... > select addon.py
#   Enable "Interface: Blender MCP"

# Install VRM Add-on for Blender:
#   Download from https://vrm-addon-for-blender.info/en/
#   Edit > Preferences > Add-ons > Install... > select VRM add-on zip
#   Enable "Import-Export: VRM"
```

### 2. VS Code MCP Configuration

```json
// .vscode/mcp.json (in companions-frontend or global settings)
{
  "servers": {
    "blender": {
      "command": "uvx",
      "args": ["blender-mcp"],
      "env": {
        "BLENDER_HOST": "localhost",
        "BLENDER_PORT": "9876",
        "DISABLE_TELEMETRY": "true"
      }
    }
  }
}
```

### 3. Python Environment for BlenderMCP

```bash
# Install uv (per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh

# uvx handles the BlenderMCP server environment automatically
# Verify it works:
uvx blender-mcp --help
```

### 4. rclone for Asset Promotion

Use rclone to promote finished VRM exports to gravenhollow's RustFS S3 endpoint. This is consistent with the promotion pattern from [ADR-0063](0063-comfyui-3d-avatar-pipeline.md) and avoids macOS NFS/autofs complexity.

```bash
# Install rclone
brew install rclone

# Configure gravenhollow RustFS endpoint
rclone config create gravenhollow s3 \
  provider=Other \
  endpoint=https://gravenhollow.lab.daviestechlabs.io:30292 \
  access_key_id=<key> \
  secret_access_key=<secret>

# Promote a finished VRM
rclone copy ~/blender-avatars/exports/Companion-A.vrm gravenhollow:avatar-models/

# Sync all exports (idempotent)
rclone sync ~/blender-avatars/exports/ gravenhollow:avatar-models/ --exclude "*.blend"
```

> **Why rclone over NFS?** macOS autofs/NFS mounts are fragile across reboots and network changes. rclone is a single binary, works over HTTPS, and matches the promotion pattern already used in Kasm workflows. The explicit `rclone copy` command also serves as a deliberate promotion gate — only intentionally promoted models reach production.

### 5. Avatar Creation Workflow (waterdeep)

1. **Open Blender** on waterdeep (native, Metal-accelerated)
2. **Enable BlenderMCP** — 3D View sidebar → "BlenderMCP" tab → click "Connect"
3. **Open VS Code** with Copilot agent mode — the BlenderMCP server starts automatically
4. **Create avatars** using AI-assisted prompts:
   - _"Create an anime-style character with silver hair and a mage outfit"_
   - _"Apply metallic blue material to the staff"_
   - _"Rig this character for VRM export with standard humanoid bones"_
   - _"Export as VRM to ~/blender-avatars/exports/Silver-Mage.vrm"_
5. **Preview** in real time — the Metal GPU renders the Eevee viewport at 60 fps
6. **Promote** the finished VRM to gravenhollow via rclone:
   ```bash
   rclone copy ~/blender-avatars/exports/Silver-Mage-v1.vrm gravenhollow:avatar-models/
   ```
7. **Register** in companions-frontend — update `AllowedAvatarModels` in the Go and JS allowlists, then commit

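The registration step amounts to adding the new filename to an allowlist so the frontend refuses unreviewed models. A minimal Go sketch of that idea — the map-based shape and names here are assumptions, not the actual `AllowedAvatarModels` definition in companions-frontend:

```go
package main

import "fmt"

// Hypothetical allowlist sketch; the real companions-frontend definition
// may differ. Registering a new avatar means adding its filename here
// (and mirroring it in the JS allowlist).
var AllowedAvatarModels = map[string]bool{
	"Seed-san.vrm":       true,
	"Companion-A.vrm":    true,
	"Silver-Mage-v1.vrm": true, // newly promoted from waterdeep
}

// isAllowed rejects any model name not explicitly promoted and registered.
func isAllowed(name string) bool {
	return AllowedAvatarModels[name]
}

func main() {
	fmt.Println(isAllowed("Silver-Mage-v1.vrm")) // true
	fmt.Println(isAllowed("Unreviewed.vrm"))     // false
}
```

Because the map lookup defaults to `false` for missing keys, an unpromoted model is rejected without any extra error handling.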
### 6. Workflow Comparison: waterdeep vs Kasm

| Aspect | waterdeep (local) | Kasm (browser) |
|--------|-------------------|----------------|
| **GPU rendering** | Metal 16-core GPU — Eevee real-time, Cycles GPU | CPU-only software rendering |
| **Viewport FPS** | 60fps (Metal) | 5–15fps (CPU rasterisation) |
| **MCP latency** | localhost socket — sub-millisecond | Network hop to Kasm container |
| **Memory** | 48 GB unified, shared with GPU | Limited by Kasm container allocation |
| **Sculpting** | Smooth, hardware-accelerated | Laggy, CPU-bound |
| **Asset promotion** | rclone to gravenhollow RustFS S3 | Auto rclone to Quobyte S3 → manual promote to gravenhollow |
| **Access** | Local only (waterdeep physical/VNC) | Any browser, anywhere |
| **Setup** | Homebrew + manual add-on install | Pre-baked in Kasm image |
| **Use when** | Primary creation workflow | Remote access, quick edits, mobile |

## Security Considerations

* BlenderMCP's `execute_blender_code` runs arbitrary Python in Blender — review AI-generated code before execution, especially file I/O operations
* Telemetry is disabled via `DISABLE_TELEMETRY=true` in the MCP server config
* The BlenderMCP socket (port 9876) is bound to localhost — not exposed to the network
* NFS traffic to gravenhollow traverses the LAN — no sensitive data in VRM files
* waterdeep has no cluster access — a compromise doesn't impact Kubernetes workloads
* `.blend` source files stay local on waterdeep; only finished VRM exports are promoted to gravenhollow

## Future Considerations

* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): When acquired, DGX Spark handles training; waterdeep remains the 3D creation workstation
* **Blender + MLX**: Apple's MLX framework could power local AI-generated textures or mesh deformation directly in Blender — worth evaluating as Blender add-ons mature
* **Automated promotion**: A file watcher (fswatch/launchd) could auto-run `rclone sync` when a new VRM appears in `~/blender-avatars/exports/`
* **VRM validation**: Add a pre-promotion check script that validates VRM humanoid rig completeness, expression morphs, and viseme shapes before copying to gravenhollow

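A pre-promotion VRM check could start along these lines. This is a sketch, not the eventual script: VRM files are GLB containers whose first chunk is glTF JSON, and the sketch assumes the VRM 0.x extension layout (`extensions.VRM.humanoid.humanBones` as an array) and checks only a small subset of the required bones:

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// glbJSON extracts the JSON chunk from a GLB container (VRM files are GLB).
// Layout: 12-byte header ("glTF", version, total length), then chunks of
// (length uint32, type uint32, payload). The first chunk must be "JSON".
func glbJSON(data []byte) ([]byte, error) {
	if len(data) < 20 || string(data[0:4]) != "glTF" {
		return nil, fmt.Errorf("not a GLB container")
	}
	n := binary.LittleEndian.Uint32(data[12:16])
	if string(data[16:20]) != "JSON" {
		return nil, fmt.Errorf("first chunk is not JSON")
	}
	if 20+int(n) > len(data) {
		return nil, fmt.Errorf("truncated JSON chunk")
	}
	return data[20 : 20+int(n)], nil
}

// checkHumanoid verifies that required humanoid bones are present, assuming
// the VRM 0.x layout; checking expression morphs and visemes would follow
// the same pattern against other extension fields.
func checkHumanoid(doc []byte) error {
	var g struct {
		Extensions struct {
			VRM struct {
				Humanoid struct {
					HumanBones []struct {
						Bone string `json:"bone"`
					} `json:"humanBones"`
				} `json:"humanoid"`
			} `json:"VRM"`
		} `json:"extensions"`
	}
	if err := json.Unmarshal(doc, &g); err != nil {
		return err
	}
	have := map[string]bool{}
	for _, b := range g.Extensions.VRM.Humanoid.HumanBones {
		have[b.Bone] = true
	}
	for _, req := range []string{"hips", "spine", "head"} { // subset of the full required set
		if !have[req] {
			return fmt.Errorf("missing required bone: %s", req)
		}
	}
	return nil
}

func main() {
	// Synthetic GLB with a minimal VRM 0.x JSON chunk, for illustration only.
	doc := []byte(`{"extensions":{"VRM":{"humanoid":{"humanBones":[` +
		`{"bone":"hips"},{"bone":"spine"},{"bone":"head"}]}}}}`)
	glb := append([]byte("glTF"), make([]byte, 8)...)
	binary.LittleEndian.PutUint32(glb[4:8], 2)                        // container version
	binary.LittleEndian.PutUint32(glb[8:12], uint32(12+8+len(doc)))   // total length
	glb = binary.LittleEndian.AppendUint32(glb, uint32(len(doc)))     // chunk length
	glb = append(glb, []byte("JSON")...)
	glb = append(glb, doc...)

	j, err := glbJSON(glb)
	if err != nil {
		panic(err)
	}
	fmt.Println(checkHumanoid(j)) // nil error for a complete rig
}
```

Running this as a gate before `rclone copy` would stop incomplete rigs from ever reaching the `avatar-models` bucket.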
## Links

* Related: [ADR-0063](0063-comfyui-3d-avatar-pipeline.md) — ComfyUI image-to-3D avatar pipeline (supersedes ADR-0062)
* Related: [ADR-0062](0062-blender-mcp-3d-avatar-workflow.md) — BlenderMCP 3D avatar workflow (superseded)
* Related: [ADR-0046](0046-companions-frontend-architecture.md) — Companions frontend architecture (Three.js + VRM avatars)
* Related: [ADR-0026](0026-storage-strategy.md) — Storage strategy (gravenhollow NFS-fast)
* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions (waterdeep)
* Related: [ADR-0012](0012-use-uv-for-python-development.md) — uv for Python development
* [BlenderMCP GitHub](https://github.com/ahujasid/blender-mcp)
* [Blender Metal GPU Rendering](https://docs.blender.org/manual/en/latest/render/cycles/gpu_rendering.html)
* [VRM Add-on for Blender](https://vrm-addon-for-blender.info/en/)
* [@pixiv/three-vrm](https://github.com/pixiv/three-vrm)

# Refactor NATS Handler Services from Python to Go

* Status: accepted
* Date: 2026-02-19
* Decided: 2026-02-21
* Deciders: Billy
* Technical Story: Reduce container image sizes and resource consumption for non-ML handler services by rewriting them in Go

## Context and Problem Statement

The AI pipeline's non-inference services — `chat-handler`, `voice-assistant`, `pipeline-bridge`, `tts-module`, and the HTTP-forwarding variant of `stt-module` — are Python applications built on the `handler-base` shared library. None of these services perform local ML inference; they orchestrate calls to external Ray Serve endpoints over HTTP and route messages via NATS with MessagePack encoding.

> **Implementation note (2026-02-21):** During the Go rewrite, the wire format was upgraded from MessagePack to **Protocol Buffers** (see [ADR-0004 superseded](0004-use-messagepack-for-nats.md)). The shared Go module is published as `handler-base` v1.0.0 (not `handler-go` as originally proposed).

Despite doing only lightweight I/O orchestration, each service inherits the full Python runtime and its dependency tree through `handler-base` (which pulls in `numpy`, `pymilvus`, `redis`, `httpx`, `pydantic`, `opentelemetry-*`, `mlflow`, and `psycopg2-binary`). This results in container images of **500–700 MB each** — five services totalling **~3 GB** of registry storage — for workloads that are fundamentally HTTP/NATS glue code.

The homelab already has two production Go services (`companions-frontend` and `ntfy-discord`) that prove the NATS + MessagePack + OpenTelemetry pattern works well in Go with images under 30 MB.

How do we reduce the image footprint and resource consumption of the non-ML handler services without disrupting the ML inference layer?

## Decision Drivers

* Container images for glue services are 500–700 MB despite doing no ML work
* Go produces static binaries yielding images of ~15–30 MB (scratch/distroless base)
* Go services start in milliseconds vs. seconds for Python, improving pod scheduling
* Go's memory footprint is ~10× lower for equivalent I/O-bound workloads
* The NATS + msgpack + OTel pattern is already proven in `companions-frontend`
* Go has first-class Kubernetes client support (`client-go`) — relevant for `pipeline-bridge`
* ML inference services (Ray Serve, kuberay-images) must remain Python — only orchestration moves
* Five services share a common base (`handler-base`) — a single Go module replaces it for all

## Considered Options

1. **Rewrite handler services in Go with a shared Go module**
2. **Optimise Python images (multi-stage builds, slim deps, compiled wheels)**
3. **Keep current Python stack unchanged**

## Decision Outcome

Chosen option: **Option 1 — Rewrite handler services in Go**, because the services are pure I/O orchestration with no ML dependencies, the Go pattern is already proven in-cluster, and the image and resource savings are an order-of-magnitude improvement that Python optimisation cannot match.

### Positive Consequences

* Five container images shrink from ~3 GB total to ~100–150 MB total
* Sub-second cold start enables faster rollouts and autoscaling via KEDA
* Lower memory footprint frees cluster resources for ML workloads
* Eliminates Python runtime CVE surface area from non-ML services
* A single Go module (`handler-go`, later published as `handler-base`) provides the shared NATS, health, OTel, and client code
* `pipeline-bridge` gains `client-go` — the canonical Kubernetes client library
* Go's type system catches message schema drift at compile time

### Negative Consequences

* One-time rewrite effort across five services
* Team must maintain Go **and** Python codebases (Python remains for Ray Serve, Kubeflow pipelines, Gradio UIs)
* The Go module needs feature parity with the Python `handler-base` for the orchestration subset (NATS client, health server, OTel, HTTP clients, Milvus client)
* Audio handling in `stt-module` (VAD) requires a Go webrtcvad binding or equivalent

## Pros and Cons of the Options

### Option 1 — Rewrite in Go

* Good, because images shrink from ~600 MB → ~20 MB per service
* Good, because memory usage drops from ~150 MB → ~15 MB per service
* Good, because startup time drops from ~3 s → <100 ms
* Good, because Go has mature libraries for every dependency (nats.go, client-go, otel-go, milvus-sdk-go)
* Good, because two existing Go services in the cluster prove the pattern
* Bad, because of the one-time engineering effort to rewrite five services
* Bad, because two language ecosystems must be maintained

### Option 2 — Optimise Python images

* Good, because no rewrite is needed
* Good, because multi-stage builds and dependency trimming can reduce images by 30–50%
* Bad, because the Python runtime + interpreter overhead remains (~200 MB floor)
* Bad, because memory and startup improvements are marginal
* Bad, because the `handler-base` dependency tree is difficult to slim without breaking shared code

### Option 3 — Keep current stack

* Good, because zero effort
* Bad, because images remain 500–700 MB for glue code
* Bad, because resource waste reduces headroom for ML workloads
* Bad, because slow cold starts limit KEDA autoscaling effectiveness

## Implementation Plan

### Phase 1: `handler-base` Go Module (COMPLETE)

Published as `git.daviestechlabs.io/daviestechlabs/handler-base` v1.0.0 with:

| Package | Purpose | Python Equivalent |
|---------|---------|-------------------|
| `natsutil/` | NATS publish/request/decode with protobuf encoding | `handler_base.nats_client` |
| `health/` | HTTP health + readiness server | `handler_base.health` |
| `telemetry/` | OTel traces + metrics setup | `handler_base.telemetry` |
| `config/` | Env-based configuration (struct tags) | `handler_base.config` (pydantic-settings) |
| `clients/` | HTTP clients for LLM, embeddings, reranker, STT, TTS | `handler_base.clients` |
| `handler/` | Typed NATS message handler with OTel + health wiring | `handler_base.handler` |
| `messages/` | Type aliases from generated protobuf stubs | `handler_base.messages` |
| `gen/messagespb/` | protoc-generated Go stubs (21 message types) | — |
| `proto/messages/v1/` | `.proto` schema source | — |

### Phase 2: Service Ports (COMPLETE)

All five services were rewritten in Go and migrated to handler-base v1.0.0 with the protobuf wire format:

| Order | Service | Status | Notes |
|-------|---------|--------|-------|
| 1 | `pipeline-bridge` | ✅ Done | NATS + HTTP + k8s API calls. Parameters changed to `map[string]string`. |
| 2 | `tts-module` | ✅ Done | NATS ↔ HTTP bridge. `[]*TTSVoiceInfo` pointer slices, `int32` casts. |
| 3 | `chat-handler` | ✅ Done | Core text pipeline. `EffectiveQuery()` standalone func, `int32(TopK)`. |
| 4 | `voice-assistant` | ✅ Done | Same pattern with `[]*DocumentSource` pointer slices. |
| 5 | `stt-module` | ✅ Done | HTTP-forwarding variant. `SessionId`/`SpeakerId` field renames, `int32(Sequence)`. |

`companions-frontend` was also migrated: 129 lines of duplicated type definitions were replaced with type aliases from handler-base/messages.

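That migration leans on Go type aliases (`type A = B`), which make the generated protobuf type and the local name the same type, so no conversions are needed at call sites. A self-contained illustration — `chatRequestPB` is a hypothetical stand-in for a generated stub, not the real one:

```go
package main

import "fmt"

// Stand-in for a protoc-generated struct (hypothetical; the real stubs
// live in handler-base's gen/messagespb package).
type chatRequestPB struct {
	Query string
	TopK  int32
}

// A type alias (note the "="): ChatRequest IS chatRequestPB, not a new
// named type — values flow between the two names with no conversion.
type ChatRequest = chatRequestPB

func main() {
	req := ChatRequest{Query: "hello", TopK: int32(5)}
	var same chatRequestPB = req // legal without a cast: identical type
	fmt.Println(same.Query, same.TopK)
}
```

Had the frontend kept duplicate struct definitions instead, every NATS boundary would need field-by-field copying; the alias removes that entirely while keeping the frontend's import surface small.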
### Phase 3: Cleanup (COMPLETE)

* ~~Archive Python versions of ported services~~ — Python handler-base remains for Ray Serve/Kubeflow
* CI pipelines use `golangci-lint` v2 with errcheck, govet, staticcheck, misspell, bodyclose, nilerr
* All repos pass `golangci-lint run ./...` and `go test ./...`
* Wire format upgraded from MessagePack to Protocol Buffers (ADR-0004 superseded)
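
The linter set above can be expressed as a `.golangci.yml` along these lines (a sketch assuming golangci-lint v2's config schema; errcheck, govet, and staticcheck are among v2's defaults, so only the extras need enabling):

```yaml
# Sketch of a golangci-lint v2 config matching the linter set listed above
version: "2"
linters:
  enable:
    - misspell
    - bodyclose
    - nilerr
  # errcheck, govet, and staticcheck are enabled by default in v2
```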
### What Stays in Python

| Repository | Reason |
|------------|--------|
| `ray-serve` | PyTorch, vLLM, sentence-transformers — core ML inference |
| `kuberay-images` | GPU runtime Docker images (ROCm, CUDA, IPEX) |
| `gradio-ui` | Gradio is Python-only; dev/testing tool, not production |
| `kubeflow/` | Kubeflow Pipelines SDK is Python-only |
| `mlflow/` | MLflow SDK integration (tracking + model registry) |
| `stt-module` (local Whisper variant) | PyTorch + openai-whisper on GPU |
| `spark-analytics-jobs` | PySpark (being replaced by Flink anyway) |

||||
## Links

* Related: [ADR-0003](0003-use-nats-for-messaging.md) — NATS as messaging backbone
* Related: [ADR-0004](0004-use-messagepack-for-nats.md) — MessagePack binary encoding
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend
* Related: [ADR-0013](0013-gitea-actions-for-ci.md) — Gitea Actions CI
* Related: [ADR-0014](0014-docker-build-best-practices.md) — Docker build best practices
* Related: [ADR-0019](0019-handler-deployment-strategy.md) — Handler deployment strategy
* Related: [ADR-0024](0024-ray-repository-structure.md) — Ray repository structure
* Related: [ADR-0046](0046-companions-frontend-architecture.md) — Companions frontend (Go reference)
* Related: [ADR-0051](0051-keda-event-driven-autoscaling.md) — KEDA autoscaling

448 decisions/0062-blender-mcp-3d-avatar-workflow.md Normal file
@@ -0,0 +1,448 @@

# BlenderMCP for 3D Avatar Creation via Kasm Workstation

* Status: superseded by [ADR-0063](0063-comfyui-3d-avatar-pipeline.md)
* Date: 2026-02-21
* Deciders: Billy
* Technical Story: Enable AI-assisted 3D avatar creation for companions-frontend using BlenderMCP in a Kasm Blender workstation with VS Code, storing assets in S3, serving locally from gravenhollow NFS and remotely via Cloudflare-cached RustFS

## Context and Problem Statement

The companions-frontend serves VRM avatar models for its Three.js-based 3D character rendering (see [ADR-0046](0046-companions-frontend-architecture.md)). Today the avatar library is limited to three models (`Seed-san.vrm`, `Aka.vrm`, `Midori.vrm`) — only one of which actually ships in the repo — and every model must be sourced or hand-sculpted externally.

Creating custom VRM avatars is a manual, time-intensive process: open Blender, sculpt/rig a character, export to VRM, iterate. There is no integration between the AI coding workflow (VS Code / Copilot) and Blender, so context switching between the editor and the 3D tool is constant.

How do we streamline custom 3D avatar creation for companions-frontend with AI assistance, while keeping assets durable and accessible across workstations?

## Decision Drivers

* The existing avatar pipeline is manual and disconnected from the development workflow
* BlenderMCP (v1.5.5, 17k+ GitHub stars) bridges AI assistants to Blender via the Model Context Protocol — enabling prompt-driven 3D modelling, material control, scene manipulation, and code execution inside Blender
* Kasm Workspaces already run in the cluster (`productivity` namespace) and support Docker-in-Docker with volume plugins for persistent storage
* VS Code supports MCP servers natively (GitHub Copilot agent mode), meaning the same editor used for code can drive Blender scene creation
* Custom volume mounts in Kasm map `/s3` to S3-compatible storage via the rclone Docker volume plugin — providing durable, off-node persistence
* Quobyte S3-compatible endpoint with the `kasm` bucket is the existing Kasm storage backend
* VRM models must ultimately land in the companions-frontend `/assets/models/` path at build time or be served from an external URL
* Final production models and animations should live on gravenhollow (all-SSD TrueNAS, dual 10GbE) for fast local serving via NFS
* Remote users accessing companions-chat through Cloudflare Tunnel need a CDN-cached path for multi-MB VRM downloads
* Models are write-once/read-many — ideal for aggressive caching
* gravenhollow already runs RustFS (S3-compatible) — exposing it via Cloudflare Tunnel gives CDN caching without a separate storage tier

## Considered Options

1. **BlenderMCP in Kasm Blender workstation + VS Code MCP client, assets in Quobyte S3 (`kasm` bucket)**
2. **Local Blender + BlenderMCP on a developer laptop**
3. **Hyper3D / Rodin cloud generation only (no Blender)**
4. **Manual Blender workflow (status quo)**

## Decision Outcome

Chosen option: **Option 1 — BlenderMCP in Kasm Blender workstation + VS Code MCP client, assets in Quobyte S3**, because it integrates AI-assisted modelling directly into the existing Kasm + VS Code workflow, stores assets durably in S3, and requires no additional infrastructure beyond what is already deployed.

### Positive Consequences

* AI-assisted 3D modelling — prompt-driven creation, material application, and scene manipulation inside Blender via MCP
* Zero context switching — VS Code agent mode drives Blender commands through the same editor used for code
* Persistent storage — VRM exports written to `/s3` survive session teardown and are available from any Kasm session or CI pipeline
* Existing infrastructure — Kasm agent, DinD, rclone volume plugin, Quobyte S3, gravenhollow NFS, and Cloudflare are all already deployed
* No image rebuild for new models — VRM files live on gravenhollow NFS, mounted read-only into the pod; add a model and update the allowlist
* LAN performance — all-SSD NFS with dual 10GbE delivers VRM files in <100ms
* Remote performance — RustFS exposed through Cloudflare Tunnel with CDN caching at 300+ global PoPs; no separate storage tier needed
* Poly Haven / Hyper3D integration — BlenderMCP supports downloading Poly Haven assets and generating models via Hyper3D Rodin, expanding the asset library
* VRM ecosystem — Blender VRM add-on exports directly to VRM 0.x/1.0 format consumed by `@pixiv/three-vrm` in companions-frontend
* Reproducible — Kasm workspace images are versioned; Blender + add-ons are pre-baked

### Negative Consequences

* BlenderMCP's `execute_blender_code` tool runs arbitrary Python in Blender — must trust AI-generated code or review before execution
* Socket-based communication (TCP 9876) between the MCP server and Blender add-on adds a failure mode
* VRM export quality depends on correct rigging/weight painting — AI can scaffold but manual touch-up may still be needed
* Kasm Blender image must be configured with both the BlenderMCP add-on and the VRM add-on pre-installed
* Telemetry is on by default in BlenderMCP — must disable via `DISABLE_TELEMETRY=true` for privacy
* Cache misses from remote users hit gravenhollow via the tunnel — negligible with immutable files and long TTLs

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         Developer Workstation                           │
│                                                                         │
│   ┌──────────────────────────────────┐                                  │
│   │  VS Code (local)                 │                                  │
│   │                                  │                                  │
│   │  GitHub Copilot (agent mode)     │                                  │
│   │          │                       │                                  │
│   │          ▼                       │                                  │
│   │  BlenderMCP Server (MCP)         │                                  │
│   │  (uvx blender-mcp)               │                                  │
│   │          │                       │                                  │
│   └──────────┼───────────────────────┘                                  │
│              │ TCP :9876 (JSON over socket)                             │
└──────────────┼──────────────────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              Kasm Blender Workstation (browser session)                 │
│                      kasm.daviestechlabs.io                             │
│                                                                         │
│   ┌──────────────────────────────────────────────────────┐              │
│   │  Blender 4.x                                         │              │
│   │                                                      │              │
│   │  Add-ons:                                            │              │
│   │   • BlenderMCP (addon.py) — socket server :9876      │              │
│   │   • VRM Add-on for Blender — import/export VRM       │              │
│   │                                                      │              │
│   │  ┌────────────────────────────────────────────────┐  │              │
│   │  │  /s3/blender-avatars/                          │  │              │
│   │  │  ├── projects/  (.blend source files)          │  │              │
│   │  │  ├── exports/   (.vrm exported models)         │  │              │
│   │  │  └── textures/  (shared texture lib)           │  │              │
│   │  └────────────────────────────────────────────────┘  │              │
│   └──────────────────────────────────────────────────────┘              │
│                          │                                              │
│                          │ rclone volume                                │
│                          │ plugin (S3)                                  │
└──────────────────────────┼──────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Quobyte S3 Endpoint                             │
│                           Bucket: kasm                                  │
│                                                                         │
│   kasm/blender-avatars/projects/Companion-A.blend                       │
│   kasm/blender-avatars/exports/Companion-A.vrm                          │
│   kasm/blender-avatars/textures/skin-tone-01.png                        │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
                  rclone sync (promotion)
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                 gravenhollow.lab.daviestechlabs.io                      │
│           (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB)              │
│                                                                         │
│   NFS: /mnt/gravenhollow/kubernetes/avatar-models/                      │
│        ├── Seed-san.vrm       (default model)                           │
│        ├── Aka.vrm            (Legend tier)                             │
│        ├── Midori.vrm         (Legend tier)                             │
│        ├── Companion-A.vrm    (custom, promoted from Kasm S3)           │
│        └── animations/        (shared animation clips)                  │
│                                                                         │
│   S3 (RustFS): avatar-models bucket                                     │
│   (same data as NFS dir, served via S3 API for Cloudflare Tunnel)       │
└──────────┬─────────────────────────────────┬────────────────────────────┘
           │                                 │
   NFS mount (nfs-fast)              S3 API (RustFS :30292)
     for pod volume                  via Cloudflare Tunnel
           │                                 │
           ▼                                 ▼
┌──────────────────────────┐  ┌──────────────────────────────────────────┐
│   companions-frontend    │  │        Cloudflare Tunnel + CDN           │
│   (Kubernetes pod)       │  │                                          │
│                          │  │  assets.daviestechlabs.io                │
│   /models/ volume mount  │  │    → envoy-external                      │
│   (nfs-fast PVC, RO)     │  │    → avatar-assets-svc (in-cluster)     │
│                          │  │    → gravenhollow RustFS :30292          │
│   Go FileServer:         │  │                                          │
│   /assets/models/ →      │  │  Cloudflare CDN caches at 300+ PoPs      │
│   serves from PVC        │  │  Cache-Control: public, max-age=31536000 │
│                          │  │  (immutable, versioned filenames)        │
└──────────┬───────────────┘  └──────────────────────┬───────────────────┘
           │                                         │
      LAN clients                              Remote clients
  companions-chat.lab...                     companions-chat via
  (envoy-internal, direct)                    Cloudflare Tunnel
           │                                         │
           └──────────────────┬──────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Browser (Three.js)                              │
│   AvatarManager.loadModel('/assets/models/Companion-A.vrm')             │
│                                                                         │
│   LAN:    fetch from companions-frontend pod (NFS-backed, ~10GbE)       │
│   Remote: fetch from assets.daviestechlabs.io (Cloudflare CDN-cached)   │
└─────────────────────────────────────────────────────────────────────────┘
```

## Workflow
### 1. Kasm Workspace Setup

The Kasm Blender workspace image is configured with:

| Component | Version | Purpose |
|-----------|---------|---------|
| Blender | 4.x | 3D modelling and sculpting |
| BlenderMCP add-on (`addon.py`) | 1.5.5 | Socket server for MCP commands |
| VRM Add-on for Blender | latest | Import/export VRM format |
| Python | 3.10+ | Blender scripting runtime |

The Kasm storage mapping mounts `/s3` via the rclone Docker volume plugin to the Quobyte S3 endpoint (`kasm` bucket). The sub-path `blender-avatars/` is used for all 3D asset work.
### 2. VS Code MCP Configuration

Add BlenderMCP as an MCP server in VS Code (`.vscode/mcp.json` or user settings):

```json
{
  "servers": {
    "blender": {
      "command": "uvx",
      "args": ["blender-mcp"],
      "env": {
        "BLENDER_HOST": "localhost",
        "BLENDER_PORT": "9876",
        "DISABLE_TELEMETRY": "true"
      }
    }
  }
}
```

When the Kasm session is accessed remotely, set `BLENDER_HOST` to the Kasm workstation's reachable address.
### 3. Avatar Creation Workflow

1. **Launch** the Kasm Blender workspace via `kasm.daviestechlabs.io`
2. **Enable** the BlenderMCP add-on in Blender → 3D View sidebar → "BlenderMCP" tab → "Connect to Claude"
3. **Open VS Code** with Copilot agent mode and the BlenderMCP MCP server running
4. **Prompt** the AI to create or modify avatars:
   - _"Create a humanoid character with anime-style proportions, blue hair, and a fantasy outfit"_
   - _"Apply a metallic gold material to the armor pieces"_
   - _"Set up the lighting for a character showcase render"_
   - _"Rig this character for VRM export with standard humanoid bones"_
5. **Export** the finished model to VRM via the VRM add-on (or via BlenderMCP `execute_blender_code` calling the VRM export operator)
6. **Save** the `.vrm` to `/s3/blender-avatars/exports/` and the `.blend` source to `/s3/blender-avatars/projects/`
7. **Import** the VRM into companions-frontend — copy to `assets/models/`, update the allowlists in `internal/database/database.go` and `static/js/avatar.js`

### 4. Asset Pipeline (Kasm S3 → gravenhollow → production)

| Stage | Action |
|-------|--------|
| **Create** | AI-assisted modelling + VRM export in Kasm Blender → `/s3/blender-avatars/exports/*.vrm` |
| **Store** | rclone syncs `/s3` to Quobyte S3 `kasm` bucket automatically |
| **Promote** | `rclone copy quobyte:kasm/blender-avatars/exports/Model.vrm gravenhollow-nfs:/avatar-models/` (manual or CI) |
| **Register** | Add model path to `AllowedAvatarModels` in Go and JS allowlists, commit to repo |
| **Deploy** | Flux rolls out updated companions-frontend config; model already available on NFS PVC — no image rebuild needed |
| **CDN** | Model immediately available via `assets.daviestechlabs.io` — Cloudflare Tunnel proxies to RustFS, CDN caches at edge |

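The promote stage can be scripted. A minimal sketch that assembles the rclone invocation used above (the `promote_cmd`/`promote` helpers are hypothetical; the remote names `quobyte:` and `gravenhollow-nfs:` are the ones shown in the table and must exist in the local rclone config):

```python
import subprocess

def promote_cmd(model: str) -> list[str]:
    """Build the rclone command that promotes one exported VRM
    from the Quobyte kasm bucket to gravenhollow's avatar-models dir."""
    src = f"quobyte:kasm/blender-avatars/exports/{model}"
    dst = "gravenhollow-nfs:/avatar-models/"
    return ["rclone", "copy", src, dst]

def promote(model: str) -> None:
    # Run the copy; requires rclone and both remotes configured locally.
    subprocess.run(promote_cmd(model), check=True)
```

The same helper can back a CI job that promotes every new file in `exports/`.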
### 5. Deployment and Storage Architecture
#### Local Serving (LAN users)

Companions-frontend currently serves VRM models via `http.FileServer(http.Dir("assets"))` from the container filesystem. This bakes models into the image and requires a rebuild to add new avatars.

The new approach mounts avatar models from gravenhollow via an `nfs-fast` PVC:

```yaml
# PersistentVolumeClaim for avatar models
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: avatar-models
  namespace: ai-ml
spec:
  storageClassName: nfs-fast
  accessModes: [ReadOnlyMany]
  resources:
    requests:
      storage: 10Gi
```

The pod mounts this PVC at `/models` and the Go server serves it at `/assets/models/`:

```go
// Replace embedded assets with NFS-backed volume
mux.Handle("/assets/models/", http.StripPrefix("/assets/models/",
    http.FileServer(http.Dir("/models"))))
```

Benefits:

- **No image rebuild** to add/update models — write to gravenhollow NFS, pod sees it immediately (with `actimeo=600` cache, within 10 minutes)
- **All-SSD + dual 10GbE** — VRM files (typically 5–30 MB) load in <100ms on LAN
- **ReadOnlyMany** — multiple replicas can share the same PVC
- Source `.blend` files and textures remain on Quobyte S3 (Kasm bucket) for the creation workflow; only promoted VRM exports land on gravenhollow

#### Remote Serving (Cloudflare-cached RustFS)

Companions-chat is accessed externally via Cloudflare Tunnel → `envoy-internal`. Rather than duplicating assets to a separate storage tier (e.g., Cloudflare R2), gravenhollow's RustFS S3 endpoint is exposed directly through the Cloudflare Tunnel with a dedicated hostname. Cloudflare's CDN automatically caches responses at edge PoPs — since VRM files are immutable with year-long TTLs, virtually all requests are served from cache.

| Aspect | Detail |
|--------|--------|
| **Origin** | gravenhollow RustFS `avatar-models` bucket (`:30292`, same data as NFS dir) |
| **Public hostname** | `assets.daviestechlabs.io` (Cloudflare DNS, orange-clouded) |
| **Tunnel routing** | Cloudflare Tunnel → `envoy-external` → `avatar-assets-svc` → gravenhollow RustFS |
| **CDN caching** | Cloudflare CDN caches at 300+ global PoPs; `Cache-Control: public, max-age=31536000, immutable` |
| **Egress** | Cloudflare-proxied traffic has no bandwidth surcharge |
| **Auth** | Public read (models are not sensitive); RustFS write credentials stay internal |
| **No sync needed** | Single source of truth — NFS and RustFS serve the same data from gravenhollow |

##### In-Cluster Proxy Service

An ExternalName or Endpoints service proxies cluster traffic to gravenhollow's RustFS endpoint so the HTTPRoute can reference it:

```yaml
# Service pointing to gravenhollow RustFS for avatar assets
apiVersion: v1
kind: Service
metadata:
  name: avatar-assets
  namespace: ai-ml
spec:
  type: ExternalName
  externalName: gravenhollow.lab.daviestechlabs.io
  ports:
    - port: 30292
      protocol: TCP
```

##### HTTPRoute (Cloudflare Tunnel → RustFS)

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: avatar-assets
  namespace: ai-ml
  annotations:
    external-dns.alpha.kubernetes.io/hostname: assets.daviestechlabs.io
spec:
  hostnames:
    - assets.daviestechlabs.io
  parentRefs:
    - name: envoy-external
      namespace: network
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /avatar-models/
      backendRefs:
        - name: avatar-assets
          port: 30292
      filters:
        - type: ResponseHeaderModifier
          responseHeaderModifier:
            set:
              - name: Cache-Control
                value: "public, max-age=31536000, immutable"
              - name: Access-Control-Allow-Origin
                value: "https://companions-chat.daviestechlabs.io"
```

Cloudflare Tunnel picks up `assets.daviestechlabs.io` via the existing wildcard ingress rule (`*.daviestechlabs.io → envoy-external`). The CDN caches based on the `Cache-Control` header — after the first request per PoP, all subsequent loads are served from Cloudflare's edge.
##### Client-Side Routing

The frontend detects whether the user is on LAN or remote and routes model fetches accordingly:

```javascript
// avatar.js — model URL resolution
function resolveModelURL(path) {
  // LAN users: serve from the Go server (NFS-backed, same origin)
  // Remote users: serve from Cloudflare-cached RustFS
  const isLAN = location.hostname.endsWith('.lab.daviestechlabs.io');
  if (isLAN) return path; // e.g. /assets/models/Companion-A.vrm
  return `https://assets.daviestechlabs.io/avatar-models/${path.split('/').pop()}`;
  // → https://assets.daviestechlabs.io/avatar-models/Companion-A.vrm
}
```

Alternatively, the Go server can set the model base URL via a template variable based on the `Host` header, keeping the logic server-side.
#### Versioning Strategy

VRM files are immutable once promoted — updated models get a new filename (e.g., `Companion-A-v2.vrm`) rather than overwriting. This ensures:

- Cloudflare CDN cache never serves stale content
- Rollback is trivial — point the allowlist back to the previous version
- Browser `Cache-Control: immutable` works correctly

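The rename-don't-overwrite rule is mechanical enough to capture in a small helper. A sketch (the `next_version` helper is hypothetical; the `Name-vN.vrm` scheme follows the `Companion-A-v2.vrm` example above):

```python
import re

def next_version(filename: str) -> str:
    """Return the next immutable filename for an updated model:
    Companion-A.vrm -> Companion-A-v2.vrm -> Companion-A-v3.vrm.
    Existing files are never overwritten, so CDN caches stay valid."""
    m = re.fullmatch(r"(.+)-v(\d+)\.vrm", filename)
    if m:
        return f"{m.group(1)}-v{int(m.group(2)) + 1}.vrm"
    stem = filename.removesuffix(".vrm")
    return f"{stem}-v2.vrm"
```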
#### Storage Tier Summary

| Location | Purpose | Tier | Access |
|----------|---------|------|--------|
| Quobyte S3 (`kasm` bucket) | Working files: `.blend`, textures, WIP exports | Kasm rclone volume | Kasm sessions only |
| gravenhollow NFS (`/avatar-models/`) | Production VRM models + animations | `nfs-fast` PVC (RO) | companions-frontend pod, LAN |
| gravenhollow RustFS S3 (`avatar-models`) | Same data as NFS, exposed to Cloudflare Tunnel for remote users | S3 API via HTTPRoute | Cloudflare CDN-cached, global |

## BlenderMCP Capabilities Used

| MCP Tool | Avatar Workflow Use |
|----------|-------------------|
| `get_scene_info` | Inspect current scene before modifications |
| `create_object` | Scaffold base meshes for characters |
| `modify_object` | Adjust proportions, positions, bone placement |
| `set_material` | Apply skin, hair, clothing materials |
| `execute_blender_code` | Run VRM export scripts, batch operations, custom rigging |
| `get_screenshot` | AI reviews viewport to understand current state |
| `poly_haven_download` | Fetch HDRIs, textures for environment/materials |
| `hyper3d_generate` | Generate base 3D models from text prompts via Hyper3D Rodin |

## Security Considerations

* **Code execution:** BlenderMCP's `execute_blender_code` runs arbitrary Python in Blender. The Kasm session is sandboxed (DinD container with no cluster access), limiting blast radius. Always save before executing AI-generated code.
* **Telemetry:** BlenderMCP collects anonymous telemetry by default. Disabled via `DISABLE_TELEMETRY=true` in the MCP server config.
* **Network:** The TCP socket (port 9876) between the MCP server and Blender add-on is local to the session. If accessed remotely, ensure the connection is tunnelled or restricted.
* **S3 credentials:** rclone volume plugin credentials are managed via Kasm storage mappings and the existing `kasm-agent` ExternalSecret — no new secrets required.
* **RustFS exposure:** The `avatar-models` RustFS bucket is exposed read-only through Cloudflare Tunnel. RustFS write credentials remain internal. The HTTPRoute only routes GET requests to the bucket path — no write operations are reachable externally.
* **Public assets:** Avatar models are public assets (served to any authenticated companions-chat user). No sensitive data in VRM files. CORS restricts to `companions-chat.daviestechlabs.io` origin.
* **Model allowlist:** Even though models are served from NFS/RustFS, the server-side and client-side allowlists in companions-frontend gate which models users can actually select. Uploading a VRM to gravenhollow does not make it available without a code change.

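The allowlist gate amounts to a set-membership check before any model path reaches the loader. A sketch of the idea (names are illustrative; the real lists live in `internal/database/database.go` and `static/js/avatar.js`):

```python
# Illustrative mirror of the companions-frontend model allowlist.
# A VRM merely present on NFS is NOT selectable without a code change.
ALLOWED_MODELS = {
    "Seed-san.vrm",
    "Aka.vrm",
    "Midori.vrm",
    "Companion-A.vrm",
}

DEFAULT_MODEL = "Seed-san.vrm"

def select_model(requested: str) -> str:
    """Return the requested model only if allowlisted, else the default."""
    return requested if requested in ALLOWED_MODELS else DEFAULT_MODEL
```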
## Pros and Cons of the Options

### Option 1 — BlenderMCP in Kasm + VS Code + Quobyte S3 + gravenhollow (NFS + RustFS via Cloudflare)

* Good, because AI-assisted modelling reduces manual effort for avatar creation
* Good, because assets persist in S3 across sessions and are accessible from CI
* Good, because no new infrastructure — Kasm, rclone, Quobyte, gravenhollow, Cloudflare Tunnel are all already deployed
* Good, because VS Code MCP integration means one editor for code and 3D work
* Good, because Kasm sandboxes Blender execution away from the cluster
* Good, because NFS-fast serving decouples model assets from container images (no rebuild to add models)
* Good, because RustFS through Cloudflare Tunnel provides CDN caching with zero additional storage tiers — no R2 bucket, no sync CronJob, no extra credentials
* Good, because single source of truth — gravenhollow serves both LAN (NFS) and remote (RustFS → Cloudflare CDN) from the same data
* Good, because immutable versioned filenames enable aggressive caching and trivial rollback
* Good, because models are available to remote users immediately after promotion (no sync delay)
* Bad, because BlenderMCP is a third-party tool with arbitrary code execution
* Bad, because socket communication adds latency for remote Kasm sessions
* Bad, because VRM rigging quality may require manual adjustment after AI scaffolding
* Bad, because cache misses hit gravenhollow via the tunnel (negligible with immutable files + long TTLs)

### Option 2 — Local Blender + BlenderMCP on developer laptop

* Good, because lowest latency (everything local)
* Good, because no Kasm dependency
* Bad, because assets are local — no durable S3 storage without manual sync
* Bad, because Blender + add-ons must be installed on every dev machine
* Bad, because not reproducible across machines

### Option 3 — Hyper3D / Rodin cloud generation only

* Good, because no Blender installation needed
* Good, because fully prompt-driven model generation
* Bad, because limited control over output — no fine-tuning of materials, rigging, or proportions
* Bad, because the Hyper3D free tier has daily generation limits
* Bad, because generated models require post-processing for VRM compliance (humanoid rig, expressions, visemes)
* Bad, because it adds a vendor dependency for a core asset pipeline

### Option 4 — Manual Blender workflow (status quo)

* Good, because full manual control
* Good, because no new tooling
* Bad, because slow — no AI assistance for repetitive modelling tasks
* Bad, because no integration with the development workflow
* Bad, because assets stored ad-hoc with no structured pipeline to companions-frontend

## Links

* Related to [ADR-0046](0046-companions-frontend-architecture.md) (companions-frontend architecture — Three.js + VRM avatars)
* Related to [ADR-0026](0026-storage-strategy.md) (storage strategy — gravenhollow NFS-fast, Quobyte S3, rclone)
* Related to [ADR-0044](0044-dns-and-external-access.md) (DNS and external access — Cloudflare Tunnel, split-horizon)
* Related to [ADR-0049](0049-self-hosted-productivity-suite.md) (Kasm Workspaces)
* Related to [ADR-0059](0059-mac-mini-ray-worker.md) (waterdeep as local AI agent — primary 3D creation workstation with Metal GPU)
* [BlenderMCP GitHub](https://github.com/ahujasid/blender-mcp)
* [VRM Add-on for Blender](https://vrm-addon-for-blender.info/en/)
* [VRM Specification](https://vrm.dev/en/)
* [@pixiv/three-vrm](https://github.com/pixiv/three-vrm) (runtime loader used in companions-frontend)
* [Poly Haven](https://polyhaven.com/) (free 3D assets, HDRIs, textures)
* [Hyper3D Rodin](https://hyper3d.ai/) (AI 3D model generation)
* [Cloudflare Tunnel Docs](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/)
* [Cloudflare CDN Cache Rules](https://developers.cloudflare.com/cache/)

483 decisions/0063-comfyui-3d-avatar-pipeline.md Normal file
@@ -0,0 +1,483 @@

# ComfyUI Image-to-3D Avatar Pipeline with TRELLIS + UniRig

* Status: proposed
* Date: 2026-02-24
* Deciders: Billy
* Technical Story: Replace the manual BlenderMCP 3D avatar creation workflow with an automated, GPU-accelerated image-to-rigged-3D-model pipeline using ComfyUI, TRELLIS 2-4B, and UniRig — running on a personal desktop (NVIDIA RTX 4070) as an on-demand Ray worker, with direct MLflow logging and rclone asset promotion

## Context and Problem Statement

The companions-frontend serves VRM avatar models for Three.js-based 3D character rendering ([ADR-0046](0046-companions-frontend-architecture.md)). The previous approach ([ADR-0062](0062-blender-mcp-3d-avatar-workflow.md)) proposed using BlenderMCP in a Kasm workstation or on waterdeep ([ADR-0059](0059-mac-mini-ray-worker.md)) for AI-assisted avatar creation. While BlenderMCP bridges VS Code to Blender, the workflow is fundamentally **interactive and manual** — an operator must prompt the AI, review each sculpting step, and hand-tune rigging and VRM export. This is slow, non-reproducible, and doesn't scale.

Meanwhile, the state of the art in image-to-3D generation has matured significantly:

- **TRELLIS** (Microsoft, CVPR'25 Spotlight, 12k+ GitHub stars) generates high-quality textured 3D meshes from a single image in seconds using Structured 3D Latents (SLAT) — with models up to 2B parameters
- **UniRig** (Tsinghua/Tripo, SIGGRAPH'25, 1.4k+ GitHub stars) automatically generates topologically valid skeletons and skinning weights for arbitrary 3D models using autoregressive transformers — the first model to rig humans, animals, and objects with a single unified framework
- **ComfyUI-3D-Pack** (3.7k+ GitHub stars) provides battle-tested ComfyUI nodes for TRELLIS, 3D Gaussian Splatting, mesh processing, and GLB/VRM export — enabling node-graph-based automation without custom code

Together, these tools enable a fully automated **image → 3D mesh → rigged model → VRM** pipeline that eliminates manual Blender work for the common case, produces reproducible results, and integrates with the existing MLflow + Ray stack.

A personal desktop (Ryzen 9 7950X, 64 GB DDR5, NVIDIA RTX 4070 12 GB VRAM) running Arch Linux is available as an **on-demand external Ray worker** — it won't be a permanent cluster member (it's not running Talos), but can join the Ray cluster via `ray start` when 3D generation workloads need to run. This adds a 5th GPU to the fleet specifically for 3D generation, without disrupting the stable inference allocations.

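Joining and leaving the cluster is just the `ray start` / `ray stop` CLI. A sketch that builds the join command the desktop would run (the head address is an assumption for illustration; the `ray_join_cmd` helper is hypothetical):

```python
def ray_join_cmd(head_address: str, num_gpus: int = 1) -> list[str]:
    """Build the command a non-Talos desktop runs to join the Ray
    cluster as an on-demand worker, advertising its GPU.
    Leaving again is simply: ray stop"""
    return [
        "ray", "start",
        f"--address={head_address}",
        f"--num-gpus={num_gpus}",
    ]
```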
How do we build an automated, reproducible image-to-VRM pipeline that leverages the desktop's CUDA GPU and integrates with the existing AI/ML platform for experiment tracking and asset serving?
## Decision Drivers

* BlenderMCP workflow from ADR-0062 is interactive and non-reproducible — every avatar requires an operator in the loop
* TRELLIS generates production-quality textured meshes from a single reference image in ~30 seconds on a 12 GB GPU
* UniRig automatically rigs arbitrary 3D models with skeleton + skinning weights — no manual weight painting
* ComfyUI-3D-Pack bundles TRELLIS, mesh processing, and GLB export as composable nodes — enabling visual pipeline authoring
* The desktop's RTX 4070 (12 GB VRAM) falls below TRELLIS's recommended 16 GB but is workable with fp16/attention optimizations, and exceeds UniRig's 8 GB requirement
* The desktop can join/leave the Ray cluster on demand — no permanent infrastructure commitment
* MLflow tracks generation parameters, quality metrics, and output artifacts for reproducibility — the desktop logs directly to the cluster's MLflow service over HTTP
* waterdeep (Mac Mini M4 Pro) remains available for interactive Blender touch-up on models that need manual refinement
* VRM export, asset promotion to gravenhollow, and serving architecture from ADR-0062 remain valid and are reused

## Considered Options
|
||||
|
||||
1. **ComfyUI + TRELLIS + UniRig on desktop Ray worker, with direct MLflow logging and rclone promotion**
|
||||
2. **BlenderMCP interactive workflow** (ADR-0062, superseded)
|
||||
3. **Cloud-hosted 3D generation (Hyper3D Rodin, Meshy, etc.)**
|
||||
4. **Run TRELLIS + UniRig directly as Ray Serve deployments in-cluster**
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: **Option 1 — ComfyUI + TRELLIS + UniRig on desktop Ray worker**, because it automates the entire image-to-rigged-model pipeline without operator interaction, leverages purpose-built state-of-the-art models (TRELLIS for generation, UniRig for rigging), and uses the desktop's RTX 4070 as on-demand GPU capacity without disrupting the stable inference cluster. ComfyUI's visual node graph provides the pipeline orchestration directly on the desktop — no Kubernetes-side orchestrator needed since all compute is local to one machine.
|
||||
|
||||
waterdeep retains its role as an interactive Blender workstation for manual refinement of auto-generated models when needed — but the expectation is that most avatars pass through the automated pipeline without manual touch-up.
|
||||
|
||||
### Positive Consequences
|
||||
|
||||
* **Fully automated pipeline** — image → textured mesh → rigged model → VRM with no operator in the loop
|
||||
* **Reproducible** — same image + seed produces identical output; parameters tracked in MLflow
|
||||
* **Fast** — TRELLIS generates a mesh in ~30s, UniRig rigs it in ~60s; end-to-end under 5 minutes including VRM export
|
||||
* **On-demand GPU** — desktop joins Ray cluster only when needed; no standing resource cost
|
||||
* **Composable** — ComfyUI node graph can be extended with additional 3D processing nodes (Hunyuan3D, TripoSG, Stable3DGen) without code changes
|
||||
* **Quality** — TRELLIS (CVPR'25) and UniRig (SIGGRAPH'25) represent current state of the art
|
||||
* **MLflow integration** — generation parameters, mesh quality metrics, and output artifacts are logged directly to the cluster's MLflow service over HTTP
|
||||
* **Simple orchestration** — ComfyUI node graph handles the pipeline; no Kubernetes-side orchestrator needed for a single-GPU linear workflow
|
||||
* **Reuses existing serving architecture** — gravenhollow NFS + RustFS CDN serving from ADR-0062 is unchanged
|
||||
* **waterdeep fallback** — interactive Blender + BlenderMCP on waterdeep for models needing hand-tuning
|
||||
|
||||
### Negative Consequences
|
||||
|
||||
* Desktop must be powered on and `ray start` must be run manually to participate in the pipeline
|
||||
* TRELLIS requires NVIDIA CUDA — cannot run on the existing AMD/Intel GPU fleet (khelben, drizzt, danilo)
|
||||
* ComfyUI adds a Python dependency stack (PyTorch, CUDA, spconv, flash-attn) to maintain on the desktop
|
||||
* RTX 4070 has 12 GB VRAM — large TRELLIS models (2B params) may require fp16 + attention optimization; the 1.2B image-to-3D model fits comfortably
|
||||
* Auto-generated VRM models may still need manual expression/viseme morph targets for full companions-frontend lip-sync support
|
||||
* Desktop is not managed by GitOps/Kubernetes — Ansible or manual setup
|
||||
|
||||
## Pros and Cons of the Options
|
||||
|
||||
### Option 1 — ComfyUI + TRELLIS + UniRig on desktop Ray worker
|
||||
|
||||
* Good, because fully automated image-to-VRM pipeline eliminates manual sculpting
|
||||
* Good, because TRELLIS (CVPR'25) and UniRig (SIGGRAPH'25) are state-of-the-art, MIT-licensed
|
||||
* Good, because ComfyUI-3D-Pack provides tested node implementations — no custom TRELLIS integration code
|
||||
* Good, because desktop GPU is free/idle capacity with no cluster impact
|
||||
* Good, because MLflow integration reuses existing experiment tracking infrastructure
|
||||
* Good, because ComfyUI can queue and batch-generate multiple avatars unattended
|
||||
* Bad, because desktop availability is not guaranteed (must be manually started)
|
||||
* Bad, because CUDA-only — doesn't leverage the existing ROCm/Intel fleet
|
||||
* Bad, because auto-rigging quality varies by model topology — some models may need manual refinement
|
||||
|
||||
### Option 2 — BlenderMCP interactive workflow (ADR-0062)
|
||||
|
||||
* Good, because maximum creative control via VS Code + Copilot
|
||||
* Good, because Kasm provides browser-based access from anywhere
|
||||
* Bad, because every avatar requires an operator in the loop — slow and non-reproducible
|
||||
* Bad, because Blender sculpting from scratch is time-intensive even with AI assistance
|
||||
* Bad, because Kasm runs Blender CPU-only (no GPU acceleration inside DinD)
|
||||
* Bad, because no MLflow tracking or reproducibility
|
||||
|
||||
### Option 3 — Cloud-hosted 3D generation
|
||||
|
||||
* Good, because no local GPU required
|
||||
* Good, because some services (Meshy, Hyper3D Rodin) offer API access
|
||||
* Bad, because vendor dependency for a core asset pipeline
|
||||
* Bad, because free tiers have daily limits; paid tiers add recurring cost
|
||||
* Bad, because limited control over output quality, rigging, and VRM compliance
|
||||
* Bad, because data leaves the homelab network
|
||||
|
||||
### Option 4 — TRELLIS + UniRig as in-cluster Ray Serve deployments
|
||||
|
||||
* Good, because fully integrated with existing Ray cluster
|
||||
* Good, because no desktop dependency
|
||||
* Bad, because TRELLIS requires NVIDIA CUDA — no CUDA GPUs in-cluster have enough VRAM (elminster has 8 GB, needs 12–16 GB)
|
||||
* Bad, because would require purchasing new in-cluster NVIDIA hardware
|
||||
* Bad, because 3D generation is batch/occasional, not real-time serving — Ray Serve's always-on model is wasteful
|
||||
* Bad, because TRELLIS's CUDA dependencies (spconv, flash-attn, nvdiffrast, kaolin) conflict with existing Ray worker images
## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Kubeflow Pipelines (namespace: kubeflow) │
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ 3d_avatar_generation_pipeline │ │
│ │ │ │
│ │ 1. prepare_reference Load/generate reference image from prompt │ │
│ │ │ (optional: use vLLM + Stable Diffusion) │ │
│ │ ▼ │ │
│ │ 2. generate_3d_mesh Submit RayJob → desktop ComfyUI worker │ │
│ │ │ TRELLIS image-large (1.2B) → GLB mesh │ │
│ │ ▼ │ │
│ │ 3. auto_rig Submit RayJob → desktop UniRig worker │ │
│ │ │ UniRig skeleton + skinning → rigged FBX/GLB │ │
│ │ ▼ │ │
│ │ 4. convert_to_vrm Blender CLI (headless) on desktop or cluster │ │
│ │ │ Import rigged GLB → configure VRM metadata │ │
│ │ ▼ → export .vrm │ │
│ │ 5. validate_vrm Check humanoid rig, expressions, visemes │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 6. promote_to_storage rclone copy → gravenhollow RustFS S3 │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 7. log_to_mlflow Parameters, metrics, artifacts → MLflow │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────┬──────────────────────────────────────┘
│
RayJob CR (ephemeral)
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ desktop (Arch Linux · Ryzen 9 7950X · 64 GB DDR5 · RTX 4070 12 GB) │
│ On-demand Ray worker (ray start --address=<ray-head>:6379) │
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ ComfyUI + Custom Nodes │ │
│ │ │ │
│ │ ComfyUI-3D-Pack: │ │
│ │ • TRELLIS image-large (1.2B) — image → textured GLB mesh │ │
│ │ • Mesh processing nodes — simplify, UV unwrap, texture bake │ │
│ │ • 3D preview — viewport render for quality check │ │
│ │ • GLB/OBJ/PLY export │ │
│ │ │ │
│ │ UniRig: │ │
│ │ • Skeleton prediction — autoregressive bone hierarchy │ │
│ │ • Skinning weights — bone-point cross-attention │ │
│ │ • Merge — skeleton + skin + original mesh → rigged model │ │
│ │ • Supports GLB, FBX, OBJ input/output │ │
│ │ │ │
│ │ Blender 4.x (headless CLI): │ │
│ │ • VRM Add-on for Blender — GLB → VRM conversion │ │
│ │ • Humanoid rig mapping, expression morphs, viseme config │ │
│ │ • Batch export via bpy scripting │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ GPU: NVIDIA RTX 4070 12 GB (CUDA 12.x) │
│ Ray: worker node with resource label {"nvidia_gpu": 1, "rtx4070": 1} │
│ Storage: ~/comfyui-3d/ (working dir), rclone → gravenhollow S3 │
└──────────────────────────────────┬──────────────────────────────────────────┘
│
rclone (S3)
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ gravenhollow.lab.daviestechlabs.io │
│ (TrueNAS Scale · All-SSD · Dual 10GbE · 12.2 TB) │
│ │
│ NFS: /mnt/gravenhollow/kubernetes/avatar-models/ │
│ ├── Seed-san.vrm (default model) │
│ ├── Generated-A-v1.vrm (auto-generated via pipeline) │
│ └── animations/ (shared animation clips) │
│ │
│ S3 (RustFS): avatar-models bucket │
│ (same data, served via Cloudflare Tunnel for remote users) │
└──────────────────────────┬──────────────────────────────────────────────────┘
│
┌────────────┴───────────────┐
│ │
NFS (nfs-fast PVC) Cloudflare Tunnel
│ (assets.daviestechlabs.io)
▼ │
┌──────────────────────────┐ ▼
│ companions-frontend │ ┌──────────────────────────┐
│ (Kubernetes pod) │ │ Remote users (CDN-cached │
│ LAN users │ │ via Cloudflare edge) │
└──────────────────────────┘ └──────────────────────────┘
```
### Ray Cluster Integration

The desktop joins the existing KubeRay-managed cluster as an external worker. It is **not** a Talos node and not managed by Kubernetes — it connects to the Ray head node's GCS port directly:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Ray Cluster (KubeRay RayService) │
│ │
│ Head: Ray head pod (in-cluster) │
│ GCS port: 6379 (exposed via NodePort or LoadBalancer) │
│ │
│ In-Cluster Workers (permanent, managed by KubeRay): │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ khelben │ │elminster │ │ drizzt │ │ danilo │ │
│ │Strix Halo│ │RTX 2070 │ │Radeon 680│ │Intel Arc │ │
│ │ ROCm │ │ CUDA │ │ ROCm │ │ Intel │ │
│ │ /llm │ │/whisper │ │/embeddings│ │/reranker │ │
│ │ │ │ /tts │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ External Worker (on-demand, self-managed): │
│ ┌──────────────────────────────────────────────────┐ │
│ │ desktop (Arch Linux, external) │ │
│ │ RTX 4070 12 GB · CUDA │ │
│ │ ComfyUI + TRELLIS + UniRig + Blender CLI │ │
│ │ Resource labels: {"nvidia_gpu": 1, "3d_gen": 1} │ │
│ │ Joins via: ray start --address=<head>:6379 │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
The existing inference deployments (`/llm`, `/whisper`, `/tts`, `/embeddings`, `/reranker`) are unaffected — they are pinned to their respective in-cluster GPU nodes via Ray resource labels. The desktop's `3d_gen` resource label ensures only 3D generation RayJobs get scheduled there.
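Because the desktop advertises a plain Ray custom resource, any submission that requests `{"3d_gen": 1}` is pinned to it. A minimal sketch against the Ray Jobs HTTP API (the dashboard address and entrypoint script are assumptions, not real cluster values):

```python
"""Sketch: submit a 3D-generation job pinned to the desktop worker.

Assumptions: the Ray dashboard (Jobs API) address and the entrypoint
script name are illustrative, not real cluster values.
"""
import json
import urllib.request

RAY_JOBS_API = "http://ray-head.lab.daviestechlabs.io:8265/api/jobs/"  # assumed

def build_job_payload(entrypoint: str) -> dict:
    # entrypoint_resources makes Ray place the job only on a node that
    # advertises the custom "3d_gen" resource, i.e. the desktop worker.
    return {
        "entrypoint": entrypoint,
        "entrypoint_resources": {"3d_gen": 1},
    }

def submit(payload: dict) -> bytes:
    # Called from a pipeline driver; not invoked here.
    req = urllib.request.Request(
        RAY_JOBS_API,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

payload = build_job_payload("python generate_avatar.py --image ref.png")
```

The same effect is available in-process with `@ray.remote(resources={"3d_gen": 1})` on a task or actor.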
### Ray Service Multiplexing

The desktop's RTX 4070 can **time-share between inference overflow and 3D generation** when idle. When no 3D generation jobs are queued, the desktop can optionally serve as overflow capacity for inference workloads:

| Mode | When | What runs on desktop |
|------|------|---------------------|
| **3D generation** | ComfyUI workflow triggered (manually or via API) | ComfyUI + TRELLIS → UniRig → Blender VRM export |
| **Inference overflow** | Manually enabled, high-traffic periods | vLLM (secondary), Whisper, or TTS replica |
| **Idle** | Desktop powered on, no jobs | Ray worker connected but idle (0 resource cost) |

Mode switching is managed by Ray's resource scheduling — 3D jobs request `{"3d_gen": 1}` and inference jobs request their specific GPU labels. When the desktop is off, all workloads continue on the existing in-cluster fleet with no impact.
## Implementation Plan

### 1. Desktop Environment Setup

```bash
# Install NVIDIA drivers + CUDA toolkit (Arch Linux)
sudo pacman -S nvidia nvidia-utils cuda cudnn

# Install Python environment (uv per ADR-0012)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create project directory
mkdir -p ~/comfyui-3d && cd ~/comfyui-3d

# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt

# Install ComfyUI-3D-Pack (includes TRELLIS nodes)
cd custom_nodes
git clone https://github.com/MrForExample/ComfyUI-3D-Pack.git
cd ComfyUI-3D-Pack
uv pip install -r requirements.txt
python install.py

# Install UniRig
cd ~/comfyui-3d
git clone https://github.com/VAST-AI-Research/UniRig.git
cd UniRig
uv pip install torch torchvision
uv pip install -r requirements.txt
uv pip install spconv-cu124 # Match CUDA version
uv pip install flash-attn --no-build-isolation

# Install Blender (headless CLI for VRM export)
sudo pacman -S blender
# Install VRM Add-on (must run under Blender's bundled Python — a plain
# system `python -c "import bpy"` fails outside Blender)
blender --background --python-expr "import bpy, os; bpy.ops.preferences.addon_install(filepath=os.path.abspath('UniRig/blender/add-on-vrm-v2.20.77_modified.zip'))"

# Install rclone for asset promotion
sudo pacman -S rclone
rclone config create gravenhollow s3 \
provider=Other \
endpoint=https://gravenhollow.lab.daviestechlabs.io:30292 \
access_key_id=<key> \
secret_access_key=<secret>

# Install Ray for cluster joining
uv pip install "ray[default]"
```
### 2. Ray Worker Configuration

```bash
# Join the Ray cluster on demand
# Ray head GCS port must be exposed (NodePort 30637 or similar)
ray start \
--address=<ray-head-external-ip>:6379 \
--num-cpus=16 \
--num-gpus=1 \
--resources='{"3d_gen": 1, "rtx4070": 1}' \
--node-name=desktop

# Verify connection
ray status # Should show desktop as a connected worker
```
The Ray head's GCS port needs to be reachable from the desktop. Options:

- **NodePort**: Expose port 6379 as a NodePort (e.g., 30637) on a cluster node
- **Tailscale/WireGuard**: If the desktop is on a different network segment
- **Direct LAN**: If desktop and cluster are on the same 192.168.100.0/24 subnet
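Whichever route is used, a quick preflight from the desktop confirms it actually reaches the GCS port before running `ray start`; a minimal sketch (the hostname in the example comment is an assumption):

```python
"""Sketch: preflight check that the Ray head's GCS port is reachable
from the desktop before running `ray start`."""
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    # A plain TCP connect tells us whether the NodePort, tunnel, or
    # direct-LAN route to the GCS port is working.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: port_open("ray-head.lab.daviestechlabs.io", 6379)
```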
### 3. ComfyUI Workflow (Node Graph)

The ComfyUI workflow JSON defines the image-to-GLB pipeline:

```
[Load Image] → [TRELLIS Image-to-3D] → [Mesh Simplify] → [Texture Bake]
│
▼
[Save GLB]
│
▼
[UniRig Skeleton Prediction]
│
▼
[UniRig Skinning Weights]
│
▼
[UniRig Merge (rigged model)]
│
▼
[Blender VRM Export (CLI)]
│
▼
[Save VRM → ~/comfyui-3d/exports/]
```
Key TRELLIS parameters exposed:
- `sparse_structure_sampler_params.steps`: 12 (default)
- `sparse_structure_sampler_params.cfg_strength`: 7.5
- `slat_sampler_params.steps`: 12
- `slat_sampler_params.cfg_strength`: 3.0
- `simplify`: 0.95 (triangle reduction ratio)
- `texture_size`: 1024
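The same workflow can also be queued headlessly, since ComfyUI accepts workflow graphs over HTTP. A minimal sketch posting an API-format workflow export to the `/prompt` endpoint (the file name is an assumption):

```python
"""Sketch: queue an exported ComfyUI workflow through the HTTP API.

Assumptions: ComfyUI listens on its default port; the workflow JSON was
exported in ComfyUI's API format; the file name is illustrative."""
import json
import urllib.request
import uuid

COMFYUI = "http://127.0.0.1:8188"

def build_request(workflow: dict, client_id: str) -> dict:
    # ComfyUI's /prompt endpoint expects the API-format node graph
    # under the "prompt" key.
    return {"prompt": workflow, "client_id": client_id}

def queue_workflow(path: str) -> bytes:
    # Called by a batch driver; not invoked here.
    with open(path) as f:
        workflow = json.load(f)
    body = json.dumps(build_request(workflow, uuid.uuid4().hex)).encode()
    req = urllib.request.Request(
        f"{COMFYUI}/prompt", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # response JSON includes the queued prompt_id
```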
### 4. MLflow Experiment Tracking

The desktop logs directly to the cluster's MLflow service over HTTP. Set `MLFLOW_TRACKING_URI` in the ComfyUI environment or in a post-generation logging script:

```bash
export MLFLOW_TRACKING_URI=http://<mlflow-service>:5000
```
Each generation run logs to a dedicated MLflow experiment:

| What | MLflow Concept | Content |
|------|---------------|---------|
| Reference image | Artifact | `reference.png` |
| TRELLIS parameters | Params | seed, cfg_strength, steps, simplify, texture_size |
| UniRig parameters | Params | skeleton_seed |
| Raw mesh | Artifact | `{name}_raw.glb` (pre-rigging) |
| Rigged model | Artifact | `{name}_rigged.glb` (post-rigging) |
| Final VRM | Artifact | `{name}.vrm` |
| Mesh quality | Metrics | vertex_count, face_count, texture_resolution |
| Rig quality | Metrics | bone_count, skinning_weight_coverage |
| Pipeline duration | Metrics | trellis_time_s, unirig_time_s, total_time_s |
### 5. VRM Export Script (Blender CLI)

```python
#!/usr/bin/env python3
"""vrm_export.py — Headless Blender script for GLB→VRM conversion."""
import bpy
import sys

argv = sys.argv[sys.argv.index("--") + 1:]
input_glb = argv[0]
output_vrm = argv[1]
avatar_name = argv[2] if len(argv) > 2 else "Generated Avatar"

# Clear scene
bpy.ops.wm.read_factory_settings(use_empty=True)

# Import rigged GLB
bpy.ops.import_scene.gltf(filepath=input_glb)

# Select armature
armature = next(obj for obj in bpy.data.objects if obj.type == 'ARMATURE')
bpy.context.view_layer.objects.active = armature

# Configure VRM metadata
armature["vrm_addon_extension"] = {
"spec_version": "1.0",
"vrm0": {
"meta": {
"title": avatar_name,
"author": "DaviesTechLabs Pipeline",
"allowedUserName": "Everyone",
}
}
}

# Export VRM
bpy.ops.export_scene.vrm(filepath=output_vrm)
print(f"Exported VRM: {output_vrm}")
```
Invoked via:
```bash
blender --background --python vrm_export.py -- input.glb output.vrm "Avatar Name"
```
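In the pipeline driver, that invocation is typically wrapped with `subprocess`; a minimal sketch (script and file names are illustrative):

```python
"""Sketch: wrap the headless Blender invocation in a pipeline driver.
File and script names are illustrative."""
import subprocess

def build_export_cmd(input_glb, output_vrm, name):
    # Everything after "--" is forwarded to vrm_export.py as argv.
    return [
        "blender", "--background", "--python", "vrm_export.py",
        "--", input_glb, output_vrm, name,
    ]

def export_vrm(input_glb, output_vrm, name):
    # check=True raises if Blender exits non-zero.
    subprocess.run(build_export_cmd(input_glb, output_vrm, name), check=True)
```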
### 6. Asset Promotion (Reuses ADR-0062 Architecture)

The VRM serving architecture from ADR-0062 is preserved unchanged:

| Stage | Action |
|-------|--------|
| **Generate** | Automated pipeline: image → TRELLIS → UniRig → VRM |
| **Promote** | `rclone copy ~/comfyui-3d/exports/{name}.vrm gravenhollow:avatar-models/` |
| **Register** | Add model path to `AllowedAvatarModels` in companions-frontend Go + JS allowlists |
| **Deploy** | Flux rolls out config; model already on NFS PVC — no image rebuild |
| **CDN** | Cloudflare Tunnel → RustFS → CDN cache at 300+ edge PoPs |
## Model Requirements and VRAM Budget

| Component | Model Size | VRAM Required | Notes |
|-----------|-----------|---------------|-------|
| TRELLIS image-large | 1.2B params | ~10 GB (fp16) | Image-to-3D, best quality |
| TRELLIS text-xlarge | 2.0B params | ~14 GB (fp16) | Text-to-3D, optional |
| UniRig skeleton | ~350M params | ~4 GB | Autoregressive skeleton prediction |
| UniRig skinning | ~350M params | ~4 GB | Bone-point cross-attention |
| Blender CLI | N/A | CPU only | Headless VRM export |

**RTX 4070 budget (12 GB):** Models are loaded sequentially (not concurrently) — TRELLIS runs first, output is saved to disk, then UniRig loads for rigging. Peak VRAM usage is ~10 GB during TRELLIS inference. The desktop's 64 GB system RAM provides ample buffer for model loading and mesh processing.
## Security Considerations

* **Ray GCS port exposure**: The Ray head's port 6379 must be reachable from the desktop. Use a NodePort with network policy restricting source IPs to the desktop's address, or use a WireGuard/Tailscale tunnel.
* **No cluster credentials on desktop**: The desktop runs Ray worker processes and ComfyUI only — it has no `kubeconfig` or Kubernetes API access. Generation is triggered locally via ComfyUI's UI or API, not from the cluster.
* **Model provenance**: TRELLIS and UniRig checkpoints are downloaded from Hugging Face (Microsoft and VAST-AI orgs respectively). Pin checkpoint hashes in the setup script.
* **ComfyUI network**: ComfyUI's web UI (port 8188) should be bound to localhost only when not in use. It is not exposed to the cluster.
* **rclone credentials**: gravenhollow RustFS write credentials stored in `~/.config/rclone/rclone.conf` with `600` permissions.
* **Generated content**: Auto-generated 3D models inherit no licensing restrictions (TRELLIS and UniRig are both MIT-licensed).
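The checkpoint-pinning point can be enforced mechanically; a minimal sketch (the file name and digest are placeholders, not real checkpoint values):

```python
"""Sketch: verify downloaded checkpoints against pinned SHA-256 digests.
The file name and digest below are placeholders, not real values."""
import hashlib
from pathlib import Path

PINNED = {
    # Record the real digest when the checkpoint is first downloaded.
    "trellis-image-large.safetensors": "<sha256-recorded-at-first-download>",
}

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path, expected: str) -> bool:
    return sha256_of(path) == expected
```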
## Future Considerations

* **Kubeflow pipeline for model refinement**: When iterating on existing models (re-rigging, parameter sweeps, A/B testing generation backends), a Kubeflow pipeline can orchestrate multi-step refinement workflows with artifact lineage, caching, and retries — submitting RayJobs to the desktop worker via the existing KFP + RayJob pattern from [ADR-0058](0058-training-strategy-cpu-dgx-spark.md)
* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): When acquired, could run TRELLIS + UniRig in-cluster with dedicated GPU, eliminating desktop dependency
* **Stable3DGen / Hunyuan3D alternatives**: ComfyUI-3D-Pack supports multiple generation backends — can A/B test quality via MLflow metrics
* **VRM expression morphs**: Investigate automated viseme and expression blendshape generation for full lip-sync support without manual Blender work
* **ComfyUI API mode**: ComfyUI supports headless API-only execution (`--listen 0.0.0.0 --port 8188`) — a script or future Kubeflow pipeline can submit workflows via HTTP POST to `/prompt`
* **Text-to-3D**: Use the cluster's vLLM instance to generate a character description, then Stable Diffusion (on desktop) to create a reference image, feeding into TRELLIS — fully text-to-avatar pipeline
* **Batch generation**: Schedule overnight batch runs via CronWorkflow to generate avatar libraries from curated reference images
* **In-cluster migration**: If a 16+ GB NVIDIA GPU is added to the cluster (e.g., via DGX Spark or RTX 5070), migrate TRELLIS + UniRig to a dedicated Ray Serve deployment for always-available generation
## Links

* Supersedes: [ADR-0062](0062-blender-mcp-3d-avatar-workflow.md) — BlenderMCP for 3D avatar creation (interactive workflow)
* Updates: [ADR-0059](0059-mac-mini-ray-worker.md) — waterdeep retains Blender role for manual refinement only
* Related: [ADR-0046](0046-companions-frontend-architecture.md) — Companions frontend architecture (Three.js + VRM avatars)
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — KubeRay unified GPU backend
* Related: [ADR-0005](0005-multi-gpu-strategy.md) — Multi-GPU heterogeneous strategy
* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) — Training strategy (Kubeflow + RayJob pattern for future pipeline work)
* Related: [ADR-0047](0047-mlflow-experiment-tracking.md) — MLflow experiment tracking
* Related: [ADR-0026](0026-storage-strategy.md) — Storage strategy (gravenhollow NFS-fast, RustFS S3)
* [Microsoft TRELLIS](https://github.com/microsoft/TRELLIS) — Structured 3D Latents for Scalable 3D Generation (CVPR'25 Spotlight)
* [VAST-AI UniRig](https://github.com/VAST-AI-Research/UniRig) — One Model to Rig Them All (SIGGRAPH'25)
* [ComfyUI-3D-Pack](https://github.com/MrForExample/ComfyUI-3D-Pack) — Extensive 3D node suite for ComfyUI
* [VRM Add-on for Blender](https://vrm-addon-for-blender.info/en/)
* [@pixiv/three-vrm](https://github.com/pixiv/three-vrm) (runtime loader in companions-frontend)
445
decisions/0064-waterdeep-coding-agent.md
Normal file
@@ -0,0 +1,445 @@
# waterdeep (Mac Mini M4 Pro) as Dedicated Coding Agent with Fine-Tuned Model

* Status: proposed
* Date: 2026-02-26
* Deciders: Billy
* Technical Story: Repurpose waterdeep as a dedicated local coding agent serving a fine-tuned code-completion model for OpenCode, Copilot Chat, and other AI coding tools, with a pipeline for continually tuning the model on the homelab codebase

## Context and Problem Statement
**waterdeep** is a Mac Mini M4 Pro with 48 GB of unified memory ([ADR-0059](0059-mac-mini-ray-worker.md)). Its current role as a 3D avatar creation workstation ([ADR-0062](0062-blender-mcp-3d-avatar-workflow.md)) is being superseded by the automated ComfyUI pipeline ([ADR-0063](0063-comfyui-3d-avatar-pipeline.md)), which handles avatar generation on a personal desktop as an on-demand Ray worker. This frees waterdeep for a higher-value use case.
GitHub Copilot and cloud-hosted coding assistants work well for general code, but they have no knowledge of DaviesTechLabs-specific patterns: the handler-base module API, NATS protobuf message conventions, Kubeflow pipeline structure, Ray Serve deployment patterns, Flux/Kustomize layout, or the Go handler lifecycle used across chat-handler, voice-assistant, pipeline-bridge, stt-module, and tts-module. A model fine-tuned on the homelab codebase would produce completions that follow project conventions out of the box.

With 48 GB of unified memory and no other workloads, waterdeep can serve **Qwen 2.5 Coder 32B Instruct** at Q8_0 quantisation (~34 GB) via MLX with ample headroom for KV cache, leaving the machine responsive for the inference server and macOS overhead. This is the largest purpose-built coding model that fits at high quantisation on this hardware, and it consistently outperforms general-purpose 70B models at Q4 on coding benchmarks.

How should we configure waterdeep as a dedicated coding agent and build a pipeline for fine-tuning the model on our codebase?

## Decision Drivers

* waterdeep's 48 GB unified memory is fully available — no competing workloads after ComfyUI pipeline takeover
* Qwen 2.5 Coder 32B Instruct is the highest-quality open-source coding model that fits at Q8_0 (~34 GB weights + ~10 GB KV cache headroom)
* MLX on Apple Silicon provides native Metal-accelerated inference with no framework overhead — purpose-built for M-series chips
* OpenCode and VS Code Copilot Chat both support OpenAI-compatible API endpoints — a local server is a drop-in replacement
* The homelab codebase has strong conventions (handler-base, protobuf messages, Kubeflow pipelines, Ray Serve apps, Flux GitOps) that a general model doesn't know
* Existing training infrastructure ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)) provides Kubeflow Pipelines + MLflow + S3 data flow for fine-tuning orchestration
* LoRA adapters are small (~50–200 MB) and can be merged into the base model or hot-swapped in mlx-lm-server
* The cluster's CPU training capacity (126 cores, 378 GB RAM across 14 nodes) can prepare training datasets; waterdeep itself can run the LoRA fine-tune on its Metal GPU
## Considered Options

1. **Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server on waterdeep** — fine-tuned with LoRA on the homelab codebase using MLX
2. **Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp on waterdeep** — larger general-purpose model at aggressive quantisation
3. **DeepSeek Coder V2 Lite 16B via MLX on waterdeep** — smaller coding model, lower resource usage
4. **Keep using cloud Copilot only** — no local model, no fine-tuning

## Decision Outcome

Chosen option: **Option 1 — Qwen 2.5 Coder 32B Instruct (Q8_0) via mlx-lm-server**, because it is the best-in-class open-source coding model at a quantisation level that preserves near-full quality, fits comfortably within the 48 GB memory budget with room for KV cache, and MLX provides the optimal inference stack for Apple Silicon. Fine-tuning with LoRA on the homelab codebase will specialise the model to project conventions.

### Positive Consequences

* Purpose-built coding model — Qwen 2.5 Coder 32B tops open-source coding benchmarks (HumanEval, MBPP, BigCodeBench)
* Q8_0 quantisation preserves >99% of full-precision quality — minimal degradation vs Q4
* ~34 GB model weights + ~10 GB KV cache headroom = comfortable fit in 48 GB unified memory
* MLX inference leverages Metal GPU for token generation — fast enough for interactive coding assistance
* OpenAI-compatible API via mlx-lm-server — works with OpenCode, VS Code Copilot Chat (custom endpoint), Continue.dev, and any OpenAI SDK client
* Fine-tuned LoRA adapter teaches project-specific patterns: handler-base API, NATS message conventions, Kubeflow pipeline structure, Flux layout
* LoRA fine-tuning runs directly on waterdeep using mlx-lm — no cluster resources needed for training
* Adapter files are small (~50–200 MB) — easy to version in Gitea and track in MLflow
* Fully offline — no cloud dependency, no data leaves the network
* Frees Copilot quota for non-coding tasks — local model handles bulk code completion
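Since the server is OpenAI-compatible, any HTTP client can exercise it without an SDK; a minimal stdlib sketch (the endpoint mirrors the architecture diagram, while the served model name is an assumption):

```python
"""Sketch: call the local model through its OpenAI-compatible API.
The endpoint and served model name are assumptions."""
import json
import urllib.request

ENDPOINT = "http://waterdeep.lab.daviestechlabs.io:8080/v1"

def build_chat_request(prompt, model="qwen2.5-coder-32b-instruct"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.2,  # low temperature suits code completion
    }

def complete(prompt: str) -> str:
    # Called by editor integrations or scripts; not invoked here.
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{ENDPOINT}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```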
|
||||
|
||||
### Negative Consequences
|
||||
|
||||
* waterdeep is dedicated to this role — cannot simultaneously serve other workloads (Blender, etc.)
|
||||
* Model updates require manual download and conversion to MLX format
|
||||
* LoRA fine-tuning quality depends on training data curation — garbage in, garbage out
|
||||
* 32B model is slower than cloud Copilot for very long completions — acceptable for interactive use
|
||||
* Single point of failure — if waterdeep is down, fall back to cloud Copilot
|
||||
|

## Pros and Cons of the Options

### Option 1: Qwen 2.5 Coder 32B Instruct (Q8_0) via MLX

* Good, because purpose-built for code — trained on 5.5T tokens of code data
* Good, because 32B at Q8_0 (~34 GB) fits in 48 GB with KV cache headroom
* Good, because Q8_0 preserves near-full quality (vs Q4 which drops noticeably on coding tasks)
* Good, because MLX is Apple's native framework — zero-copy unified memory, Metal GPU kernels
* Good, because mlx-lm supports LoRA fine-tuning natively — train and serve on the same machine
* Good, because OpenAI-compatible API (mlx-lm-server) — drop-in for any coding tool
* Bad, because 32B generates ~15–25 tokens/sec on M4 Pro — adequate but not instant for long outputs
* Bad, because MLX model format requires conversion from HuggingFace (one-time, scripted)

### Option 2: Llama 3.1 70B Instruct (Q4_K_M) via llama.cpp

* Good, because 70B is a larger, more capable general model
* Good, because llama.cpp is mature and well-supported on macOS
* Bad, because Q4_K_M quantisation loses meaningful quality — especially on code tasks where precision matters
* Bad, because ~42 GB weights leaves only ~6 GB for KV cache — tight, risks OOM on long contexts
* Bad, because general-purpose model — not trained specifically for code, underperforms Qwen 2.5 Coder 32B on coding benchmarks despite being 2× larger
* Bad, because slower token generation (~8–12 tok/s) due to larger model size
* Bad, because llama.cpp doesn't natively support LoRA fine-tuning — need a separate training framework

### Option 3: DeepSeek Coder V2 Lite 16B via MLX

* Good, because smaller model — faster inference (~30–40 tok/s), lighter memory footprint
* Good, because still a capable coding model
* Bad, because significantly less capable than Qwen 2.5 Coder 32B on benchmarks
* Bad, because leaves 30+ GB of unified memory unused — not maximising the hardware
* Bad, because fewer parameters mean less capacity to absorb fine-tuning knowledge

### Option 4: Cloud Copilot only

* Good, because zero local infrastructure to maintain
* Good, because always up-to-date with latest model improvements
* Bad, because no knowledge of homelab-specific conventions — completions require heavy editing
* Bad, because cloud latency for every completion
* Bad, because data (code context) leaves the network
* Bad, because wastes waterdeep's 48 GB of unified memory sitting idle
|
||||
## Architecture
|
||||
|
||||
### Inference Server
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ waterdeep (Mac Mini M4 Pro · 48 GB unified · Metal GPU · dedicated) │
|
||||
│ │
|
||||
│ ┌────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ mlx-lm-server (launchd-managed) │ │
|
||||
│ │ │ │
|
||||
│ │ Model: Qwen2.5-Coder-32B-Instruct (Q8_0, MLX format) │ │
|
||||
│ │ LoRA: ~/.mlx-models/adapters/homelab-coder/latest/ │ │
|
||||
│ │ │ │
|
||||
│ │ Endpoint: http://waterdeep.lab.daviestechlabs.io:8080/v1 │ │
|
||||
│ │ ├── /v1/completions (code completion, FIM) │ │
|
||||
│ │ ├── /v1/chat/completions (chat / instruct) │ │
|
||||
│ │ └── /v1/models (model listing) │ │
|
||||
│ │ │ │
|
||||
│ │ Memory: ~34 GB model + ~10 GB KV cache = ~44 GB │ │
|
||||
│ └────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌─────────────────────────┐ ┌──────────────────────────────────────┐ │
|
||||
│ │ macOS overhead ~3 GB │ │ Training (on-demand, same GPU) │ │
|
||||
│ │ (kernel, WindowServer, │ │ mlx-lm LoRA fine-tune │ │
|
||||
│ │ mDNSResponder, etc.) │ │ (server stopped during training) │ │
|
||||
│ └─────────────────────────┘ └──────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│ HTTP :8080 (OpenAI-compatible API)
|
||||
│
|
||||
┌────┴──────────────────────────────────────────────────────┐
|
||||
│ │
|
||||
▼ ▼
|
||||
┌─────────────────────────────┐ ┌─────────────────────────────────────┐
|
||||
│ VS Code (any machine) │ │ OpenCode (terminal, any machine) │
|
||||
│ │ │ │
|
||||
│ Copilot Chat / Continue.dev │ │ OPENCODE_MODEL_PROVIDER=openai │
|
||||
│ Custom endpoint → │ │ OPENAI_API_BASE= │
|
||||
│ waterdeep:8080/v1 │ │ http://waterdeep:8080/v1 │
|
||||
└─────────────────────────────┘ └─────────────────────────────────────┘
|
||||
```
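
The endpoint in the diagram is plain OpenAI-style HTTP, so any client can probe it without an SDK. A minimal Python sketch (the `chat_payload` and `smoke_test` helper names are hypothetical, not part of mlx-lm; only the URL and request schema come from the diagram above):

```python
import json
import urllib.request

BASE_URL = "http://waterdeep.lab.daviestechlabs.io:8080/v1"


def chat_payload(prompt: str, model: str = "qwen2.5-coder-32b",
                 max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def smoke_test(base_url: str = BASE_URL) -> str:
    """Send one completion request; raises if the server is unreachable."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(chat_payload("Write a hello-world Go program")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `smoke_test()` from a cron job or a deploy script gives a cheap liveness check in front of both VS Code and OpenCode clients.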

### Fine-Tuning Pipeline

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                     Fine-Tuning Pipeline (Kubeflow)                         │
│                                                                             │
│  Trigger: weekly cron or manual (after significant codebase changes)        │
│                                                                             │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────────────┐     │
│  │ 1. Clone repos│   │ 2. Build training │   │ 3. Upload dataset to   │     │
│  │    from Gitea │──▶│    dataset        │──▶│    S3                  │     │
│  │   (all repos) │   │   (instruction    │   │    training-data/      │     │
│  │               │   │    pairs + FIM)   │   │    code-finetune/      │     │
│  └──────────────┘    └──────────────────┘    └──────────┬─────────────┘     │
│                                                         │                   │
│  ┌──────────────────────────────────────────────────────┐│                  │
│  │ 4. Trigger LoRA fine-tune on waterdeep               ││                  │
│  │    (SSH or webhook → mlx-lm lora on Metal GPU)       │◀                  │
│  │                                                      │                   │
│  │    Base:   Qwen2.5-Coder-32B-Instruct (MLX Q8_0)     │                   │
│  │    Method: LoRA (r=16, alpha=32)                     │                   │
│  │    Data:   instruction pairs + fill-in-middle samples│                   │
│  │    Epochs: 3–5                                       │                   │
│  │    Output: adapter weights (~50–200 MB)              │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                   │
│  ┌──────────────────────▼───────────────────────────────┐                   │
│  │ 5. Evaluate adapter                                  │                   │
│  │    • HumanEval pass@1 (baseline vs fine-tuned)       │                   │
│  │    • Project-specific eval (handler-base patterns,   │                   │
│  │      Kubeflow pipeline templates, Flux manifests)    │                   │
│  └──────────────────────┬───────────────────────────────┘                   │
│                         │                                                   │
│  ┌──────────────────────▼───┐  ┌────────────────────────────────────────┐   │
│  │ 6. Push adapter to Gitea │  │ 7. Log metrics to MLflow               │   │
│  │    code-lora-adapters    │  │    experiment: waterdeep-coder-finetune│   │
│  │    repo (versioned)      │  │    metrics: eval_loss, humaneval,      │   │
│  └──────────────────────────┘  │    project_specific_score              │   │
│                                └────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ 8. Deploy adapter on waterdeep                                      │    │
│  │    • Pull latest adapter from Gitea                                 │    │
│  │    • Restart mlx-lm-server with --adapter-path pointing to new ver  │    │
│  │    • Smoke test: send test completion requests                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Training Data Preparation

The training dataset is built from all DaviesTechLabs repositories:

| Source | Format | Purpose |
|--------|--------|---------|
| Go handlers (chat-handler, voice-assistant, etc.) | Instruction pairs | Teach handler-base API patterns, NATS message handling, protobuf encoding |
| Kubeflow pipelines (kubeflow/*.py) | Instruction pairs | Teach pipeline structure, KFP component patterns, S3 data flow |
| Ray Serve apps (ray-serve/) | Instruction pairs | Teach Ray Serve deployment, vLLM config, model serving patterns |
| Flux manifests (homelab-k8s2/) | Instruction pairs | Teach HelmRelease, Kustomization, namespace layout |
| Argo workflows (argo/*.yaml) | Instruction pairs | Teach WorkflowTemplate patterns, NATS triggers |
| ADRs (homelab-design/decisions/) | Instruction pairs | Teach architecture rationale and decision format |
| All source files | Fill-in-middle (FIM) | Teach code completion with project-specific context |

**Instruction pair example (Go handler):**

```json
{
  "instruction": "Create a new NATS handler module that bridges to an external gRPC service, following the handler-base pattern used in chat-handler and voice-assistant.",
  "output": "package main\n\nimport (\n\t\"context\"\n\t\"os\"\n\t\"os/signal\"\n\t\"syscall\"\n\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/config\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/handler\"\n\t\"git.daviestechlabs.io/daviestechlabs/handler-base/health\"\n\t..."
}
```

**Fill-in-middle example:**

```json
{
  "prefix": "func (h *Handler) HandleMessage(ctx context.Context, msg *messages.UserMessage) (*messages.AssistantMessage, error) {\n\t",
  "suffix": "\n\treturn response, nil\n}",
  "middle": "response, err := h.client.Complete(ctx, msg.Content)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"completion failed: %w\", err)\n\t}"
}
```
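
FIM samples of this shape can be generated mechanically from any source file. A minimal sketch, assuming a simple random-span policy (the helper name and the 1–5 line middle span are illustrative assumptions, not the actual pipeline code):

```python
import random


def make_fim_sample(source: str, rng: random.Random) -> dict:
    """Split one source file into prefix/middle/suffix for FIM training.

    Picks a short contiguous run of lines as the 'middle', mirroring the
    JSONL shape shown above.
    """
    lines = source.splitlines(keepends=True)
    if len(lines) < 4:
        raise ValueError("file too short for a FIM split")
    # Middle starts somewhere strictly inside the file...
    start = rng.randrange(1, len(lines) - 1)
    # ...and spans 1-5 lines, leaving at least one line of suffix.
    end = min(len(lines) - 1, start + rng.randint(1, 5))
    return {
        "prefix": "".join(lines[:start]),
        "middle": "".join(lines[start:end]),
        "suffix": "".join(lines[end:]),
    }
```

A useful property to test: concatenating the three fields always reproduces the original file byte-for-byte.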

## Implementation Plan

### 1. Model Setup

```bash
# Install MLX and mlx-lm via uv (per ADR-0012)
uv tool install mlx-lm

# Download and convert Qwen 2.5 Coder 32B Instruct to MLX Q8_0 format
mlx_lm.convert \
  --hf-path Qwen/Qwen2.5-Coder-32B-Instruct \
  --mlx-path ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --quantize \
  --q-bits 8

# Verify model loads and generates
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --prompt "def fibonacci(n: int) -> int:"
```

### 2. Inference Server (launchd)

```bash
# Start the server manually first to verify
mlx_lm.server \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/latest \
  --host 0.0.0.0 \
  --port 8080

# Verify OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-32b",
    "messages": [{"role": "user", "content": "Write a Go handler using handler-base that processes NATS messages"}],
    "max_tokens": 512
  }'
```

**launchd plist** (`~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.daviestechlabs.mlx-coder</string>
  <key>ProgramArguments</key>
  <array>
    <string>/Users/billy/.local/bin/mlx_lm.server</string>
    <string>--model</string>
    <string>/Users/billy/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8</string>
    <string>--adapter-path</string>
    <string>/Users/billy/.mlx-models/adapters/homelab-coder/latest</string>
    <string>--host</string>
    <string>0.0.0.0</string>
    <string>--port</string>
    <string>8080</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/Users/billy/.mlx-models/logs/server.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/billy/.mlx-models/logs/server.err</string>
</dict>
</plist>
```

```bash
# Load the service
launchctl load ~/Library/LaunchAgents/io.daviestechlabs.mlx-coder.plist

# Verify it's running
launchctl list | grep mlx-coder
curl http://waterdeep.lab.daviestechlabs.io:8080/v1/models
```

### 3. Client Configuration

**OpenCode** (`~/.config/opencode/config.json` on any dev machine):

```json
{
  "provider": "openai",
  "model": "qwen2.5-coder-32b",
  "baseURL": "http://waterdeep.lab.daviestechlabs.io:8080/v1"
}
```

**VS Code** (settings.json — Continue.dev extension):

```json
{
  "continue.models": [
    {
      "title": "waterdeep-coder",
      "provider": "openai",
      "model": "qwen2.5-coder-32b",
      "apiBase": "http://waterdeep.lab.daviestechlabs.io:8080/v1",
      "apiKey": "not-needed"
    }
  ]
}
```

### 4. Fine-Tuning on waterdeep (MLX LoRA)

```bash
# Prepare training data (run on cluster via Kubeflow, or locally)
# Output: train.jsonl and valid.jsonl in chat/instruction format

# Fine-tune with LoRA using mlx-lm
mlx_lm.lora \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --train \
  --data ~/.mlx-models/training-data/homelab-coder/ \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --lora-layers 16 \
  --batch-size 1 \
  --iters 1000 \
  --learning-rate 1e-5 \
  --val-batches 25 \
  --save-every 100

# Evaluate the adapter
mlx_lm.generate \
  --model ~/.mlx-models/Qwen2.5-Coder-32B-Instruct-Q8 \
  --adapter-path ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d)/ \
  --prompt "Create a new Go NATS handler using handler-base that..."

# Update the 'latest' symlink
ln -sfn ~/.mlx-models/adapters/homelab-coder/$(date +%Y%m%d) \
  ~/.mlx-models/adapters/homelab-coder/latest

# Restart the server to pick up new adapter
launchctl kickstart -k gui/$(id -u)/io.daviestechlabs.mlx-coder
```
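
The `latest` symlink convention above implies picking the newest date-named adapter directory; a deploy script could resolve it with a helper along these lines (hypothetical, not part of mlx-lm; assumes the `YYYYMMDD` naming used in the commands above):

```python
from pathlib import Path
from typing import Optional


def latest_adapter(root: Path) -> Optional[Path]:
    """Return the newest YYYYMMDD-named adapter directory under root.

    Lexicographic order equals chronological order for zero-padded dates,
    so a plain sort suffices. Returns None if no dated adapters exist yet.
    """
    dated = sorted(
        d for d in root.iterdir()
        if d.is_dir() and d.name.isdigit() and len(d.name) == 8
    )
    return dated[-1] if dated else None
```

This skips the `latest` symlink itself (not an eight-digit name), so it can safely be pointed at the same directory the symlink lives in.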

### 5. Training Data Pipeline (Kubeflow)

A new `code_finetune_pipeline.py` orchestrates dataset preparation on the cluster:

```
code_finetune_pipeline.yaml
│
├── 1. clone_repos        Clone all DaviesTechLabs repos from Gitea
├── 2. extract_patterns   Parse Go, Python, YAML files into instruction pairs
├── 3. generate_fim       Create fill-in-middle samples from source files
├── 4. deduplicate        Remove near-duplicate samples (MinHash)
├── 5. format_dataset     Convert to mlx-lm JSONL format (train + validation split)
├── 6. upload_to_s3       Push dataset to s3://training-data/code-finetune/{run_id}/
└── 7. log_to_mlflow      Log dataset stats (num_samples, token_count, repo_coverage)
```
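
Step 4 above names MinHash; as an illustration of the underlying idea, here is a greedy near-duplicate filter using exact Jaccard similarity over character shingles (hypothetical helper names; the O(n²) pairwise loop is precisely what MinHash + LSH replaces at scale):

```python
def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingle set used for similarity estimation."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def deduplicate(samples: list, threshold: float = 0.8) -> list:
    """Greedily keep a sample only if it is not too similar to any
    already-kept sample. Exact pairwise Jaccard, so O(n^2): fine for a
    sketch, replaced by MinHash signatures + LSH buckets in the pipeline.
    """
    kept, kept_shingles = [], []
    for s in samples:
        sh = shingles(s)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(s)
            kept_shingles.append(sh)
    return kept
```

The threshold is a tuning knob: too low and legitimate boilerplate-heavy Go handlers get dropped, too high and the model over-trains on repeated scaffolding.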

The actual LoRA fine-tune runs on waterdeep (not the cluster) because:

- mlx-lm LoRA leverages the M4 Pro's Metal GPU — significantly faster than CPU training
- The model is already loaded on waterdeep — no need to transfer 34 GB to/from the cluster
- Training a 32B model with LoRA requires ~40 GB — only waterdeep and khelben have enough memory

### 6. Memory Budget

| Component | Memory |
|-----------|--------|
| macOS + system services | ~3 GB |
| Qwen 2.5 Coder 32B (Q8_0 weights) | ~34 GB |
| KV cache (8192 context) | ~6 GB |
| mlx-lm-server overhead | ~1 GB |
| **Total (inference)** | **~44 GB** |
| **Headroom** | **~4 GB** |

During LoRA fine-tuning (server stopped):

| Component | Memory |
|-----------|--------|
| macOS + system services | ~3 GB |
| Model weights (frozen, Q8_0) | ~34 GB |
| LoRA adapter gradients + optimizer | ~4 GB |
| Training batch + activations | ~5 GB |
| **Total (training)** | **~46 GB** |
| **Headroom** | **~2 GB** |

Both workloads fit within the 48 GB budget. Inference and training are mutually exclusive — the server is stopped during fine-tuning runs to reclaim KV cache memory for training.
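
As a quick arithmetic check, the two budgets sum as the tables state (values in GB, copied from the tables above; all are estimates, not measurements):

```python
# 48 GB unified memory on the M4 Pro is the hard ceiling.
CAPACITY = 48

inference = {
    "macOS + system services": 3,
    "model weights (Q8_0)": 34,
    "KV cache (8192 ctx)": 6,
    "mlx-lm-server overhead": 1,
}
training = {
    "macOS + system services": 3,
    "model weights (frozen)": 34,
    "LoRA gradients + optimizer": 4,
    "batch + activations": 5,
}

for name, budget in (("inference", inference), ("training", training)):
    total = sum(budget.values())
    print(f"{name}: {total} GB used, {CAPACITY - total} GB headroom")
```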

## Security Considerations

* mlx-lm-server has no authentication — bind to LAN only; waterdeep's firewall blocks external access
* No code leaves the network — all inference and training is local
* Training data is sourced exclusively from Gitea (internal repos) — no external data contamination
* Adapter weights are versioned in Gitea — auditable lineage from training data to deployed model
* Consider adding a simple API key check via a reverse proxy (Caddy/nginx) if the LAN is not fully trusted
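
The key check in that last bullet is small; a sketch of the comparison a Caddy/nginx sidecar (or a tiny shim in front of the server) would perform — `authorized` is a hypothetical helper, and `hmac.compare_digest` is used to avoid timing leaks:

```python
import hmac


def authorized(headers: dict, expected_key: str) -> bool:
    """Constant-time bearer-token check for requests hitting the proxy.

    Rejects missing or malformed Authorization headers; comparison time
    does not depend on how many characters match.
    """
    auth = headers.get("Authorization", "")
    supplied = auth.removeprefix("Bearer ").strip()
    return hmac.compare_digest(supplied, expected_key)
```

Clients like Continue.dev already send an `apiKey` value, so the same config works whether or not the proxy enforces it.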

## Future Considerations

* **DGX Spark** ([ADR-0058](0058-training-strategy-cpu-dgx-spark.md)): If acquired, DGX Spark could fine-tune larger coding models (70B+) or run full fine-tunes instead of LoRA. waterdeep would remain the serving endpoint unless the DGX Spark also serves inference.
* **Adapter hot-swap**: mlx-lm supports loading adapters at request time — could serve multiple fine-tuned adapters (e.g., Go-specific, Python-specific, YAML-specific) from a single base model
* **RAG augmentation**: Combine the fine-tuned model with a RAG pipeline that retrieves relevant code snippets from Milvus ([ADR-0008](0008-use-milvus-for-vectors.md)) for even better context-aware completions
* **Continuous fine-tuning**: Trigger the pipeline automatically on Gitea push events via NATS — the model stays current with codebase changes
* **Evaluation suite**: Build a project-specific eval set (handler-base patterns, pipeline templates, Flux manifests) to measure fine-tuning quality beyond generic benchmarks
* **Newer models**: As new coding models are released (Qwen 3 Coder, DeepSeek Coder V3, etc.), re-evaluate which model maximises quality within the 48 GB budget

## Links

* Updates: [ADR-0059](0059-mac-mini-ray-worker.md) — waterdeep repurposed from 3D avatar workstation to dedicated coding agent
* Related: [ADR-0058](0058-training-strategy-cpu-dgx-spark.md) — Training strategy (distributed CPU + DGX Spark path)
* Related: [ADR-0047](0047-mlflow-experiment-tracking.md) — MLflow experiment tracking
* Related: [ADR-0054](0054-kubeflow-pipeline-cicd.md) — Kubeflow Pipeline CI/CD
* Related: [ADR-0012](0012-use-uv-for-python-development.md) — uv for Python development
* Related: [ADR-0037](0037-node-naming-conventions.md) — Node naming conventions (waterdeep)
* Related: [ADR-0060](0060-internal-pki-vault.md) — Internal PKI (TLS for waterdeep endpoint)
* [Qwen 2.5 Coder](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) — Model card
* [MLX LM](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm) — Apple MLX language model framework
* [OpenCode](https://opencode.ai) — Terminal-based AI coding assistant
* [Continue.dev](https://continue.dev) — VS Code AI coding extension with custom model support