
Custom Trained Voice Support in TTS Module

  • Status: accepted
  • Date: 2026-02-13
  • Deciders: Billy Davies
  • Technical Story: Enable the TTS module to use custom voices generated by the coqui-voice-training Argo workflow

Context and Problem Statement

The coqui-voice-training Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at /models/tts/custom/{voice-name}/. The TTS streaming module currently only supports the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (speaker_wav_b64). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.

How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?

Decision Drivers

  • Voices trained by the Argo pipeline should be usable immediately without service restarts
  • Callers should be able to request a trained voice by name (e.g. "speaker": "my-voice")
  • Existing ad-hoc voice cloning via speaker_wav_b64 must continue to work
  • Other services need a way to enumerate available voices
  • No external database or registry should be required — the file system is the source of truth

Considered Options

  1. File-system VoiceRegistry with periodic refresh — scan the NFS model store on startup and periodically
  2. Database-backed voice catalogue — store voice metadata in PostgreSQL
  3. NATS KV bucket for voice metadata — store voice info in NATS Key-Value store

Decision Outcome

Chosen option: Option 1 — File-system VoiceRegistry with periodic refresh, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the coqui-voice-training workflow.

Positive Consequences

  • Zero additional infrastructure — reads directly from the NFS volume
  • Single source of truth — the trained model directory is the registry
  • Newly trained voices appear automatically within the refresh interval
  • On-demand refresh available via NATS for immediate availability
  • Fully backward compatible — existing speaker_wav_b64 cloning unchanged

Negative Consequences

  • Polling-based discovery adds slight latency (mitigated by configurable interval and on-demand refresh)
  • No metadata beyond what model_info.json contains (sufficient for current needs)
  • Requires NFS volume mounted at VOICE_MODEL_STORE path in the TTS pod

Implementation

VoiceRegistry

A VoiceRegistry class scans VOICE_MODEL_STORE (default /models/tts/custom) for voice directories. Each directory must contain:

| File | Required | Description |
| --- | --- | --- |
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
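
A `model_info.json` exported by the training workflow might look like the following (the keys come from the table above; the values are illustrative, not taken from an actual export):

```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-13T15:31:33-05:00"
}
```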

The registry is refreshed:

  • On service startup
  • Periodically every VOICE_REGISTRY_REFRESH_SECONDS (default 300s)
  • On demand via ai.voice.tts.voices.refresh NATS subject

Synthesis Routing

When a TTS request specifies a speaker, the service checks the registry first:

Request with speaker="my-voice"
  ├─ Found in VoiceRegistry → send model_path + config_path to XTTS
  ├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
  └─ Not found + no speaker_wav_b64 → use default speaker
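
The routing above can be sketched as a small dispatch function (names and return shape are illustrative assumptions, not the module's actual code):

```python
def resolve_speaker(registry, speaker=None, speaker_wav_b64=None):
    """Return a (mode, payload) pair describing how to synthesize.

    Mirrors the routing table: a registry hit wins, then ad-hoc
    cloning via reference audio, then the default XTTS speaker.
    """
    if speaker:
        # Assumed VoiceRegistry.get() -> CustomVoice | None
        voice = registry.get(speaker)
        if voice is not None:
            return ("custom", {"model_path": voice.model_path,
                               "config_path": voice.config_path})
    if speaker_wav_b64:
        return ("clone", {"speaker_wav_b64": speaker_wav_b64})
    return ("default", {})
```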

New NATS Subjects

| Subject | Pattern | Description |
| --- | --- | --- |
| `ai.voice.tts.voices.list` | Request-reply | List the default speaker plus all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger an immediate registry rescan |
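
A handler for `ai.voice.tts.voices.list` might assemble its reply like this. The reply schema (a `voices` array with the default speaker first) is an assumption for illustration, not the module's actual contract:

```python
import json


def build_voices_list_reply(registry, default_speaker="default"):
    """Build a JSON reply payload for ai.voice.tts.voices.list.

    The default speaker is listed first, followed by every custom
    voice currently known to the registry.
    """
    voices = [{"name": default_speaker, "type": "default"}]
    for name in registry.list_names():  # assumed VoiceRegistry.list_names()
        voices.append({"name": name, "type": "custom"})
    return json.dumps({"voices": voices}).encode()
```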

New Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
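
Reading these at startup could look like the following sketch (the variable names and defaults come from the table above; how the module actually loads configuration is not specified here):

```python
import os
from pathlib import Path

# Defaults mirror the table above.
VOICE_MODEL_STORE = Path(os.environ.get("VOICE_MODEL_STORE", "/models/tts/custom"))
VOICE_REGISTRY_REFRESH_SECONDS = int(
    os.environ.get("VOICE_REGISTRY_REFRESH_SECONDS", "300")
)
```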

Integration with Training Pipeline

coqui-voice-training Argo Workflow
  └─ export-trained-model step
       └─ Writes to /models/tts/custom/{voice-name}/
            ├── model.pth
            ├── config.json
            └── model_info.json

TTS Streaming Service
  └─ VoiceRegistry
       └─ Scans /models/tts/custom/
            └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)

Links

  • Related: ADR-0009 — Argo/Kubeflow workflow engines
  • Related: ADR-0011 — XTTS runs on the KubeRay GPU backend
  • Workflow: argo/coqui-voice-training.yaml
  • Module: tts-module/tts_streaming.py