# ADR-0056: Custom Trained Voice Support in TTS Module
- Status: accepted
- Date: 2026-02-13
- Deciders: Billy Davies
- Technical Story: Enable the TTS module to use custom voices generated by the `coqui-voice-training` Argo workflow
## Context and Problem Statement
The `coqui-voice-training` Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at `/models/tts/custom/{voice-name}/`. The TTS streaming module currently supports only the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (`speaker_wav_b64`). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.
How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?
## Decision Drivers
- Voices trained by the Argo pipeline should be usable immediately without service restarts
- Callers should be able to request a trained voice by name (e.g. `"speaker": "my-voice"`)
- Existing ad-hoc voice cloning via `speaker_wav_b64` must continue to work
- Other services need a way to enumerate available voices
- No external database or registry should be required — the file system is the source of truth
## Considered Options
1. File-system VoiceRegistry with periodic refresh — scan the NFS model store on startup and periodically thereafter
2. Database-backed voice catalogue — store voice metadata in PostgreSQL
3. NATS KV bucket for voice metadata — store voice info in the NATS Key-Value store
## Decision Outcome
Chosen option: Option 1 — File-system VoiceRegistry with periodic refresh, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the `coqui-voice-training` workflow.
### Positive Consequences
- Zero additional infrastructure — reads directly from the NFS volume
- Single source of truth — the trained model directory is the registry
- Newly trained voices appear automatically within the refresh interval
- On-demand refresh available via NATS for immediate availability
- Fully backward compatible — existing `speaker_wav_b64` cloning unchanged
### Negative Consequences
- Polling-based discovery adds slight latency (mitigated by configurable interval and on-demand refresh)
- No metadata beyond what `model_info.json` contains (sufficient for current needs)
- Requires NFS volume mounted at the `VOICE_MODEL_STORE` path in the TTS pod
## Implementation

### VoiceRegistry
A `VoiceRegistry` class scans `VOICE_MODEL_STORE` (default `/models/tts/custom`) for voice directories. Each directory must contain:
| File | Required | Description |
|---|---|---|
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
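For illustration, a minimal `model_info.json` carrying the documented fields might look like the following. The values are hypothetical; the authoritative schema is whatever the training workflow's export step writes:

```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-13T00:00:00Z"
}
```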
The registry is refreshed:
- On service startup
- Periodically, every `VOICE_REGISTRY_REFRESH_SECONDS` (default 300s)
- On demand via the `ai.voice.tts.voices.refresh` NATS subject
### Synthesis Routing
When a TTS request specifies a speaker, the service checks the registry first:
```
Request with speaker="my-voice"
├─ Found in VoiceRegistry → send model_path + config_path to XTTS
├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
└─ Not found + no speaker_wav_b64 → use default speaker
```
### New NATS Subjects
| Subject | Pattern | Description |
|---|---|---|
| `ai.voice.tts.voices.list` | Request-reply | List default speaker + all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger immediate registry rescan |
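As an example of the request-reply pattern, a caller using the nats-py client could fetch the voice list with `reply = await nc.request("ai.voice.tts.voices.list", b"", timeout=2.0)`. The reply payload shape assumed below (`{"voices": [...]}`) is an illustration; this ADR does not specify the wire format.

```python
import json

# Hypothetical client-side helper. With nats-py, the bytes would come from:
#   reply = await nc.request("ai.voice.tts.voices.list", b"", timeout=2.0)
#   names = parse_voice_list(reply.data)
def parse_voice_list(payload: bytes) -> list[str]:
    """Parse an assumed reply shape {"voices": ["default", ...]} into names."""
    data = json.loads(payload)
    return list(data.get("voices", []))
```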
### New Environment Variables
| Variable | Default | Description |
|---|---|---|
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
### Integration with Training Pipeline
```
coqui-voice-training Argo Workflow
└─ export-trained-model step
   └─ Writes to /models/tts/custom/{voice-name}/
       ├── model.pth
       ├── config.json
       └── model_info.json

TTS Streaming Service
└─ VoiceRegistry
   └─ Scans /models/tts/custom/
       └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)
```