
Custom Trained Voice Support in TTS Module

  • Status: accepted
  • Date: 2026-02-13
  • Deciders: Billy Davies
  • Technical Story: Enable the TTS module to use custom voices generated by the coqui-voice-training Argo workflow

Context and Problem Statement

The coqui-voice-training Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at /models/tts/custom/{voice-name}/. The TTS streaming module currently only supports the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (speaker_wav_b64). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.

How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?

Decision Drivers

  • Voices trained by the Argo pipeline should be usable immediately without service restarts
  • Callers should be able to request a trained voice by name (e.g. "speaker": "my-voice")
  • Existing ad-hoc voice cloning via speaker_wav_b64 must continue to work
  • Other services need a way to enumerate available voices
  • No external database or registry should be required — the file system is the source of truth

Considered Options

  1. File-system VoiceRegistry with periodic refresh — scan the NFS model store on startup and periodically
  2. Database-backed voice catalogue — store voice metadata in PostgreSQL
  3. NATS KV bucket for voice metadata — store voice info in NATS Key-Value store

Decision Outcome

Chosen option: Option 1 — File-system VoiceRegistry with periodic refresh, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the coqui-voice-training workflow.

Positive Consequences

  • Zero additional infrastructure — reads directly from the NFS volume
  • Single source of truth — the trained model directory is the registry
  • Newly trained voices appear automatically within the refresh interval
  • On-demand refresh available via NATS for immediate availability
  • Fully backward compatible — existing speaker_wav_b64 cloning unchanged

Negative Consequences

  • Polling-based discovery adds slight latency (mitigated by configurable interval and on-demand refresh)
  • No metadata beyond what model_info.json contains (sufficient for current needs)
  • Requires NFS volume mounted at VOICE_MODEL_STORE path in the TTS pod

Implementation

VoiceRegistry

A VoiceRegistry class scans VOICE_MODEL_STORE (default /models/tts/custom) for voice directories. Each directory must contain:

| File | Required | Description |
| --- | --- | --- |
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
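
A `model_info.json` exported by the training workflow might look like the following (the keys come from the table above; the values are illustrative, not taken from an actual export):

```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-13T15:31:33-05:00"
}
```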

The registry is refreshed:

  • On service startup
  • Periodically every VOICE_REGISTRY_REFRESH_SECONDS (default 300s)
  • On demand via ai.voice.tts.voices.refresh NATS subject

Synthesis Routing

When a TTS request specifies a speaker, the service checks the registry first:

Request with speaker="my-voice"
  ├─ Found in VoiceRegistry → send model_path + config_path to XTTS
  ├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
  └─ Not found + no speaker_wav_b64 → use default speaker
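
The routing above can be sketched as a small dispatch function (names and return shape are illustrative assumptions, not the module's actual code):

```python
def resolve_speaker(registry, speaker=None, speaker_wav_b64=None):
    """Return a (mode, payload) pair describing how to synthesize.

    Mirrors the routing table: a registry hit wins, then ad-hoc
    cloning via reference audio, then the default XTTS speaker.
    """
    if speaker:
        # Assumed VoiceRegistry.get() -> CustomVoice | None
        voice = registry.get(speaker)
        if voice is not None:
            return ("custom", {"model_path": voice.model_path,
                               "config_path": voice.config_path})
    if speaker_wav_b64:
        return ("clone", {"speaker_wav_b64": speaker_wav_b64})
    return ("default", {})
```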

New NATS Subjects

| Subject | Pattern | Description |
| --- | --- | --- |
| `ai.voice.tts.voices.list` | Request-reply | List the default speaker plus all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger an immediate registry rescan |
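
A handler for `ai.voice.tts.voices.list` might assemble its reply like this. The reply schema (a `voices` array with the default speaker first) is an assumption for illustration, not the module's actual contract:

```python
import json


def build_voices_list_reply(registry, default_speaker="default"):
    """Build a JSON reply payload for ai.voice.tts.voices.list.

    The default speaker is listed first, followed by every custom
    voice currently known to the registry.
    """
    voices = [{"name": default_speaker, "type": "default"}]
    for name in registry.list_names():  # assumed VoiceRegistry.list_names()
        voices.append({"name": name, "type": "custom"})
    return json.dumps({"voices": voices}).encode()
```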

New Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
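
Reading these at startup could look like the following sketch (the variable names and defaults come from the table above; how the module actually loads configuration is not specified here):

```python
import os
from pathlib import Path

# Defaults mirror the table above.
VOICE_MODEL_STORE = Path(os.environ.get("VOICE_MODEL_STORE", "/models/tts/custom"))
VOICE_REGISTRY_REFRESH_SECONDS = int(
    os.environ.get("VOICE_REGISTRY_REFRESH_SECONDS", "300")
)
```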

Integration with Training Pipeline

coqui-voice-training Argo Workflow
  └─ export-trained-model step
       └─ Writes to /models/tts/custom/{voice-name}/
            ├── model.pth
            ├── config.json
            └── model_info.json

TTS Streaming Service
  └─ VoiceRegistry
       └─ Scans /models/tts/custom/
            └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)

Links

  • Related: ADR-0009 — Argo/Kubeflow workflow engines
  • Related: ADR-0011 — XTTS runs on the KubeRay GPU backend
  • Workflow: argo/coqui-voice-training.yaml
  • Module: tts-module/tts_streaming.py