# Custom Trained Voice Support in TTS Module

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy Davies
* Technical Story: Enable the TTS module to use custom voices generated by the `coqui-voice-training` Argo workflow

## Context and Problem Statement

The `coqui-voice-training` Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at `/models/tts/custom/{voice-name}/`. The TTS streaming module currently supports only the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (`speaker_wav_b64`). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.

How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?

## Decision Drivers

* Voices trained by the Argo pipeline should be usable immediately without service restarts
* Callers should be able to request a trained voice by name (e.g. `"speaker": "my-voice"`)
* Existing ad-hoc voice cloning via `speaker_wav_b64` must continue to work
* Other services need a way to enumerate available voices
* No external database or registry should be required — the file system is the source of truth

## Considered Options

1. **File-system VoiceRegistry with periodic refresh** — scan the NFS model store on startup and periodically
2. **Database-backed voice catalogue** — store voice metadata in PostgreSQL
3. **NATS KV bucket for voice metadata** — store voice info in NATS Key-Value store

## Decision Outcome

Chosen option: **Option 1 — File-system VoiceRegistry with periodic refresh**, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the `coqui-voice-training` workflow.

### Positive Consequences

* Zero additional infrastructure — reads directly from the NFS volume
* Single source of truth — the trained model directory is the registry
* Newly trained voices appear automatically within the refresh interval
* On-demand refresh available via NATS for immediate availability
* Fully backward compatible — existing `speaker_wav_b64` cloning unchanged

### Negative Consequences

* Polling-based discovery adds slight latency (mitigated by configurable interval and on-demand refresh)
* No metadata beyond what `model_info.json` contains (sufficient for current needs)
* Requires NFS volume mounted at the `VOICE_MODEL_STORE` path in the TTS pod

## Implementation

### VoiceRegistry

A `VoiceRegistry` class scans `VOICE_MODEL_STORE` (default `/models/tts/custom`) for voice directories. Each voice directory contains the following files:

| File | Required | Description |
|------|----------|-------------|
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
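
As an illustration, a `model_info.json` might look like the following. The field names come from the table above; the values (and the exact `type` string) are hypothetical:

```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-10T14:32:00Z"
}
```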

The registry is refreshed:

- On service startup
- Periodically every `VOICE_REGISTRY_REFRESH_SECONDS` (default 300s)
- On demand via the `ai.voice.tts.voices.refresh` NATS subject
### Synthesis Routing

When a TTS request specifies a `speaker`, the service checks the registry first:

```
Request with speaker="my-voice"
├─ Found in VoiceRegistry → send model_path + config_path to XTTS
├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
└─ Not found + no speaker_wav_b64 → use default speaker
```
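
The decision tree above can be sketched as a pure routing function. The return shapes are illustrative only, and the `voices` mapping stands in for the `VoiceRegistry` lookup:

```python
def route_synthesis(speaker, voices, speaker_wav_b64=None):
    """Pick a synthesis path, mirroring the routing tree above.

    voices: mapping of voice name -> dict with model_path / config_path.
    """
    voice = voices.get(speaker) if speaker else None
    if voice is not None:
        # Trained voice found: hand its model + config paths to XTTS.
        return {"mode": "custom", **voice}
    if speaker_wav_b64 is not None:
        # Existing ad-hoc cloning path from base64 reference audio.
        return {"mode": "clone", "speaker_wav_b64": speaker_wav_b64}
    # Neither a known voice nor reference audio: default XTTS speaker.
    return {"mode": "default"}
```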
### New NATS Subjects

| Subject | Pattern | Description |
|---------|---------|-------------|
| `ai.voice.tts.voices.list` | Request-reply | List default speaker + all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger immediate registry rescan |
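
One way the service might answer `ai.voice.tts.voices.list`. Only the subject names come from the table; the reply schema and helper below are assumptions for illustration:

```python
import json


def build_voices_reply(default_speaker, custom_voices):
    """Build a JSON reply: default speaker plus all registered custom voices.

    custom_voices: mapping of voice name -> metadata dict (from model_info.json).
    """
    return json.dumps({
        "default": default_speaker,
        "custom": [{"name": name, **meta} for name, meta in sorted(custom_voices.items())],
    }).encode()


# Wiring into NATS (sketch, using nats-py; not executed here):
#
#   nc = await nats.connect("nats://nats:4222")
#   async def handler(msg):
#       await msg.respond(build_voices_reply("default", registry_metadata))
#   await nc.subscribe("ai.voice.tts.voices.list", cb=handler)
```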
### New Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
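
Reading these with their documented defaults is a one-liner each (a sketch; variable names here match the table, the module's actual config code may differ):

```python
import os

# Fall back to the documented defaults when the variables are unset.
VOICE_MODEL_STORE = os.environ.get("VOICE_MODEL_STORE", "/models/tts/custom")
VOICE_REGISTRY_REFRESH_SECONDS = int(os.environ.get("VOICE_REGISTRY_REFRESH_SECONDS", "300"))
```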
### Integration with Training Pipeline

```
coqui-voice-training Argo Workflow
└─ export-trained-model step
   └─ Writes to /models/tts/custom/{voice-name}/
      ├── model.pth
      ├── config.json
      └── model_info.json

TTS Streaming Service
└─ VoiceRegistry
   └─ Scans /models/tts/custom/
      └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)
```
## Links

* Related: [ADR-0009](0009-dual-workflow-engines.md) — Argo/Kubeflow workflow engines
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — XTTS runs on the KubeRay GPU backend
* Workflow: `argo/coqui-voice-training.yaml`
* Module: `tts-module/tts_streaming.py`