# Custom Trained Voice Support in TTS Module
* Status: accepted
* Date: 2026-02-13
* Deciders: Billy Davies
* Technical Story: Enable the TTS module to use custom voices generated by the `coqui-voice-training` Argo workflow
## Context and Problem Statement
The `coqui-voice-training` Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at `/models/tts/custom/{voice-name}/`. The TTS streaming module currently only supports the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (`speaker_wav_b64`). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.
How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?
## Decision Drivers
* Voices trained by the Argo pipeline should be usable immediately without service restarts
* Callers should be able to request a trained voice by name (e.g. `"speaker": "my-voice"`)
* Existing ad-hoc voice cloning via `speaker_wav_b64` must continue to work
* Other services need a way to enumerate available voices
* No external database or registry should be required — the file system is the source of truth
## Considered Options
1. **File-system VoiceRegistry with periodic refresh** — scan the NFS model store on startup and periodically
2. **Database-backed voice catalogue** — store voice metadata in PostgreSQL
3. **NATS KV bucket for voice metadata** — store voice info in NATS Key-Value store
## Decision Outcome
Chosen option: **Option 1 — File-system VoiceRegistry with periodic refresh**, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the `coqui-voice-training` workflow.
### Positive Consequences
* Zero additional infrastructure — reads directly from the NFS volume
* Single source of truth — the trained model directory is the registry
* Newly trained voices appear automatically within the refresh interval
* On-demand refresh available via NATS for immediate availability
* Fully backward compatible — existing `speaker_wav_b64` cloning unchanged
### Negative Consequences
* Polling-based discovery means a newly exported voice may not be visible until the next rescan (mitigated by the configurable interval and the on-demand refresh subject)
* No metadata beyond what `model_info.json` contains (sufficient for current needs)
* Requires NFS volume mounted at `VOICE_MODEL_STORE` path in the TTS pod
## Implementation
### VoiceRegistry
A `VoiceRegistry` class scans `VOICE_MODEL_STORE` (default `/models/tts/custom`) for voice directories. Each directory must contain:
| File | Required | Description |
|------|----------|-------------|
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
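The exact schema of `model_info.json` is defined by the training workflow's export step; a minimal example consistent with the fields listed above (values are illustrative) might look like:
```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-13T15:31:33-05:00"
}
```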
The registry is refreshed:
- On service startup
- Periodically every `VOICE_REGISTRY_REFRESH_SECONDS` (default 300s)
- On demand via `ai.voice.tts.voices.refresh` NATS subject
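A minimal sketch of such a registry, assuming the directory layout above. The class shape and method names are illustrative, not the module's actual API; only the `VOICE_MODEL_STORE` and `VOICE_REGISTRY_REFRESH_SECONDS` variables come from this ADR:
```python
import json
import os
import threading
import time
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CustomVoice:
    name: str
    model_path: Path
    config_path: Path | None
    info: dict

class VoiceRegistry:
    """Scans VOICE_MODEL_STORE for directories containing a trained voice."""

    def __init__(self) -> None:
        self.store = Path(os.environ.get("VOICE_MODEL_STORE", "/models/tts/custom"))
        self.interval = int(os.environ.get("VOICE_REGISTRY_REFRESH_SECONDS", "300"))
        self._voices: dict[str, CustomVoice] = {}
        self._lock = threading.Lock()

    def refresh(self) -> None:
        """Rescan the store; a voice is valid if model_info.json and model.pth exist."""
        found: dict[str, CustomVoice] = {}
        if self.store.is_dir():
            for entry in self.store.iterdir():
                info_file = entry / "model_info.json"
                weights = entry / "model.pth"
                if not (entry.is_dir() and info_file.is_file() and weights.is_file()):
                    continue  # incomplete or in-progress export; skip it
                config = entry / "config.json"
                found[entry.name] = CustomVoice(
                    name=entry.name,
                    model_path=weights,
                    config_path=config if config.is_file() else None,
                    info=json.loads(info_file.read_text()),
                )
        with self._lock:
            self._voices = found  # swap the snapshot atomically

    def get(self, name: str) -> CustomVoice | None:
        with self._lock:
            return self._voices.get(name)

    def list(self) -> list[str]:
        with self._lock:
            return sorted(self._voices)

    def run_periodic(self) -> None:
        """Background loop: rescan every VOICE_REGISTRY_REFRESH_SECONDS."""
        while True:
            self.refresh()
            time.sleep(self.interval)
```
Replacing the whole snapshot under a lock (rather than mutating it in place) keeps lookups consistent while a rescan is running and lets the on-demand refresh handler simply call `refresh()`.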
### Synthesis Routing
When a TTS request specifies a `speaker`, the service checks the registry first:
```
Request with speaker="my-voice"
├─ Found in VoiceRegistry → send model_path + config_path to XTTS
├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
└─ Not found + no speaker_wav_b64 → use default speaker
```
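In code, this routing might reduce to something like the following, reusing the hypothetical `VoiceRegistry` from the sketch above; `synthesize_custom`, `synthesize_cloned`, and `synthesize_default` are placeholders for the module's actual synthesis paths:
```python
def route_synthesis(request: dict, registry: "VoiceRegistry"):
    """Pick a synthesis path for a TTS request, mirroring the decision tree above."""
    speaker = request.get("speaker")
    voice = registry.get(speaker) if speaker else None
    if voice is not None:
        # Custom trained voice: hand the model + config paths to XTTS.
        return synthesize_custom(request["text"], voice.model_path, voice.config_path)
    if request.get("speaker_wav_b64"):
        # Existing ad-hoc cloning path, unchanged for backward compatibility.
        return synthesize_cloned(request["text"], request["speaker_wav_b64"])
    # Fall back to the default XTTS speaker.
    return synthesize_default(request["text"])
```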
### New NATS Subjects
| Subject | Pattern | Description |
|---------|---------|-------------|
| `ai.voice.tts.voices.list` | Request-reply | List default speaker + all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger immediate registry rescan |
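A caller could enumerate voices with a plain NATS request-reply, e.g. using the `nats-py` client. The server URL and the reply payload shape are assumptions; this ADR only fixes the subject names:
```python
import asyncio
import json

import nats

async def main() -> None:
    nc = await nats.connect("nats://nats:4222")  # assumed server URL
    # Request-reply on the list subject; empty request payload.
    msg = await nc.request("ai.voice.tts.voices.list", b"", timeout=5)
    print(json.loads(msg.data))  # e.g. {"default": "...", "custom": ["my-voice"]}
    await nc.close()

asyncio.run(main())
```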
### New Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
### Integration with Training Pipeline
```
coqui-voice-training Argo Workflow
└─ export-trained-model step
└─ Writes to /models/tts/custom/{voice-name}/
├── model.pth
├── config.json
└── model_info.json
TTS Streaming Service
└─ VoiceRegistry
└─ Scans /models/tts/custom/
└─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)
```
## Links
* Related: [ADR-0009](0009-dual-workflow-engines.md) — Argo/Kubeflow workflow engines
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — XTTS runs on the KubeRay GPU backend
* Workflow: `argo/coqui-voice-training.yaml`
* Module: `tts-module/tts_streaming.py`