# Custom Trained Voice Support in TTS Module

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy Davies
* Technical Story: Enable the TTS module to use custom voices generated by the `coqui-voice-training` Argo workflow

## Context and Problem Statement

The `coqui-voice-training` Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at `/models/tts/custom/{voice-name}/`. The TTS streaming module currently only supports the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (`speaker_wav_b64`). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.

How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?

## Decision Drivers

* Voices trained by the Argo pipeline should be usable immediately, without service restarts
* Callers should be able to request a trained voice by name (e.g. `"speaker": "my-voice"`)
* Existing ad-hoc voice cloning via `speaker_wav_b64` must continue to work
* Other services need a way to enumerate available voices
* No external database or registry should be required — the file system is the source of truth

## Considered Options

1. **File-system VoiceRegistry with periodic refresh** — scan the NFS model store on startup and periodically
2. **Database-backed voice catalogue** — store voice metadata in PostgreSQL
3. **NATS KV bucket for voice metadata** — store voice info in the NATS Key-Value store

## Decision Outcome

Chosen option: **Option 1 — File-system VoiceRegistry with periodic refresh**, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the `coqui-voice-training` workflow.

### Positive Consequences

* Zero additional infrastructure — reads directly from the NFS volume
* Single source of truth — the trained model directory is the registry
* Newly trained voices appear automatically within the refresh interval
* On-demand refresh available via NATS for immediate availability
* Fully backward compatible — existing `speaker_wav_b64` cloning unchanged

### Negative Consequences

* Polling-based discovery adds slight latency (mitigated by a configurable interval and on-demand refresh)
* No metadata beyond what `model_info.json` contains (sufficient for current needs)
* Requires the NFS volume mounted at the `VOICE_MODEL_STORE` path in the TTS pod

## Implementation

### VoiceRegistry

A `VoiceRegistry` class scans `VOICE_MODEL_STORE` (default `/models/tts/custom`) for voice directories. Each directory must contain:

| File | Required | Description |
|------|----------|-------------|
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
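A minimal sketch of such a registry is shown below, assuming Python 3.10+; the `CustomVoice` dataclass and the method names are illustrative, not the module's actual API:

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CustomVoice:
    name: str
    model_path: Path
    config_path: Path | None
    info: dict


class VoiceRegistry:
    """Maps voice names to trained model files found under the model store."""

    def __init__(self, store: str = "/models/tts/custom") -> None:
        self._store = Path(store)
        self._voices: dict[str, CustomVoice] = {}

    def refresh(self) -> None:
        """Rescan the store; directories missing a required file are skipped."""
        voices: dict[str, CustomVoice] = {}
        if self._store.is_dir():
            for voice_dir in sorted(p for p in self._store.iterdir() if p.is_dir()):
                info_file = voice_dir / "model_info.json"
                model_file = voice_dir / "model.pth"
                if not (info_file.is_file() and model_file.is_file()):
                    continue  # incomplete export, e.g. a training run still in flight
                config_file = voice_dir / "config.json"  # optional
                voices[voice_dir.name] = CustomVoice(
                    name=voice_dir.name,
                    model_path=model_file,
                    config_path=config_file if config_file.is_file() else None,
                    info=json.loads(info_file.read_text()),
                )
        self._voices = voices  # swap the whole dict, never mutate in place

    def get(self, name: str) -> CustomVoice | None:
        return self._voices.get(name)

    def list_names(self) -> list[str]:
        return sorted(self._voices)
```

Replacing the internal dict wholesale on each rescan means concurrent lookups always see either the old or the new registry state, never a half-built one.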
The registry is refreshed:

- On service startup
- Periodically, every `VOICE_REGISTRY_REFRESH_SECONDS` (default 300 s)
- On demand, via the `ai.voice.tts.voices.refresh` NATS subject

### Synthesis Routing

When a TTS request specifies a `speaker`, the service checks the registry first:

```
Request with speaker="my-voice"
├─ Found in VoiceRegistry              → send model_path + config_path to XTTS
├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
└─ Not found + no speaker_wav_b64      → use default speaker
```

### New NATS Subjects

| Subject | Pattern | Description |
|---------|---------|-------------|
| `ai.voice.tts.voices.list` | Request-reply | List default speaker + all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger immediate registry rescan |

### New Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |

### Integration with Training Pipeline

```
coqui-voice-training Argo Workflow
└─ export-trained-model step
   └─ Writes to /models/tts/custom/{voice-name}/
      ├── model.pth
      ├── config.json
      └── model_info.json

TTS Streaming Service
└─ VoiceRegistry
   └─ Scans /models/tts/custom/
      └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)
```

## Links

* Related: [ADR-0009](0009-dual-workflow-engines.md) — Argo/Kubeflow workflow engines
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — XTTS runs on the KubeRay GPU backend
* Workflow: `argo/coqui-voice-training.yaml`
* Module: `tts-module/tts_streaming.py`
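For illustration, a caller-side sketch of the two new request-reply subjects using the `nats-py` client; the empty JSON payloads and the shape of the replies are assumptions, not a confirmed wire schema:

```python
import asyncio
import json

import nats  # nats-py


async def main() -> None:
    nc = await nats.connect("nats://localhost:4222")

    # Enumerate the default speaker plus all registered custom voices.
    reply = await nc.request("ai.voice.tts.voices.list", b"{}", timeout=5)
    print(json.loads(reply.data))

    # Force an immediate rescan, e.g. right after a training run exports a model.
    await nc.request("ai.voice.tts.voices.refresh", b"{}", timeout=30)

    await nc.drain()


asyncio.run(main())
```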