# Custom Trained Voice Support in TTS Module

* Status: accepted
* Date: 2026-02-13
* Deciders: Billy Davies
* Technical Story: Enable the TTS module to use custom voices generated by the `coqui-voice-training` Argo workflow

## Context and Problem Statement

The `coqui-voice-training` Argo workflow trains custom VITS voice models from audio samples and exports them to NFS at `/models/tts/custom/{voice-name}/`. The TTS streaming module currently supports only the default XTTS speaker or ad-hoc voice cloning via base64-encoded reference audio (`speaker_wav_b64`). There is no mechanism to discover and use the fine-tuned models produced by the training pipeline.

How should the TTS module discover and serve custom trained voices so that callers can request a trained voice by name without providing reference audio?

## Decision Drivers

* Voices trained by the Argo pipeline should be usable immediately without service restarts
* Callers should be able to request a trained voice by name (e.g. `"speaker": "my-voice"`)
* Existing ad-hoc voice cloning via `speaker_wav_b64` must continue to work
* Other services need a way to enumerate available voices
* No external database or registry should be required — the file system is the source of truth

## Considered Options

1. **File-system VoiceRegistry with periodic refresh** — scan the NFS model store on startup and periodically
2. **Database-backed voice catalogue** — store voice metadata in PostgreSQL
3. **NATS KV bucket for voice metadata** — store voice info in NATS Key-Value store

## Decision Outcome

Chosen option: **Option 1 — File-system VoiceRegistry with periodic refresh**, because it introduces zero new infrastructure, uses the model store as the single source of truth, and aligns with the export layout already produced by the `coqui-voice-training` workflow.

### Positive Consequences

* Zero additional infrastructure — reads directly from the NFS volume
* Single source of truth — the trained model directory is the registry
* Newly trained voices appear automatically within the refresh interval
* On-demand refresh available via NATS for immediate availability
* Fully backward compatible — existing `speaker_wav_b64` cloning unchanged

### Negative Consequences

* Polling-based discovery adds slight latency (mitigated by configurable interval and on-demand refresh)
* No metadata beyond what `model_info.json` contains (sufficient for current needs)
* Requires NFS volume mounted at the `VOICE_MODEL_STORE` path in the TTS pod

## Implementation

### VoiceRegistry

A `VoiceRegistry` class scans `VOICE_MODEL_STORE` (default `/models/tts/custom`) for voice directories. Each voice directory contains the following files:

| File | Required | Description |
|------|----------|-------------|
| `model_info.json` | Yes | Metadata: name, language, type, created_at |
| `model.pth` | Yes | Trained model weights |
| `config.json` | No | Model configuration |
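
As an illustration, a `model_info.json` might look like the following. The field names come from the table above; the values (and the exact `type` string) are hypothetical:

```json
{
  "name": "my-voice",
  "language": "en",
  "type": "vits",
  "created_at": "2026-02-10T14:32:00Z"
}
```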

The registry is refreshed:

- On service startup
- Periodically every `VOICE_REGISTRY_REFRESH_SECONDS` (default 300s)
- On demand via the `ai.voice.tts.voices.refresh` NATS subject
### Synthesis Routing

When a TTS request specifies a `speaker`, the service checks the registry first:

```
Request with speaker="my-voice"
├─ Found in VoiceRegistry → send model_path + config_path to XTTS
├─ Not found + speaker_wav_b64 present → ad-hoc voice cloning (existing)
└─ Not found + no speaker_wav_b64 → use default speaker
```
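
The decision tree above can be sketched as a pure routing function. The return shapes are illustrative only, and the `voices` mapping stands in for the `VoiceRegistry` lookup:

```python
def route_synthesis(speaker, voices, speaker_wav_b64=None):
    """Pick a synthesis path, mirroring the routing tree above.

    voices: mapping of voice name -> dict with model_path / config_path.
    """
    voice = voices.get(speaker) if speaker else None
    if voice is not None:
        # Trained voice found: hand its model + config paths to XTTS.
        return {"mode": "custom", **voice}
    if speaker_wav_b64 is not None:
        # Existing ad-hoc cloning path from base64 reference audio.
        return {"mode": "clone", "speaker_wav_b64": speaker_wav_b64}
    # Neither a known voice nor reference audio: default XTTS speaker.
    return {"mode": "default"}
```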
### New NATS Subjects

| Subject | Pattern | Description |
|---------|---------|-------------|
| `ai.voice.tts.voices.list` | Request-reply | List default speaker + all custom voices |
| `ai.voice.tts.voices.refresh` | Request-reply | Trigger immediate registry rescan |
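
One way the service might answer `ai.voice.tts.voices.list`. Only the subject names come from the table; the reply schema and helper below are assumptions for illustration:

```python
import json


def build_voices_reply(default_speaker, custom_voices):
    """Build a JSON reply: default speaker plus all registered custom voices.

    custom_voices: mapping of voice name -> metadata dict (from model_info.json).
    """
    return json.dumps({
        "default": default_speaker,
        "custom": [{"name": name, **meta} for name, meta in sorted(custom_voices.items())],
    }).encode()


# Wiring into NATS (sketch, using nats-py; not executed here):
#
#   nc = await nats.connect("nats://nats:4222")
#   async def handler(msg):
#       await msg.respond(build_voices_reply("default", registry_metadata))
#   await nc.subscribe("ai.voice.tts.voices.list", cb=handler)
```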
### New Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `VOICE_MODEL_STORE` | `/models/tts/custom` | NFS path to trained voice models |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Periodic rescan interval (seconds) |
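
Reading these with their documented defaults is a one-liner each (a sketch; variable names here match the table, the module's actual config code may differ):

```python
import os

# Fall back to the documented defaults when the variables are unset.
VOICE_MODEL_STORE = os.environ.get("VOICE_MODEL_STORE", "/models/tts/custom")
VOICE_REGISTRY_REFRESH_SECONDS = int(os.environ.get("VOICE_REGISTRY_REFRESH_SECONDS", "300"))
```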
### Integration with Training Pipeline

```
coqui-voice-training Argo Workflow
└─ export-trained-model step
   └─ Writes to /models/tts/custom/{voice-name}/
      ├── model.pth
      ├── config.json
      └── model_info.json

TTS Streaming Service
└─ VoiceRegistry
   └─ Scans /models/tts/custom/
      └─ Registers {voice-name} → CustomVoice(model_path, config_path, ...)
```
## Links

* Related: [ADR-0009](0009-dual-workflow-engines.md) — Argo/Kubeflow workflow engines
* Related: [ADR-0011](0011-kuberay-unified-gpu-backend.md) — XTTS runs on the KubeRay GPU backend
* Workflow: `argo/coqui-voice-training.yaml`
* Module: `tts-module/tts_streaming.py`