feat: custom voice support, CI pipeline, and Renovate config
- VoiceRegistry for trained voices from Kubeflow pipeline
- Custom voice routing in synthesize()
- NATS subjects for listing/refreshing voices
- pyproject.toml with ruff/pytest config
- Full test suite (26 tests)
- Gitea Actions CI (lint, test, docker, notify)
- Renovate config for automated dependency updates

Ref: ADR-0056, ADR-0057
README.md
@@ -11,6 +11,8 @@ This module enables real-time text-to-speech synthesis by accepting text via NAT
- **NATS Integration**: Accepts TTS requests via NATS messaging
- **Streaming Audio**: Streams audio chunks back for immediate playback
- **Voice Cloning**: Support for custom speaker voices via reference audio
- **Custom Trained Voices**: Automatic discovery of voices trained by the `coqui-voice-training` Argo workflow
- **Voice Registry**: Lists available voices and refreshes on-demand or periodically
- **Multi-language**: Support for multiple languages via XTTS
- **OpenTelemetry**: Full observability with tracing and metrics
- **HyperDX Support**: Optional cloud observability integration

@@ -53,13 +55,17 @@ All messages use **msgpack** binary encoding.

```python
{
    "text": "Hello, how can I help you today?",
    "speaker": "default",      # Optional: speaker ID or custom voice name
    "language": "en",          # Optional: language code
    "speaker_wav_b64": "...",  # Optional: base64 reference audio for ad-hoc voice cloning
    "stream": True             # Optional: stream chunks (default) or send complete audio
}
```

> **Custom voices:** When `speaker` matches the name of a custom trained voice
> in the voice registry, the service automatically routes to the trained model.
> No `speaker_wav_b64` is needed for trained voices.
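
As a rough illustration, a payload matching the schema above can be assembled like this before msgpack encoding (a minimal sketch only; `build_tts_request` is an illustrative helper, not part of the module's API):

```python
import base64


def build_tts_request(text, speaker="default", language="en",
                      speaker_wav=None, stream=True):
    """Assemble a synthesis request dict matching the schema above.

    Encode the result with msgpack.packb(...) before publishing over NATS
    (msgpack is left out here to keep the sketch dependency-free).
    """
    request = {"text": text, "speaker": speaker,
               "language": language, "stream": stream}
    if speaker_wav is not None:
        # Ad-hoc voice cloning: raw reference audio travels as base64 text.
        request["speaker_wav_b64"] = base64.b64encode(speaker_wav).decode("ascii")
    return request
```

For a trained voice, `speaker_wav` stays `None` and `speaker` is set to the registered voice name.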

### Audio Output (ai.voice.tts.audio.{session_id})

**Streamed Chunk:**

@@ -106,6 +112,8 @@ All messages use **msgpack** binary encoding.
| `TTS_DEFAULT_LANGUAGE` | `en` | Default language code |
| `TTS_AUDIO_CHUNK_SIZE` | `32768` | Audio chunk size in bytes |
| `TTS_SAMPLE_RATE` | `24000` | Audio sample rate (Hz) |
| `VOICE_MODEL_STORE` | `/models/tts/custom` | Path to custom voice models (NFS mount) |
| `VOICE_REGISTRY_REFRESH_SECONDS` | `300` | Interval to rescan model store for new voices |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
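
As an illustration of how these variables might be consumed, a minimal settings loader could look like the following (a sketch under assumptions; `load_settings` is a hypothetical helper, and the service's real configuration code may differ):

```python
import os


def load_settings(env=os.environ):
    """Read the settings from the table above, falling back to the defaults."""
    def flag(name, default):
        # Boolean flags accept a few common truthy spellings.
        return env.get(name, default).lower() in ("1", "true", "yes")

    return {
        "default_language": env.get("TTS_DEFAULT_LANGUAGE", "en"),
        "audio_chunk_size": int(env.get("TTS_AUDIO_CHUNK_SIZE", "32768")),
        "sample_rate": int(env.get("TTS_SAMPLE_RATE", "24000")),
        "voice_model_store": env.get("VOICE_MODEL_STORE", "/models/tts/custom"),
        "registry_refresh_seconds": int(env.get("VOICE_REGISTRY_REFRESH_SECONDS", "300")),
        "otel_enabled": flag("OTEL_ENABLED", "true"),
        "hyperdx_enabled": flag("HYPERDX_ENABLED", "false"),
    }
```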
@@ -164,6 +172,62 @@ request = {
}
```

## Custom Trained Voices

The `coqui-voice-training` Argo workflow trains custom TTS models and exports
them to the model store (`VOICE_MODEL_STORE`, default `/models/tts/custom`).
The TTS module discovers these voices automatically on startup and periodically
re-scans for newly trained voices.

### How it works

1. The voice training pipeline exports a model to `/models/tts/custom/{voice-name}/`
2. Each directory contains `model.pth`, `config.json`, and `model_info.json`
3. The TTS module scans the store and registers each voice by name
4. Requests with `"speaker": "my-voice"` automatically route to the trained model
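
The scan in steps 2-3 can be sketched roughly as follows (an illustrative sketch only, not the module's actual `VoiceRegistry` code):

```python
import json
from pathlib import Path


def scan_voice_store(store_path: str) -> dict[str, dict]:
    """Scan the model store and register each complete voice by directory name."""
    registry = {}
    store = Path(store_path)
    if not store.is_dir():
        return registry
    for voice_dir in sorted(store.iterdir()):
        if not voice_dir.is_dir():
            continue
        # A voice is registered only when all three expected files are present.
        required = ["model.pth", "config.json", "model_info.json"]
        if not all((voice_dir / f).is_file() for f in required):
            continue
        info = json.loads((voice_dir / "model_info.json").read_text())
        registry[voice_dir.name] = info
    return registry
```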

### Using a trained voice

```python
# Just set the speaker to the voice name — no reference audio needed
request = {
    "text": "This uses a fine-tuned voice model.",
    "speaker": "my-custom-voice"  # Matches {voice-name} from training pipeline
}
```

### Listing available voices

Send a NATS request to `ai.voice.tts.voices.list`:

```python
import asyncio

import msgpack
import nats


async def list_voices():
    nc = await nats.connect("nats://localhost:4222")
    resp = await nc.request("ai.voice.tts.voices.list", b"", timeout=5)
    data = msgpack.unpackb(resp.data, raw=False)
    print(f"Default speaker: {data['default_speaker']}")
    for voice in data["custom_voices"]:
        print(f"  - {voice['name']} ({voice['language']}, trained {voice['created_at']})")
    await nc.close()


asyncio.run(list_voices())
```

### Refreshing the voice registry

Voices are re-scanned every `VOICE_REGISTRY_REFRESH_SECONDS` (default 5 min).
To trigger an immediate refresh, send a request to `ai.voice.tts.voices.refresh`:

```python
resp = await nc.request("ai.voice.tts.voices.refresh", b"", timeout=10)
data = msgpack.unpackb(resp.data, raw=False)
print(f"Found {data['count']} custom voice(s)")
```

## License

MIT