feat: add streaming TTS service with Coqui XTTS
- tts_streaming.py: NATS-based TTS using XTTS HTTP API
- Streaming audio chunks for low-latency playback
- Voice cloning support via reference audio
- Multi-language synthesis
- OpenTelemetry instrumentation with HyperDX support
# Streaming TTS Module

A dedicated Text-to-Speech (TTS) service that processes synthesis requests from NATS using Coqui XTTS.

## Overview

This module enables real-time text-to-speech synthesis by accepting text via NATS and streaming audio chunks back as they're generated. This reduces latency for voice assistant applications by allowing playback to begin before synthesis completes.
## Features

- **NATS Integration**: Accepts TTS requests via NATS messaging
- **Streaming Audio**: Streams audio chunks back for immediate playback
- **Voice Cloning**: Support for custom speaker voices via reference audio
- **Multi-language**: Support for multiple languages via XTTS
- **OpenTelemetry**: Full observability with tracing and metrics
- **HyperDX Support**: Optional cloud observability integration
## Architecture

```
┌─────────────────┐
│    Voice App    │  (voice-assistant, chat-handler)
│                 │
└────────┬────────┘
         │ Text
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.tts.request.{session_id}
│  TTS Request    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TTS Streaming  │  (This Service)
│    Service      │  - Calls XTTS API
│                 │  - Streams audio chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.tts.audio.{session_id}
│  Audio Chunks   │
└─────────────────┘
```
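The request and audio subjects share the same `{session_id}` suffix, so a service can derive the output subject directly from the subject it received. A minimal sketch of that mapping (the helper name is hypothetical; the subject names come from the diagram above):

```python
# Derive the audio output subject from an incoming request subject.
# The session ID is the final dot-separated token of the subject.
def audio_subject(request_subject: str) -> str:
    session_id = request_subject.rsplit(".", 1)[-1]
    return f"ai.voice.tts.audio.{session_id}"
```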
## NATS Message Protocol

### TTS Request (ai.voice.tts.request.{session_id})

All messages use **msgpack** binary encoding.

**Request:**
```python
{
    "text": "Hello, how can I help you today?",
    "speaker": "default",       # Optional: speaker ID
    "language": "en",           # Optional: language code
    "speaker_wav_b64": "...",   # Optional: base64 reference audio for voice cloning
    "stream": True              # Optional: stream chunks (default) or send complete audio
}
```
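On the service side, decoding reduces to unpacking the msgpack payload and applying the documented defaults. A sketch, assuming the `msgpack` Python package (the `parse_request` helper itself is hypothetical; the defaults match the spec above):

```python
import msgpack

# Decode a msgpack-encoded TTS request and fill in default values.
def parse_request(payload: bytes) -> dict:
    req = msgpack.unpackb(payload, raw=False)
    return {
        "text": req["text"],                      # required field
        "speaker": req.get("speaker", "default"),
        "language": req.get("language", "en"),
        "speaker_wav_b64": req.get("speaker_wav_b64"),  # None if absent
        "stream": req.get("stream", True),
    }
```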
### Audio Output (ai.voice.tts.audio.{session_id})

**Streamed Chunk:**
```python
{
    "session_id": "unique-session-id",
    "chunk_index": 0,
    "total_chunks": 5,
    "audio_b64": "base64-encoded-audio-chunk",
    "is_last": False,
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
```
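A consumer can buffer chunks keyed by `chunk_index` and stitch the audio together once `is_last` arrives. A minimal client-side sketch (the `reassemble` helper is illustrative, not part of the service):

```python
import base64

# Reorder buffered chunks by index and concatenate the decoded audio.
def reassemble(chunks: list) -> bytes:
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])
    if not ordered or not ordered[-1]["is_last"]:
        raise ValueError("stream incomplete: last chunk not received")
    return b"".join(base64.b64decode(c["audio_b64"]) for c in ordered)
```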
**Complete Audio (when stream=False):**
```python
{
    "session_id": "unique-session-id",
    "audio_b64": "base64-encoded-complete-audio",
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
```
### Status Updates (ai.voice.tts.status.{session_id})

```python
{
    "session_id": "unique-session-id",
    "status": "processing",  # processing, completed, error
    "message": "Synthesizing 50 characters",
    "timestamp": 1234567890.123
}
```
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `XTTS_URL` | `http://xtts-predictor.ai-ml.svc.cluster.local` | Coqui XTTS service URL |
| `TTS_DEFAULT_SPEAKER` | `default` | Default speaker ID |
| `TTS_DEFAULT_LANGUAGE` | `en` | Default language code |
| `TTS_AUDIO_CHUNK_SIZE` | `32768` | Audio chunk size in bytes |
| `TTS_SAMPLE_RATE` | `24000` | Audio sample rate (Hz) |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
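Reading these variables with their documented defaults might look like the sketch below. The variable names and defaults match the table; the `Config` class itself is a hypothetical convenience, not the service's actual structure:

```python
import os

# Load documented environment variables, falling back to table defaults.
class Config:
    NATS_URL = os.getenv("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
    XTTS_URL = os.getenv("XTTS_URL", "http://xtts-predictor.ai-ml.svc.cluster.local")
    DEFAULT_SPEAKER = os.getenv("TTS_DEFAULT_SPEAKER", "default")
    DEFAULT_LANGUAGE = os.getenv("TTS_DEFAULT_LANGUAGE", "en")
    AUDIO_CHUNK_SIZE = int(os.getenv("TTS_AUDIO_CHUNK_SIZE", "32768"))
    SAMPLE_RATE = int(os.getenv("TTS_SAMPLE_RATE", "24000"))
    OTEL_ENABLED = os.getenv("OTEL_ENABLED", "true").lower() == "true"
    HYPERDX_ENABLED = os.getenv("HYPERDX_ENABLED", "false").lower() == "true"
```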
## Building

```bash
docker build -t tts-module:latest .
```
## Testing

```bash
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Subscribe to audio output (in a separate terminal, before publishing,
# so no chunks are missed)
nats sub "ai.voice.tts.audio.>"

# Send TTS request
python -c "
import asyncio

import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')

    request = {
        'text': 'Hello, this is a test of text to speech.',
        'stream': True
    }

    await nc.publish(
        'ai.voice.tts.request.test-session',
        msgpack.packb(request)
    )
    await nc.close()

asyncio.run(test())
"
```
## Voice Cloning

To use a custom voice, provide reference audio in the request:

```python
import base64

with open("reference_voice.wav", "rb") as f:
    speaker_wav_b64 = base64.b64encode(f.read()).decode()

request = {
    "text": "This will sound like the reference voice.",
    "speaker_wav_b64": speaker_wav_b64
}
```
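Putting the pieces together, the cloning request is msgpack-encoded like any other message before being published to `ai.voice.tts.request.{session_id}`. A sketch of building that payload, assuming the `msgpack` package (`build_cloning_request` is an illustrative helper; the raw bytes stand in for the contents of a reference WAV file):

```python
import base64
import msgpack

# Encode a voice-cloning TTS request as a msgpack payload ready to publish.
def build_cloning_request(text: str, reference_audio: bytes) -> bytes:
    request = {
        "text": text,
        "speaker_wav_b64": base64.b64encode(reference_audio).decode(),
    }
    return msgpack.packb(request)
```

The returned bytes can then be published exactly as in the Testing section above.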
## License

MIT