# Streaming TTS Module
A dedicated Text-to-Speech (TTS) service that processes synthesis requests from NATS using Coqui XTTS.
## Overview
This module enables real-time text-to-speech synthesis by accepting text via NATS and streaming audio chunks back as they're generated. This reduces latency for voice assistant applications by allowing playback to begin before synthesis completes.
## Features
- NATS Integration: Accepts TTS requests via NATS messaging
- Streaming Audio: Streams audio chunks back for immediate playback
- Voice Cloning: Support for custom speaker voices via reference audio
- Multi-language: Support for multiple languages via XTTS
- OpenTelemetry: Full observability with tracing and metrics
- HyperDX Support: Optional cloud observability integration
## Architecture

```
┌─────────────────┐
│   Voice App     │  (voice-assistant, chat-handler)
│                 │
└────────┬────────┘
         │ Text
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.tts.request.{session_id}
│  TTS Request    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TTS Streaming  │  (This Service)
│  Service        │  - Calls XTTS API
│                 │  - Streams audio chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.tts.audio.{session_id}
│  Audio Chunks   │
└─────────────────┘
```
## NATS Message Protocol

All messages use msgpack binary encoding.

### TTS Request (`ai.voice.tts.request.{session_id}`)

Request:

```python
{
    "text": "Hello, how can I help you today?",
    "speaker": "default",      # Optional: speaker ID
    "language": "en",          # Optional: language code
    "speaker_wav_b64": "...",  # Optional: base64 reference audio for voice cloning
    "stream": True             # Optional: stream chunks (default) or send complete audio
}
```
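Only `text` is required; the other fields are optional. A small helper like the following (illustrative only, not part of the service) can build a payload while omitting unset optional fields, so the service applies its own defaults:

```python
def build_tts_request(text, speaker=None, language=None,
                      speaker_wav_b64=None, stream=True):
    """Assemble a TTS request payload.

    Optional fields are left out entirely when unset so the service
    falls back to TTS_DEFAULT_SPEAKER / TTS_DEFAULT_LANGUAGE.
    """
    if not text:
        raise ValueError("text is required")
    request = {"text": text, "stream": stream}
    if speaker is not None:
        request["speaker"] = speaker
    if language is not None:
        request["language"] = language
    if speaker_wav_b64 is not None:
        request["speaker_wav_b64"] = speaker_wav_b64
    return request
```

The resulting dict is what gets passed to `msgpack.packb()` before publishing.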
### Audio Output (`ai.voice.tts.audio.{session_id}`)

Streamed Chunk:

```python
{
    "session_id": "unique-session-id",
    "chunk_index": 0,
    "total_chunks": 5,
    "audio_b64": "base64-encoded-audio-chunk",
    "is_last": False,
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
```
Complete Audio (when `stream=False`):

```python
{
    "session_id": "unique-session-id",
    "audio_b64": "base64-encoded-complete-audio",
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
```
### Status Updates (`ai.voice.tts.status.{session_id}`)

```python
{
    "session_id": "unique-session-id",
    "status": "processing",  # processing, completed, error
    "message": "Synthesizing 50 characters",
    "timestamp": 1234567890.123
}
```
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `XTTS_URL` | `http://xtts-predictor.ai-ml.svc.cluster.local` | Coqui XTTS service URL |
| `TTS_DEFAULT_SPEAKER` | `default` | Default speaker ID |
| `TTS_DEFAULT_LANGUAGE` | `en` | Default language code |
| `TTS_AUDIO_CHUNK_SIZE` | `32768` | Audio chunk size in bytes |
| `TTS_SAMPLE_RATE` | `24000` | Audio sample rate (Hz) |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
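As a sketch of how the service might consume these variables (the function name and dict keys here are illustrative, not the module's actual API), each setting falls back to the default from the table when unset:

```python
import os

def load_settings():
    """Read service configuration from the environment,
    applying the documented defaults when a variable is unset."""
    return {
        "nats_url": os.getenv("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "xtts_url": os.getenv("XTTS_URL", "http://xtts-predictor.ai-ml.svc.cluster.local"),
        "default_speaker": os.getenv("TTS_DEFAULT_SPEAKER", "default"),
        "default_language": os.getenv("TTS_DEFAULT_LANGUAGE", "en"),
        # Numeric settings arrive as strings and must be converted.
        "chunk_size": int(os.getenv("TTS_AUDIO_CHUNK_SIZE", "32768")),
        "sample_rate": int(os.getenv("TTS_SAMPLE_RATE", "24000")),
        # Boolean flags: anything other than "true" (case-insensitive) is off.
        "otel_enabled": os.getenv("OTEL_ENABLED", "true").lower() == "true",
        "hyperdx_enabled": os.getenv("HYPERDX_ENABLED", "false").lower() == "true",
    }
```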
## Building

```shell
docker build -t tts-module:latest .
```
## Testing

```shell
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Send TTS request
python -c "
import asyncio

import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')
    request = {
        'text': 'Hello, this is a test of text to speech.',
        'stream': True
    }
    await nc.publish(
        'ai.voice.tts.request.test-session',
        msgpack.packb(request)
    )
    await nc.close()

asyncio.run(test())
"

# Subscribe to audio output
nats sub "ai.voice.tts.audio.>"
```
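The CLI subscriber above prints raw msgpack payloads. To save the decoded audio for listening, the base64-decoded bytes can be wrapped in a WAV container. This sketch assumes the chunks are raw 16-bit mono PCM at the service sample rate (an assumption about XTTS output; adjust the parameters if the actual format differs):

```python
import wave

def write_wav(path, pcm_bytes, sample_rate=24000):
    """Wrap raw PCM bytes in a WAV container.

    Assumes 16-bit (2-byte) mono samples; change setsampwidth /
    setnchannels if the service emits a different format.
    """
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
```

Feed it the concatenation of the base64-decoded `audio_b64` fields, in `chunk_index` order.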
## Voice Cloning

To use a custom voice, provide reference audio in the request:

```python
import base64

with open("reference_voice.wav", "rb") as f:
    speaker_wav_b64 = base64.b64encode(f.read()).decode()

request = {
    "text": "This will sound like the reference voice.",
    "speaker_wav_b64": speaker_wav_b64
}
```
## License
MIT