Billy D. d4fafea09b feat: add streaming TTS service with Coqui XTTS
- tts_streaming.py: NATS-based TTS using XTTS HTTP API
- Streaming audio chunks for low-latency playback
- Voice cloning support via reference audio
- Multi-language synthesis
- OpenTelemetry instrumentation with HyperDX support
2026-02-02 06:23:34 -05:00

Streaming TTS Module

A dedicated Text-to-Speech (TTS) service that processes synthesis requests from NATS using Coqui XTTS.

Overview

This module enables real-time text-to-speech synthesis by accepting text via NATS and streaming audio chunks back as they're generated. This reduces latency for voice assistant applications by allowing playback to begin before synthesis completes.

Features

  • NATS Integration: Accepts TTS requests via NATS messaging
  • Streaming Audio: Streams audio chunks back for immediate playback
  • Voice Cloning: Support for custom speaker voices via reference audio
  • Multi-language: Support for multiple languages via XTTS
  • OpenTelemetry: Full observability with tracing and metrics
  • HyperDX Support: Optional cloud observability integration

Architecture

┌─────────────────┐
│   Voice App     │ (voice-assistant, chat-handler)
│                 │
└────────┬────────┘
         │ Text
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.tts.request.{session_id}
│  TTS Request    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TTS Streaming  │ (This Service)
│     Service     │ - Calls XTTS API
│                 │ - Streams audio chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.tts.audio.{session_id}
│  Audio Chunks   │
└─────────────────┘
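The flow above can be sketched as a minimal handler loop. This is not the actual `tts_streaming.py` implementation: the `synthesize` helper is a stub standing in for the XTTS HTTP call, and the localhost NATS URL, `nats-py`, and `msgpack` packages are assumptions.

```python
import asyncio
import base64
import time

CHUNK_SIZE = 32768   # TTS_AUDIO_CHUNK_SIZE default
SAMPLE_RATE = 24000  # TTS_SAMPLE_RATE default

def synthesize(text: str) -> bytes:
    """Stub: the real service POSTs the text to the Coqui XTTS HTTP API."""
    return b"\x00" * 48000  # placeholder PCM bytes

def chunk_audio(audio: bytes, session_id: str) -> list[dict]:
    """Split synthesized audio into ordered chunk messages for the audio subject."""
    pieces = [audio[i:i + CHUNK_SIZE] for i in range(0, len(audio), CHUNK_SIZE)]
    return [
        {
            "session_id": session_id,
            "chunk_index": i,
            "total_chunks": len(pieces),
            "audio_b64": base64.b64encode(piece).decode(),
            "is_last": i == len(pieces) - 1,
            "timestamp": time.time(),
            "sample_rate": SAMPLE_RATE,
        }
        for i, piece in enumerate(pieces)
    ]

async def serve(nats_url: str = "nats://localhost:4222"):
    # Third-party deps assumed installed: nats-py, msgpack.
    import msgpack
    import nats

    nc = await nats.connect(nats_url)

    async def handle(msg):
        # Session ID is the last token of the request subject.
        session_id = msg.subject.rsplit(".", 1)[-1]
        request = msgpack.unpackb(msg.data)
        audio = synthesize(request["text"])
        for chunk in chunk_audio(audio, session_id):
            await nc.publish(f"ai.voice.tts.audio.{session_id}", msgpack.packb(chunk))

    await nc.subscribe("ai.voice.tts.request.>", cb=handle)
    await asyncio.Event().wait()  # keep the service running
```

Publishing each chunk as soon as it is ready is what lets a downstream player start before synthesis finishes.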

NATS Message Protocol

TTS Request (ai.voice.tts.request.{session_id})

All messages use msgpack binary encoding.

Request:

{
    "text": "Hello, how can I help you today?",
    "speaker": "default",  # Optional: speaker ID
    "language": "en",  # Optional: language code
    "speaker_wav_b64": "...",  # Optional: base64 reference audio for voice cloning
    "stream": True  # Optional: stream chunks (default) or send complete audio
}

Audio Output (ai.voice.tts.audio.{session_id})

Streamed Chunk:

{
    "session_id": "unique-session-id",
    "chunk_index": 0,
    "total_chunks": 5,
    "audio_b64": "base64-encoded-audio-chunk",
    "is_last": False,
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}

Complete Audio (when stream=False):

{
    "session_id": "unique-session-id",
    "audio_b64": "base64-encoded-complete-audio",
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
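On the consumer side, chunks must be base64-decoded and ordered by `chunk_index` before (or during) playback. A hedged sketch of a collector, assuming `nats-py` and `msgpack` are installed and NATS is reachable on localhost:

```python
import asyncio
import base64

def reassemble(chunks: list[dict]) -> bytes:
    """Decode streamed chunk messages and concatenate them in chunk_index order."""
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])
    return b"".join(base64.b64decode(c["audio_b64"]) for c in ordered)

async def collect_audio(session_id: str, nats_url: str = "nats://localhost:4222") -> bytes:
    # Third-party deps assumed installed: nats-py, msgpack.
    import msgpack
    import nats

    nc = await nats.connect(nats_url)
    chunks: list[dict] = []
    done = asyncio.Event()

    async def handle(msg):
        chunk = msgpack.unpackb(msg.data)
        chunks.append(chunk)
        if chunk["is_last"]:
            done.set()

    sub = await nc.subscribe(f"ai.voice.tts.audio.{session_id}", cb=handle)
    await done.wait()
    await sub.unsubscribe()
    await nc.close()
    return reassemble(chunks)
```

A latency-sensitive client would feed each chunk to the audio device as it arrives instead of waiting for `is_last`; collecting first is shown here only to keep the sketch short.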

Status Updates (ai.voice.tts.status.{session_id})

{
    "session_id": "unique-session-id",
    "status": "processing",  # processing, completed, error
    "message": "Synthesizing 50 characters",
    "timestamp": 1234567890.123
}

Environment Variables

Variable               Default                                         Description
--------               -------                                         -----------
NATS_URL               nats://nats.ai-ml.svc.cluster.local:4222        NATS server URL
XTTS_URL               http://xtts-predictor.ai-ml.svc.cluster.local   Coqui XTTS service URL
TTS_DEFAULT_SPEAKER    default                                         Default speaker ID
TTS_DEFAULT_LANGUAGE   en                                              Default language code
TTS_AUDIO_CHUNK_SIZE   32768                                           Audio chunk size in bytes
TTS_SAMPLE_RATE        24000                                           Audio sample rate (Hz)
OTEL_ENABLED           true                                            Enable OpenTelemetry
HYPERDX_ENABLED        false                                           Enable HyperDX observability
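These variables could be loaded into a single settings object at startup. A minimal sketch with the defaults from the table; the `Settings` class and its field names are illustrative, not the module's actual config code:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Service configuration resolved from environment variables (defaults per the table)."""
    nats_url: str = os.getenv("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
    xtts_url: str = os.getenv("XTTS_URL", "http://xtts-predictor.ai-ml.svc.cluster.local")
    speaker: str = os.getenv("TTS_DEFAULT_SPEAKER", "default")
    language: str = os.getenv("TTS_DEFAULT_LANGUAGE", "en")
    chunk_size: int = int(os.getenv("TTS_AUDIO_CHUNK_SIZE", "32768"))
    sample_rate: int = int(os.getenv("TTS_SAMPLE_RATE", "24000"))
    otel_enabled: bool = os.getenv("OTEL_ENABLED", "true").lower() == "true"
    hyperdx_enabled: bool = os.getenv("HYPERDX_ENABLED", "false").lower() == "true"
```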

Building

docker build -t tts-module:latest .

Testing

# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Send TTS request
python -c "
import nats
import msgpack
import asyncio

async def test():
    nc = await nats.connect('nats://localhost:4222')
    
    request = {
        'text': 'Hello, this is a test of text to speech.',
        'stream': True
    }
    
    await nc.publish(
        'ai.voice.tts.request.test-session',
        msgpack.packb(request)
    )
    await nc.close()

asyncio.run(test())
"

# Subscribe to audio output
nats sub "ai.voice.tts.audio.>"

Voice Cloning

To use a custom voice, provide reference audio in the request:

import base64

with open("reference_voice.wav", "rb") as f:
    speaker_wav_b64 = base64.b64encode(f.read()).decode()

request = {
    "text": "This will sound like the reference voice.",
    "speaker_wav_b64": speaker_wav_b64
}
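The snippet above only builds the request dict. Actually sending it follows the same msgpack-over-NATS path as any other request; a hedged sketch, assuming `nats-py` and `msgpack` are installed and NATS is port-forwarded to localhost:

```python
import asyncio
import base64

def build_cloning_request(wav_bytes: bytes, text: str) -> dict:
    """Embed reference audio in a TTS request to clone that speaker's voice."""
    return {"text": text, "speaker_wav_b64": base64.b64encode(wav_bytes).decode()}

async def send_request(request: dict, session_id: str = "clone-demo",
                       nats_url: str = "nats://localhost:4222"):
    # Third-party deps assumed installed: nats-py, msgpack.
    import msgpack
    import nats

    nc = await nats.connect(nats_url)
    await nc.publish(f"ai.voice.tts.request.{session_id}", msgpack.packb(request))
    await nc.flush()  # make sure the buffered publish goes out before closing
    await nc.close()
```

The synthesized audio then arrives on ai.voice.tts.audio.clone-demo as usual.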

License

MIT
