
Streaming TTS Module

A dedicated Text-to-Speech (TTS) service that processes synthesis requests from NATS using Coqui XTTS.

Overview

This module enables real-time text-to-speech synthesis by accepting text via NATS and streaming audio chunks back as they're generated. This reduces latency for voice assistant applications by allowing playback to begin before synthesis completes.

Features

  • NATS Integration: Accepts TTS requests via NATS messaging
  • Streaming Audio: Streams audio chunks back for immediate playback
  • Voice Cloning: Support for custom speaker voices via reference audio
  • Custom Trained Voices: Automatic discovery of voices trained by the coqui-voice-training Argo workflow
  • Voice Registry: Lists available voices and refreshes on-demand or periodically
  • Multi-language: Support for multiple languages via XTTS
  • OpenTelemetry: Full observability with tracing and metrics
  • HyperDX Support: Optional cloud observability integration

Architecture

┌─────────────────┐
│   Voice App     │ (voice-assistant, chat-handler)
│                 │
└────────┬────────┘
         │ Text
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.tts.request.{session_id}
│  TTS Request    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  TTS Streaming  │ (This Service)
│     Service     │ - Calls XTTS API
│                 │ - Streams audio chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.tts.audio.{session_id}
│  Audio Chunks   │
└─────────────────┘

NATS Message Protocol

TTS Request (ai.voice.tts.request.{session_id})

All messages use msgpack binary encoding.

Request:

{
    "text": "Hello, how can I help you today?",
    "speaker": "default",  # Optional: speaker ID or custom voice name
    "language": "en",  # Optional: language code
    "speaker_wav_b64": "...",  # Optional: base64 reference audio for ad-hoc voice cloning
    "stream": True  # Optional: stream chunks (default) or send complete audio
}

Custom voices: When "speaker" matches the name of a custom trained voice in the voice registry, the service automatically routes the request to the trained model; no "speaker_wav_b64" is needed for trained voices.

Audio Output (ai.voice.tts.audio.{session_id})

Streamed Chunk:

{
    "session_id": "unique-session-id",
    "chunk_index": 0,
    "total_chunks": 5,
    "audio_b64": "base64-encoded-audio-chunk",
    "is_last": False,
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}
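A consumer that buffers a whole utterance (rather than playing chunks as they arrive) needs to base64-decode each audio_b64 field and concatenate the chunks in chunk_index order. A minimal consumer-side sketch; the helper name is illustrative, not part of the service API:

```python
import base64

def reassemble_chunks(chunks):
    """Return the full audio bytes from a list of streamed-chunk dicts.

    Each dict follows the streamed-chunk schema above (chunk_index,
    audio_b64, is_last, ...) and is assumed to be already decoded
    from msgpack. Chunks are sorted by chunk_index before joining,
    so collection order does not matter.
    """
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])
    return b"".join(base64.b64decode(c["audio_b64"]) for c in ordered)
```

In practice a subscriber would append each decoded message to a list until it sees is_last, then call this once.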

Complete Audio (when stream=False):

{
    "session_id": "unique-session-id",
    "audio_b64": "base64-encoded-complete-audio",
    "timestamp": 1234567890.123,
    "sample_rate": 24000
}

Status Updates (ai.voice.tts.status.{session_id})

{
    "session_id": "unique-session-id",
    "status": "processing",  # processing, completed, error
    "message": "Synthesizing 50 characters",
    "timestamp": 1234567890.123
}
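All three subjects share the ai.voice.tts prefix and differ only in the verb and per-session suffix, so a client can derive every subject it needs from one session ID. A small illustrative helper (an assumption for convenience, not part of the service API):

```python
def tts_subjects(session_id):
    """Build the per-session NATS subject names documented above."""
    base = "ai.voice.tts"
    return {
        "request": f"{base}.request.{session_id}",  # publish TTS requests here
        "audio": f"{base}.audio.{session_id}",      # subscribe for audio chunks
        "status": f"{base}.status.{session_id}",    # subscribe for status updates
    }
```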

Environment Variables

Variable Default Description
NATS_URL nats://nats.ai-ml.svc.cluster.local:4222 NATS server URL
XTTS_URL http://xtts-predictor.ai-ml.svc.cluster.local Coqui XTTS service URL
TTS_DEFAULT_SPEAKER default Default speaker ID
TTS_DEFAULT_LANGUAGE en Default language code
TTS_AUDIO_CHUNK_SIZE 32768 Audio chunk size in bytes
TTS_SAMPLE_RATE 24000 Audio sample rate (Hz)
VOICE_MODEL_STORE /models/tts/custom Path to custom voice models (NFS mount)
VOICE_REGISTRY_REFRESH_SECONDS 300 Interval to rescan model store for new voices
OTEL_ENABLED true Enable OpenTelemetry
HYPERDX_ENABLED false Enable HyperDX observability
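These settings can be read straight from the environment, falling back to the documented defaults. A minimal sketch of config loading; the actual service may structure this differently:

```python
import os

def load_tts_config(env=None):
    """Read the settings from the table above, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "nats_url": env.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "xtts_url": env.get("XTTS_URL", "http://xtts-predictor.ai-ml.svc.cluster.local"),
        "default_speaker": env.get("TTS_DEFAULT_SPEAKER", "default"),
        "default_language": env.get("TTS_DEFAULT_LANGUAGE", "en"),
        "audio_chunk_size": int(env.get("TTS_AUDIO_CHUNK_SIZE", "32768")),
        "sample_rate": int(env.get("TTS_SAMPLE_RATE", "24000")),
        "voice_model_store": env.get("VOICE_MODEL_STORE", "/models/tts/custom"),
        "registry_refresh_seconds": int(env.get("VOICE_REGISTRY_REFRESH_SECONDS", "300")),
        "otel_enabled": env.get("OTEL_ENABLED", "true").lower() == "true",
        "hyperdx_enabled": env.get("HYPERDX_ENABLED", "false").lower() == "true",
    }
```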

Building

docker build -t tts-module:latest .

Testing

# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Send TTS request
python -c "
import nats
import msgpack
import asyncio

async def test():
    nc = await nats.connect('nats://localhost:4222')
    
    request = {
        'text': 'Hello, this is a test of text to speech.',
        'stream': True
    }
    
    await nc.publish(
        'ai.voice.tts.request.test-session',
        msgpack.packb(request)
    )
    await nc.close()

asyncio.run(test())
"

# Subscribe to audio output
nats sub "ai.voice.tts.audio.>"
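To inspect the synthesized output by ear, the decoded bytes need a WAV header before most players will open them. A sketch using the standard wave module, assuming the payload is headerless 16-bit mono PCM at the documented 24000 Hz rate (adjust the parameters if the service emits a different format):

```python
import wave

def write_wav(path, pcm_bytes, sample_rate=24000):
    """Wrap raw PCM bytes in a WAV container for playback/inspection.

    Assumes 16-bit mono PCM; this is an assumption about the service's
    output format, not something the protocol guarantees.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```

Feed it the concatenated, base64-decoded chunks from the audio subject to get a playable file.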

Voice Cloning

To use a custom voice, provide reference audio in the request:

import base64

with open("reference_voice.wav", "rb") as f:
    speaker_wav_b64 = base64.b64encode(f.read()).decode()

request = {
    "text": "This will sound like the reference voice.",
    "speaker_wav_b64": speaker_wav_b64
}

Custom Trained Voices

The coqui-voice-training Argo workflow trains custom TTS models and exports them to the model store (VOICE_MODEL_STORE, default /models/tts/custom). The TTS module discovers these voices automatically on startup and periodically re-scans for newly trained voices.

How it works

  1. The voice training pipeline exports a model to /models/tts/custom/{voice-name}/
  2. Each directory contains model.pth, config.json, and model_info.json
  3. The TTS module scans the store and registers each voice by name
  4. Requests with "speaker": "my-voice" automatically route to the trained model
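The discovery step above amounts to a directory scan over the model store. A simplified sketch of step 3, assuming a directory counts as a voice when all three exported files are present (the real service may apply further validation):

```python
import os

def scan_voice_registry(store_path):
    """Return the names of custom voices found under the model store.

    A directory is treated as a voice if it contains the files the
    training pipeline exports: model.pth, config.json, model_info.json.
    """
    required = {"model.pth", "config.json", "model_info.json"}
    voices = []
    for entry in sorted(os.listdir(store_path)):
        voice_dir = os.path.join(store_path, entry)
        if os.path.isdir(voice_dir) and required.issubset(os.listdir(voice_dir)):
            voices.append(entry)
    return voices
```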

Using a trained voice

# Just set the speaker to the voice name — no reference audio needed
request = {
    "text": "This uses a fine-tuned voice model.",
    "speaker": "my-custom-voice"  # Matches {voice-name} from training pipeline
}

Listing available voices

Send a NATS request to ai.voice.tts.voices.list:

import nats
import msgpack
import asyncio

async def list_voices():
    nc = await nats.connect("nats://localhost:4222")
    resp = await nc.request("ai.voice.tts.voices.list", b"", timeout=5)
    data = msgpack.unpackb(resp.data, raw=False)
    print(f"Default speaker: {data['default_speaker']}")
    for voice in data["custom_voices"]:
        print(f"  - {voice['name']} ({voice['language']}, trained {voice['created_at']})")
    await nc.close()

asyncio.run(list_voices())

Refreshing the voice registry

Voices are re-scanned every VOICE_REGISTRY_REFRESH_SECONDS (default 5 min). To trigger an immediate refresh, publish to ai.voice.tts.voices.refresh:

# Assumes an open NATS connection `nc`, as in the voice-listing example above
resp = await nc.request("ai.voice.tts.voices.refresh", b"", timeout=10)
data = msgpack.unpackb(resp.data, raw=False)
print(f"Found {data['count']} custom voice(s)")

License

MIT
