Streaming STT Module
A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.
Overview
This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.
Features
- Live Audio Streaming: Accepts audio chunks via NATS as they're captured
- Incremental Processing: Transcribes audio as soon as sufficient data is buffered
- Session Management: Handles multiple concurrent streaming sessions
- Automatic Buffer Management: Processes audio based on size thresholds or timeout
- Partial Results: Publishes transcription results progressively during long streams
- Voice Activity Detection (VAD): Detects speech vs silence to optimize processing
- Interrupt Detection: Detects when user speaks during LLM response and switches back to listening mode
- Speaker Tracking: Support for speaker identification in multi-speaker scenarios
- State Management: Tracks listening/responding states for proper interrupt handling
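The size-or-timeout buffering described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code; the `AudioBuffer` class is hypothetical, with its threshold and timeout mirroring the `STT_BUFFER_SIZE_BYTES` and `STT_CHUNK_TIMEOUT` settings documented below:

```python
import time

class AudioBuffer:
    """Minimal sketch of size-or-timeout flushing for one streaming session."""

    def __init__(self, max_bytes: int = 512_000, timeout_s: float = 2.0):
        self.max_bytes = max_bytes      # mirrors STT_BUFFER_SIZE_BYTES
        self.timeout_s = timeout_s      # mirrors STT_CHUNK_TIMEOUT
        self.chunks: list[bytes] = []
        self.size = 0
        self.last_chunk_at = time.monotonic()

    def add(self, chunk: bytes) -> None:
        self.chunks.append(chunk)
        self.size += len(chunk)
        self.last_chunk_at = time.monotonic()

    def should_flush(self) -> bool:
        # Flush when enough audio is buffered, or the stream has gone quiet.
        idle = time.monotonic() - self.last_chunk_at
        return self.size >= self.max_bytes or (self.size > 0 and idle >= self.timeout_s)

    def drain(self) -> bytes:
        # Hand the buffered audio to the transcriber and reset.
        audio, self.chunks, self.size = b"".join(self.chunks), [], 0
        return audio
```

The timeout path is what lets a short utterance be transcribed promptly even when the buffer never reaches the size threshold.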
Architecture
┌─────────────────┐
│ Audio Source │ (Frontend, Mobile App, etc.)
│ (Microphone) │
└────────┬────────┘
│ Chunks
▼
┌─────────────────┐
│ NATS Subject │ ai.voice.stream.{session_id}
│ Audio Stream │
└────────┬────────┘
│
▼
┌─────────────────┐
│ STT Streaming │ (This Service)
│ Service │ - Buffers chunks
│ │ - Transcribes when ready
└────────┬────────┘
│
▼
┌─────────────────┐
│ NATS Subject │ ai.voice.transcription.{session_id}
│ Transcription │
└─────────────────┘
Variants
stt_streaming.py (HTTP Backend)
Uses an external Whisper service via HTTP. The container stays lightweight by delegating GPU inference to a separate service.
stt_streaming_local.py (ROCm Backend)
Runs Whisper locally on AMD GPU using ROCm/PyTorch. Single container with embedded model.
NATS Message Protocol
Audio Stream Input (ai.voice.stream.{session_id})
All messages use msgpack binary encoding.
Start Stream:
{
"type": "start",
"session_id": "unique-session-id",
"sample_rate": 16000,
"channels": 1,
"state": "listening", # Optional: "listening" or "responding"
"speaker_id": "speaker-1" # Optional: identifier for speaker tracking
}
Audio Chunk:
{
"type": "chunk",
"audio_b64": "base64-encoded-audio-data",
"timestamp": 1234567890.123
}
State Change:
{
"type": "state_change",
"state": "responding" # "listening" or "responding"
}
End Stream:
{
"type": "end"
}
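As a concrete example of producing a chunk message, the sketch below builds one from raw PCM bytes. The `make_chunk_message` helper is hypothetical; field names follow the protocol above, and the resulting dict would then be serialized with `msgpack.packb()` and published to `ai.voice.stream.{session_id}` (the msgpack step is omitted here to keep the snippet dependency-free):

```python
import base64
import time

def make_chunk_message(pcm: bytes) -> dict:
    """Build a 'chunk' message from raw PCM audio bytes (sketch).

    The returned dict would be serialized with msgpack.packb() and
    published to ai.voice.stream.{session_id}.
    """
    return {
        "type": "chunk",
        "audio_b64": base64.b64encode(pcm).decode("ascii"),
        "timestamp": time.time(),
    }

# 320 bytes of 16 kHz mono 16-bit PCM is roughly 10 ms of audio.
msg = make_chunk_message(b"\x00\x01" * 160)
```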
Transcription Output (ai.voice.transcription.{session_id})
Transcription Result:
{
"session_id": "unique-session-id",
"transcript": "transcribed text",
"sequence": 0,
"is_partial": False,
"is_final": True,
"timestamp": 1234567890.123,
"speaker_id": "speaker-1", # If provided in start message
"has_voice_activity": True,
"state": "listening"
}
Interrupt Notification:
{
"session_id": "unique-session-id",
"type": "interrupt",
"timestamp": 1234567890.123,
"speaker_id": "speaker-1"
}
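A consumer of the transcription subject has to distinguish partial results, final results, and interrupts. A minimal dispatch sketch, where `msg` is the msgpack-decoded payload and the `handle_transcription` helper with its return tags is hypothetical:

```python
def handle_transcription(msg: dict) -> str:
    """Classify one decoded message from ai.voice.transcription.{session_id}.

    Hypothetical helper: returns an action tag for the caller to act on.
    """
    if msg.get("type") == "interrupt":
        # User spoke while the LLM was responding: stop TTS playback
        # and switch the pipeline back to listening.
        return "stop_playback"
    if msg.get("is_final"):
        # Complete utterance: hand the transcript to the LLM.
        return "dispatch_to_llm"
    if msg.get("is_partial"):
        # Progressive result: useful for live captions or UI feedback.
        return "update_ui"
    return "ignore"
```

In a real consumer this would run inside the NATS subscription callback, after `msgpack.unpackb()` on the raw message bytes.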
Environment Variables
| Variable | Default | Description |
|---|---|---|
| NATS_URL | nats://nats.ai-ml.svc.cluster.local:4222 | NATS server URL |
| WHISPER_URL | http://whisper-predictor.ai-ml.svc.cluster.local | Whisper service URL (HTTP variant) |
| WHISPER_MODEL_SIZE | medium | Whisper model size (ROCm variant) |
| WHISPER_DEVICE | cuda | PyTorch device (ROCm variant) |
| STT_BUFFER_SIZE_BYTES | 512000 | Buffer size before processing (~5s) |
| STT_CHUNK_TIMEOUT | 2.0 | Seconds of silence before processing |
| STT_ENABLE_VAD | true | Enable voice activity detection |
| STT_VAD_AGGRESSIVENESS | 2 | VAD aggressiveness (0-3) |
| STT_ENABLE_INTERRUPT_DETECTION | true | Enable interrupt detection |
| OTEL_ENABLED | true | Enable OpenTelemetry |
| HYPERDX_ENABLED | false | Enable HyperDX observability |
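The service reads these settings from the environment at startup. A sketch of how the documented defaults could map to code (the `load_config` helper and key names are hypothetical; the variable names and defaults come from the table above):

```python
import os

def load_config() -> dict:
    """Read the module's settings with the documented defaults (sketch)."""
    return {
        "nats_url": os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "buffer_size_bytes": int(os.environ.get("STT_BUFFER_SIZE_BYTES", "512000")),
        "chunk_timeout_s": float(os.environ.get("STT_CHUNK_TIMEOUT", "2.0")),
        "enable_vad": os.environ.get("STT_ENABLE_VAD", "true").lower() == "true",
        "vad_aggressiveness": int(os.environ.get("STT_VAD_AGGRESSIVENESS", "2")),
        "enable_interrupt": os.environ.get("STT_ENABLE_INTERRUPT_DETECTION", "true").lower() == "true",
    }
```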
Building
HTTP Variant
docker build -t stt-module:latest .
ROCm Variant (AMD GPU)
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
Testing
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222
# Start a session
python -c "
import asyncio
import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'start'}))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"
# Subscribe to transcriptions
nats sub "ai.voice.transcription.>"
License
MIT