
Streaming STT Module

A dedicated Speech-to-Text (STT) service that transcribes live audio streamed over NATS, reducing transcription latency compared to file-based processing.

Overview

This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.

Features

  • Live Audio Streaming: Accepts audio chunks via NATS as they're captured
  • Incremental Processing: Transcribes audio as soon as sufficient data is buffered
  • Session Management: Handles multiple concurrent streaming sessions
  • Automatic Buffer Management: Processes audio based on size thresholds or timeout
  • Partial Results: Publishes transcription results progressively during long streams
  • Voice Activity Detection (VAD): Detects speech vs silence to optimize processing
  • Interrupt Detection: Detects when the user speaks during an LLM response and switches back to listening mode
  • Speaker Tracking: Support for speaker identification in multi-speaker scenarios
  • State Management: Tracks listening/responding states for proper interrupt handling
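
The buffer-management behaviour above (flush when a size threshold is reached, or after a silence timeout) can be sketched roughly as follows. This is an illustrative sketch, not the service's actual implementation: the `StreamBuffer` class and method names are invented here, and the thresholds mirror the `STT_BUFFER_SIZE_BYTES` / `STT_CHUNK_TIMEOUT` defaults documented below.

```python
import time

# Illustrative defaults mirroring STT_BUFFER_SIZE_BYTES / STT_CHUNK_TIMEOUT.
BUFFER_SIZE_BYTES = 512_000
CHUNK_TIMEOUT_S = 2.0

class StreamBuffer:
    """Accumulates raw audio chunks for one session and decides when to flush."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []
        self.total_bytes = 0
        self.last_chunk_at = time.monotonic()

    def add(self, audio: bytes) -> None:
        self.chunks.append(audio)
        self.total_bytes += len(audio)
        self.last_chunk_at = time.monotonic()

    def should_flush(self) -> bool:
        # Flush when enough audio is buffered, or after a silence timeout.
        if self.total_bytes >= BUFFER_SIZE_BYTES:
            return True
        idle = time.monotonic() - self.last_chunk_at
        return self.total_bytes > 0 and idle >= CHUNK_TIMEOUT_S

    def drain(self) -> bytes:
        audio = b"".join(self.chunks)
        self.chunks.clear()
        self.total_bytes = 0
        return audio
```

One such buffer would be kept per session, with a periodic task checking `should_flush()` and handing the drained audio to the transcriber.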

Architecture

┌─────────────────┐
│  Audio Source   │ (Frontend, Mobile App, etc.)
│  (Microphone)   │
└────────┬────────┘
         │ Chunks
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.stream.{session_id}
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  STT Streaming  │ (This Service)
│     Service     │ - Buffers chunks
│                 │ - Transcribes when ready
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │ ai.voice.transcription.{session_id}
│  Transcription  │
└─────────────────┘

Variants

stt_streaming.py (HTTP Backend)

Uses an external Whisper service via HTTP. Lightweight container, delegates GPU inference to a separate service.

stt_streaming_local.py (ROCm Backend)

Runs Whisper locally on AMD GPU using ROCm/PyTorch. Single container with embedded model.

NATS Message Protocol

Audio Stream Input (ai.voice.stream.{session_id})

All messages use msgpack binary encoding.

Start Stream:

{
    "type": "start",
    "session_id": "unique-session-id",
    "sample_rate": 16000,
    "channels": 1,
    "state": "listening",  # Optional: "listening" or "responding"
    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
}

Audio Chunk:

{
    "type": "chunk",
    "audio_b64": "base64-encoded-audio-data",
    "timestamp": 1234567890.123
}
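
A client would build each chunk message like this (a sketch; the helper name is illustrative, and it assumes the msgpack PyPI package). Note that the audio bytes are base64-encoded inside the msgpack payload, per the schema above:

```python
import base64
import time

import msgpack  # pip install msgpack

def make_chunk_message(audio: bytes) -> bytes:
    """Build a msgpack-encoded 'chunk' message for ai.voice.stream.{session_id}."""
    return msgpack.packb({
        "type": "chunk",
        "audio_b64": base64.b64encode(audio).decode("ascii"),
        "timestamp": time.time(),
    })
```

With nats-py this would be published as `await nc.publish(f"ai.voice.stream.{session_id}", make_chunk_message(pcm_bytes))`.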

State Change:

{
    "type": "state_change",
    "state": "responding"  # "listening" or "responding"
}

End Stream:

{
    "type": "end"
}

Transcription Output (ai.voice.transcription.{session_id})

Transcription Result:

{
    "session_id": "unique-session-id",
    "transcript": "transcribed text",
    "sequence": 0,
    "is_partial": False,
    "is_final": True,
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1",  # If provided in start message
    "has_voice_activity": True,
    "state": "listening"
}

Interrupt Notification:

{
    "session_id": "unique-session-id",
    "type": "interrupt",
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1"
}
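
Since both message types arrive on the same subject, a downstream consumer can distinguish them by checking the "type" field. A minimal sketch (the `describe` helper is illustrative; it assumes the msgpack PyPI package):

```python
import msgpack  # pip install msgpack

def describe(payload: bytes) -> str:
    """Classify a msgpack message from ai.voice.transcription.{session_id}."""
    data = msgpack.unpackb(payload)
    if data.get("type") == "interrupt":
        # User started speaking during the LLM response.
        return f"interrupt from {data.get('speaker_id')}"
    flag = "final" if data.get("is_final") else "partial"
    return f"[{flag}] {data['transcript']}"
```

With nats-py, this would be wired into `nc.subscribe(f"ai.voice.transcription.{session_id}", cb=handler)` where the handler calls `describe(msg.data)`.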

Environment Variables

Variable                          Default                                           Description
NATS_URL                          nats://nats.ai-ml.svc.cluster.local:4222          NATS server URL
WHISPER_URL                       http://whisper-predictor.ai-ml.svc.cluster.local  Whisper service URL (HTTP variant)
WHISPER_MODEL_SIZE                medium                                            Whisper model size (ROCm variant)
WHISPER_DEVICE                    cuda                                              PyTorch device (ROCm variant)
STT_BUFFER_SIZE_BYTES             512000                                            Buffer size before processing (~5s)
STT_CHUNK_TIMEOUT                 2.0                                               Seconds of silence before processing
STT_ENABLE_VAD                    true                                              Enable voice activity detection
STT_VAD_AGGRESSIVENESS            2                                                 VAD aggressiveness (0-3)
STT_ENABLE_INTERRUPT_DETECTION    true                                              Enable interrupt detection
OTEL_ENABLED                      true                                              Enable OpenTelemetry
HYPERDX_ENABLED                   false                                             Enable HyperDX observability

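In Python, these can be read at startup with plain os.environ lookups; a minimal sketch (variable names and defaults taken from the table above, the `env` helper is illustrative):

```python
import os

def env(name: str, default: str) -> str:
    """Return an environment variable, falling back to a default."""
    return os.environ.get(name, default)

NATS_URL = env("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
BUFFER_SIZE_BYTES = int(env("STT_BUFFER_SIZE_BYTES", "512000"))
CHUNK_TIMEOUT = float(env("STT_CHUNK_TIMEOUT", "2.0"))
ENABLE_VAD = env("STT_ENABLE_VAD", "true").lower() == "true"
VAD_AGGRESSIVENESS = int(env("STT_VAD_AGGRESSIVENESS", "2"))
```
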
Building

HTTP Variant

docker build -t stt-module:latest .

ROCm Variant (AMD GPU)

docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .

Testing

# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Start a session
python -c "
import asyncio
import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')
    start = {'type': 'start', 'session_id': 'test-session',
             'sample_rate': 16000, 'channels': 1}
    await nc.publish('ai.voice.stream.test-session', msgpack.packb(start))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"

# Subscribe to transcriptions
nats sub "ai.voice.transcription.>"
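To exercise the service with real audio, the full start/chunk/end sequence can be generated from a local WAV file. A sketch, assuming a 16 kHz mono WAV file and the msgpack PyPI package; the `wav_messages` helper and chunk size are illustrative, but the message fields follow the protocol documented above:

```python
import base64
import time
import wave
from typing import Iterator

import msgpack  # pip install msgpack

def wav_messages(path: str, session_id: str,
                 frames_per_chunk: int = 4000) -> Iterator[bytes]:
    """Yield the msgpack-encoded start/chunk/end messages for one WAV file."""
    with wave.open(path, "rb") as wav:
        yield msgpack.packb({
            "type": "start",
            "session_id": session_id,
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
        })
        # ~0.25 s per chunk at 16 kHz with the default frames_per_chunk.
        while frames := wav.readframes(frames_per_chunk):
            yield msgpack.packb({
                "type": "chunk",
                "audio_b64": base64.b64encode(frames).decode("ascii"),
                "timestamp": time.time(),
            })
    yield msgpack.packb({"type": "end"})
```

Each yielded message would then be published to `ai.voice.stream.{session_id}` with nats-py, and the transcriptions observed on the `nats sub` subscription above.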

License

MIT
