Streaming STT Module
A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.
Overview
This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.
Features
- Live Audio Streaming: Accepts audio chunks via NATS as they're captured
- Incremental Processing: Transcribes audio as soon as sufficient data is buffered
- Session Management: Handles multiple concurrent streaming sessions
- Automatic Buffer Management: Processes audio based on size thresholds or timeout
- Partial Results: Publishes transcription results progressively during long streams
- Voice Activity Detection (VAD): Detects speech vs silence to optimize processing
- Interrupt Detection: Detects when user speaks during LLM response and switches back to listening mode
- Speaker Tracking: Support for speaker identification in multi-speaker scenarios
- State Management: Tracks listening/responding states for proper interrupt handling
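The size-or-timeout buffering described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code; the `AudioBuffer` class is hypothetical, with its threshold and timeout mirroring the `STT_BUFFER_SIZE_BYTES` and `STT_CHUNK_TIMEOUT` settings documented below:

```python
import time

class AudioBuffer:
    """Minimal sketch of size-or-timeout flushing for one streaming session."""

    def __init__(self, max_bytes: int = 512_000, timeout_s: float = 2.0):
        self.max_bytes = max_bytes      # mirrors STT_BUFFER_SIZE_BYTES
        self.timeout_s = timeout_s      # mirrors STT_CHUNK_TIMEOUT
        self.chunks: list[bytes] = []
        self.size = 0
        self.last_chunk_at = time.monotonic()

    def add(self, chunk: bytes) -> None:
        self.chunks.append(chunk)
        self.size += len(chunk)
        self.last_chunk_at = time.monotonic()

    def should_flush(self) -> bool:
        # Flush when enough audio is buffered, or the stream has gone quiet.
        idle = time.monotonic() - self.last_chunk_at
        return self.size >= self.max_bytes or (self.size > 0 and idle >= self.timeout_s)

    def drain(self) -> bytes:
        # Hand the buffered audio to the transcriber and reset.
        audio, self.chunks, self.size = b"".join(self.chunks), [], 0
        return audio
```

The timeout path is what lets a short utterance be transcribed promptly even when the buffer never reaches the size threshold.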
Architecture
┌─────────────────┐
│ Audio Source │ (Frontend, Mobile App, etc.)
│ (Microphone) │
└────────┬────────┘
│ Chunks
▼
┌─────────────────┐
│ NATS Subject │ ai.voice.stream.{session_id}
│ Audio Stream │
└────────┬────────┘
│
▼
┌─────────────────┐
│ STT Streaming │ (This Service)
│ Service │ - Buffers chunks
│ │ - Transcribes when ready
└────────┬────────┘
│
▼
┌─────────────────┐
│ NATS Subject │ ai.voice.transcription.{session_id}
│ Transcription │
└─────────────────┘
Variants
stt_streaming.py (HTTP Backend)
Uses an external Whisper service via HTTP. The container stays lightweight by delegating GPU inference to a separate service.
stt_streaming_local.py (ROCm Backend)
Runs Whisper locally on AMD GPU using ROCm/PyTorch. Single container with embedded model.
NATS Message Protocol
Audio Stream Input (ai.voice.stream.{session_id})
All messages use msgpack binary encoding.
Start Stream:
{
"type": "start",
"session_id": "unique-session-id",
"sample_rate": 16000,
"channels": 1,
"state": "listening", # Optional: "listening" or "responding"
"speaker_id": "speaker-1" # Optional: identifier for speaker tracking
}
Audio Chunk:
{
"type": "chunk",
"audio_b64": "base64-encoded-audio-data",
"timestamp": 1234567890.123
}
State Change:
{
"type": "state_change",
"state": "responding" # "listening" or "responding"
}
End Stream:
{
"type": "end"
}
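As a concrete example of producing a chunk message, the sketch below builds one from raw PCM bytes. The `make_chunk_message` helper is hypothetical; field names follow the protocol above, and the resulting dict would then be serialized with `msgpack.packb()` and published to `ai.voice.stream.{session_id}` (the msgpack step is omitted here to keep the snippet dependency-free):

```python
import base64
import time

def make_chunk_message(pcm: bytes) -> dict:
    """Build a 'chunk' message from raw PCM audio bytes (sketch).

    The returned dict would be serialized with msgpack.packb() and
    published to ai.voice.stream.{session_id}.
    """
    return {
        "type": "chunk",
        "audio_b64": base64.b64encode(pcm).decode("ascii"),
        "timestamp": time.time(),
    }

# 320 bytes of 16 kHz mono 16-bit PCM is roughly 10 ms of audio.
msg = make_chunk_message(b"\x00\x01" * 160)
```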
Transcription Output (ai.voice.transcription.{session_id})
Transcription Result:
{
"session_id": "unique-session-id",
"transcript": "transcribed text",
"sequence": 0,
"is_partial": False,
"is_final": True,
"timestamp": 1234567890.123,
"speaker_id": "speaker-1", # If provided in start message
"has_voice_activity": True,
"state": "listening"
}
Interrupt Notification:
{
"session_id": "unique-session-id",
"type": "interrupt",
"timestamp": 1234567890.123,
"speaker_id": "speaker-1"
}
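A consumer of the transcription subject has to distinguish partial results, final results, and interrupts. A minimal dispatch sketch, where `msg` is the msgpack-decoded payload and the `handle_transcription` helper with its return tags is hypothetical:

```python
def handle_transcription(msg: dict) -> str:
    """Classify one decoded message from ai.voice.transcription.{session_id}.

    Hypothetical helper: returns an action tag for the caller to act on.
    """
    if msg.get("type") == "interrupt":
        # User spoke while the LLM was responding: stop TTS playback
        # and switch the pipeline back to listening.
        return "stop_playback"
    if msg.get("is_final"):
        # Complete utterance: hand the transcript to the LLM.
        return "dispatch_to_llm"
    if msg.get("is_partial"):
        # Progressive result: useful for live captions or UI feedback.
        return "update_ui"
    return "ignore"
```

In a real consumer this would run inside the NATS subscription callback, after `msgpack.unpackb()` on the raw message bytes.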
Environment Variables
| Variable | Default | Description |
|---|---|---|
| NATS_URL | nats://nats.ai-ml.svc.cluster.local:4222 | NATS server URL |
| WHISPER_URL | http://whisper-predictor.ai-ml.svc.cluster.local | Whisper service URL (HTTP variant) |
| WHISPER_MODEL_SIZE | medium | Whisper model size (ROCm variant) |
| WHISPER_DEVICE | cuda | PyTorch device (ROCm variant) |
| STT_BUFFER_SIZE_BYTES | 512000 | Buffer size before processing (~5s) |
| STT_CHUNK_TIMEOUT | 2.0 | Seconds of silence before processing |
| STT_ENABLE_VAD | true | Enable voice activity detection |
| STT_VAD_AGGRESSIVENESS | 2 | VAD aggressiveness (0-3) |
| STT_ENABLE_INTERRUPT_DETECTION | true | Enable interrupt detection |
| OTEL_ENABLED | true | Enable OpenTelemetry |
| HYPERDX_ENABLED | false | Enable HyperDX observability |
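The service reads these settings from the environment at startup. A sketch of how the documented defaults could map to code (the `load_config` helper and key names are hypothetical; the variable names and defaults come from the table above):

```python
import os

def load_config() -> dict:
    """Read the module's settings with the documented defaults (sketch)."""
    return {
        "nats_url": os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "buffer_size_bytes": int(os.environ.get("STT_BUFFER_SIZE_BYTES", "512000")),
        "chunk_timeout_s": float(os.environ.get("STT_CHUNK_TIMEOUT", "2.0")),
        "enable_vad": os.environ.get("STT_ENABLE_VAD", "true").lower() == "true",
        "vad_aggressiveness": int(os.environ.get("STT_VAD_AGGRESSIVENESS", "2")),
        "enable_interrupt": os.environ.get("STT_ENABLE_INTERRUPT_DETECTION", "true").lower() == "true",
    }
```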
Building
HTTP Variant
docker build -t stt-module:latest .
ROCm Variant (AMD GPU)
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
Testing
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222
# Start a session
python -c "
import asyncio
import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'start'}))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"
# Subscribe to transcriptions
nats sub "ai.voice.transcription.>"
License
MIT