feat: add streaming STT service with Whisper backend

- stt_streaming.py: HTTP-based STT using external Whisper service
- stt_streaming_local.py: ROCm-based local Whisper inference
- Voice Activity Detection (VAD) with WebRTC
- Interrupt detection for barge-in support
- Session state management (listening/responding)
- OpenTelemetry instrumentation with HyperDX support
- Dockerfile variants for HTTP and ROCm deployments
Commit 8fc5eb1193 (parent 680e43fe39), 2026-02-02 06:23:12 -05:00
9 changed files with 1473 additions and 1 deletion

README.md (182 lines added):
# Streaming STT Module
A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.
## Overview
This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.
## Features
- **Live Audio Streaming**: Accepts audio chunks via NATS as they're captured
- **Incremental Processing**: Transcribes audio as soon as sufficient data is buffered
- **Session Management**: Handles multiple concurrent streaming sessions
- **Automatic Buffer Management**: Processes audio based on size thresholds or timeout
- **Partial Results**: Publishes transcription results progressively during long streams
- **Voice Activity Detection (VAD)**: Detects speech vs silence to optimize processing
- **Interrupt Detection**: Detects when the user speaks during an LLM response and switches back to listening mode
- **Speaker Tracking**: Support for speaker identification in multi-speaker scenarios
- **State Management**: Tracks listening/responding states for proper interrupt handling
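The interrupt-detection and state-management features above can be sketched as a small state machine: speech detected while the session is `responding` is treated as a barge-in. This is an illustrative sketch (class and method names are hypothetical, not the service's actual code):

```python
class SessionState:
    """Sketch of the listening/responding state machine with barge-in."""

    def __init__(self):
        self.state = "listening"
        self.interrupts = []

    def on_voice_activity(self, timestamp):
        # Speech while the LLM is responding counts as an interrupt:
        # record it and switch back to listening mode.
        if self.state == "responding":
            self.interrupts.append({"type": "interrupt", "timestamp": timestamp})
            self.state = "listening"

    def on_state_change(self, new_state):
        self.state = new_state


session = SessionState()
session.on_state_change("responding")
session.on_voice_activity(1234567890.123)
```

In the real service the `on_voice_activity` trigger would come from VAD and the recorded interrupt would be published to NATS.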
## Architecture
```
┌─────────────────┐
│  Audio Source   │  (Frontend, Mobile App, etc.)
│  (Microphone)   │
└────────┬────────┘
         │ Chunks
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.stream.{session_id}
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  STT Streaming  │  (This Service)
│    Service      │  - Buffers chunks
│                 │  - Transcribes when ready
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.transcription.{session_id}
│  Transcription  │
└─────────────────┘
```
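The buffering step in the middle box follows the flush policy described under Features: process when enough bytes have accumulated, or when chunks stop arriving for a timeout. A minimal sketch of that policy (the class name and defaults mirror the env-var table below, but this is not the service's real code):

```python
import time


class ChunkBuffer:
    """Accumulate audio chunks; flush on a size threshold or silence timeout."""

    def __init__(self, max_bytes=512_000, timeout=2.0):
        self.max_bytes = max_bytes
        self.timeout = timeout
        self.buf = bytearray()
        self.last_chunk_at = time.monotonic()

    def add(self, chunk: bytes):
        """Append a chunk; return buffered audio if the size threshold is hit."""
        self.buf.extend(chunk)
        self.last_chunk_at = time.monotonic()
        if len(self.buf) >= self.max_bytes:
            return self._drain()
        return None

    def poll(self, now=None):
        """Return buffered audio if the silence timeout has elapsed."""
        now = time.monotonic() if now is None else now
        if self.buf and now - self.last_chunk_at >= self.timeout:
            return self._drain()
        return None

    def _drain(self):
        audio, self.buf = bytes(self.buf), bytearray()
        return audio


buf = ChunkBuffer(max_bytes=8, timeout=2.0)
assert buf.add(b"1234") is None  # below threshold, keep buffering
flushed = buf.add(b"56789")      # crosses the 8-byte threshold, flushes
```

The returned bytes would then be handed to the Whisper backend (HTTP or local, depending on the variant).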
## Variants
### stt_streaming.py (HTTP Backend)
Uses an external Whisper service via HTTP. Lightweight container that delegates GPU inference to a separate service.
### stt_streaming_local.py (ROCm Backend)
Runs Whisper locally on AMD GPU using ROCm/PyTorch. Single container with embedded model.
## NATS Message Protocol
### Audio Stream Input (ai.voice.stream.{session_id})
All messages use **msgpack** binary encoding.
**Start Stream:**
```python
{
    "type": "start",
    "session_id": "unique-session-id",
    "sample_rate": 16000,
    "channels": 1,
    "state": "listening",      # Optional: "listening" or "responding"
    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
}
```
**Audio Chunk:**
```python
{
    "type": "chunk",
    "audio_b64": "base64-encoded-audio-data",
    "timestamp": 1234567890.123
}
```
**State Change:**
```python
{
    "type": "state_change",
    "state": "responding"  # "listening" or "responding"
}
```
**End Stream:**
```python
{
    "type": "end"
}
```
### Transcription Output (ai.voice.transcription.{session_id})
**Transcription Result:**
```python
{
    "session_id": "unique-session-id",
    "transcript": "transcribed text",
    "sequence": 0,
    "is_partial": False,
    "is_final": True,
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1",  # If provided in start message
    "has_voice_activity": True,
    "state": "listening"
}
```
**Interrupt Notification:**
```python
{
    "session_id": "unique-session-id",
    "type": "interrupt",
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1"
}
```
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `WHISPER_URL` | `http://whisper-predictor.ai-ml.svc.cluster.local` | Whisper service URL (HTTP variant) |
| `WHISPER_MODEL_SIZE` | `medium` | Whisper model size (ROCm variant) |
| `WHISPER_DEVICE` | `cuda` | PyTorch device; ROCm builds of PyTorch expose AMD GPUs under `cuda` (ROCm variant) |
| `STT_BUFFER_SIZE_BYTES` | `512000` | Buffer size before processing (~5s) |
| `STT_CHUNK_TIMEOUT` | `2.0` | Seconds of silence before processing |
| `STT_ENABLE_VAD` | `true` | Enable voice activity detection |
| `STT_VAD_AGGRESSIVENESS` | `2` | VAD aggressiveness (0-3) |
| `STT_ENABLE_INTERRUPT_DETECTION` | `true` | Enable interrupt detection |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
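Reading the table above at startup can look like this (a sketch with illustrative key names; the variable names and defaults match the table):

```python
import os


def load_config():
    """Load STT settings from the environment with documented defaults."""
    truthy = lambda v: v.lower() == "true"
    return {
        "nats_url": os.getenv(
            "NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "buffer_size_bytes": int(os.getenv("STT_BUFFER_SIZE_BYTES", "512000")),
        "chunk_timeout": float(os.getenv("STT_CHUNK_TIMEOUT", "2.0")),
        "enable_vad": truthy(os.getenv("STT_ENABLE_VAD", "true")),
        "vad_aggressiveness": int(os.getenv("STT_VAD_AGGRESSIVENESS", "2")),
        "enable_interrupt_detection": truthy(
            os.getenv("STT_ENABLE_INTERRUPT_DETECTION", "true")),
    }


cfg = load_config()
```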
## Building
### HTTP Variant
```bash
docker build -t stt-module:latest .
```
### ROCm Variant (AMD GPU)
```bash
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
```
## Testing
```bash
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Start a session
python -c "
import asyncio
import msgpack
import nats

async def test():
    nc = await nats.connect('nats://localhost:4222')
    await nc.publish('ai.voice.stream.test-session', msgpack.packb(
        {'type': 'start', 'session_id': 'test-session',
         'sample_rate': 16000, 'channels': 1}))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session',
                     msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"

# Subscribe to transcriptions
nats sub 'ai.voice.transcription.>'
```
## License
MIT