feat: add streaming STT service with Whisper backend
- stt_streaming.py: HTTP-based STT using external Whisper service
- stt_streaming_local.py: ROCm-based local Whisper inference
- Voice Activity Detection (VAD) with WebRTC
- Interrupt detection for barge-in support
- Session state management (listening/responding)
- OpenTelemetry instrumentation with HyperDX support
- Dockerfile variants for HTTP and ROCm deployments
# Streaming STT Module

A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.

## Overview

This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.

## Features

- **Live Audio Streaming**: Accepts audio chunks via NATS as they are captured
- **Incremental Processing**: Transcribes audio as soon as sufficient data is buffered
- **Session Management**: Handles multiple concurrent streaming sessions
- **Automatic Buffer Management**: Processes audio based on size thresholds or timeouts
- **Partial Results**: Publishes transcription results progressively during long streams
- **Voice Activity Detection (VAD)**: Distinguishes speech from silence to optimize processing
- **Interrupt Detection**: Detects when the user speaks during an LLM response and switches back to listening mode
- **Speaker Tracking**: Supports speaker identification in multi-speaker scenarios
- **State Management**: Tracks listening/responding states for proper interrupt handling

## Architecture

```
┌─────────────────┐
│  Audio Source   │  (Frontend, Mobile App, etc.)
│  (Microphone)   │
└────────┬────────┘
         │ Chunks
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.stream.{session_id}
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  STT Streaming  │  (This Service)
│    Service      │  - Buffers chunks
│                 │  - Transcribes when ready
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.transcription.{session_id}
│  Transcription  │
└─────────────────┘
```
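The "buffers chunks, transcribes when ready" step in the middle box amounts to a size-or-timeout flush policy. A minimal sketch (class and method names are illustrative, not the service's actual code; the defaults mirror `STT_BUFFER_SIZE_BYTES` and `STT_CHUNK_TIMEOUT`):

```python
import time


class AudioBuffer:
    """Accumulates PCM chunks; flushes on a size threshold or silence timeout."""

    def __init__(self, max_bytes: int = 512_000, timeout_s: float = 2.0):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.chunks: list[bytes] = []
        self.size = 0
        self.last_chunk_at = time.monotonic()

    def append(self, chunk: bytes) -> None:
        self.chunks.append(chunk)
        self.size += len(chunk)
        self.last_chunk_at = time.monotonic()

    def should_flush(self) -> bool:
        # Flush when enough audio is buffered, or when no new chunk has
        # arrived for timeout_s seconds (likely the end of an utterance).
        timed_out = self.size > 0 and time.monotonic() - self.last_chunk_at >= self.timeout_s
        return self.size >= self.max_bytes or timed_out

    def flush(self) -> bytes:
        audio = b"".join(self.chunks)
        self.chunks.clear()
        self.size = 0
        return audio
```

The timeout path is what lets short utterances get transcribed promptly instead of waiting for the buffer to fill.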

## Variants

### stt_streaming.py (HTTP Backend)

Uses an external Whisper service via HTTP. Lightweight container that delegates GPU inference to a separate service.

### stt_streaming_local.py (ROCm Backend)

Runs Whisper locally on an AMD GPU using ROCm/PyTorch. Single container with an embedded model.

## NATS Message Protocol

### Audio Stream Input (ai.voice.stream.{session_id})

All messages use **msgpack** binary encoding.

**Start Stream:**
```python
{
    "type": "start",
    "session_id": "unique-session-id",
    "sample_rate": 16000,
    "channels": 1,
    "state": "listening",      # Optional: "listening" or "responding"
    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
}
```

**Audio Chunk:**
```python
{
    "type": "chunk",
    "audio_b64": "base64-encoded-audio-data",
    "timestamp": 1234567890.123
}
```

**State Change:**
```python
{
    "type": "state_change",
    "state": "responding"  # "listening" or "responding"
}
```

**End Stream:**
```python
{
    "type": "end"
}
```
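For illustration, the chunk payload above can be built like this (a sketch only; `make_chunk` is a hypothetical helper, and a real client would msgpack-encode the dict and publish it to `ai.voice.stream.{session_id}` via nats-py):

```python
import base64
import time


def make_chunk(pcm_bytes: bytes) -> dict:
    """Build an audio-chunk message matching the protocol above.

    Raw PCM bytes are base64-encoded into the `audio_b64` field;
    the service decodes them before buffering.
    """
    return {
        "type": "chunk",
        "audio_b64": base64.b64encode(pcm_bytes).decode("ascii"),
        "timestamp": time.time(),
    }


# A client then publishes the full sequence: start -> chunk* -> end, e.g.
#   await nc.publish(f"ai.voice.stream.{session_id}", msgpack.packb(msg))
```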

### Transcription Output (ai.voice.transcription.{session_id})

**Transcription Result:**
```python
{
    "session_id": "unique-session-id",
    "transcript": "transcribed text",
    "sequence": 0,
    "is_partial": False,
    "is_final": True,
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1",  # If provided in start message
    "has_voice_activity": True,
    "state": "listening"
}
```

**Interrupt Notification:**
```python
{
    "session_id": "unique-session-id",
    "type": "interrupt",
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1"
}
```
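A subscriber on `ai.voice.transcription.{session_id}` can tell the two message shapes apart by the `type` field. A minimal dispatch sketch (`handle_message` is a hypothetical name; in practice the dict would come from `msgpack.unpackb`):

```python
def handle_message(msg: dict) -> str:
    """Route a decoded message from the transcription subject."""
    if msg.get("type") == "interrupt":
        # The user spoke while the assistant was responding:
        # stop TTS playback and switch back to listening.
        return f"interrupt from {msg.get('speaker_id', 'unknown')}"
    if msg.get("is_final"):
        return f"final: {msg['transcript']}"
    # Partial results may be superseded by later messages.
    return f"partial: {msg['transcript']}"
```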

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `WHISPER_URL` | `http://whisper-predictor.ai-ml.svc.cluster.local` | Whisper service URL (HTTP variant) |
| `WHISPER_MODEL_SIZE` | `medium` | Whisper model size (ROCm variant) |
| `WHISPER_DEVICE` | `cuda` | PyTorch device (ROCm variant) |
| `STT_BUFFER_SIZE_BYTES` | `512000` | Buffer size before processing (~5s) |
| `STT_CHUNK_TIMEOUT` | `2.0` | Seconds of silence before processing |
| `STT_ENABLE_VAD` | `true` | Enable voice activity detection |
| `STT_VAD_AGGRESSIVENESS` | `2` | VAD aggressiveness (0-3) |
| `STT_ENABLE_INTERRUPT_DETECTION` | `true` | Enable interrupt detection |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
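WebRTC VAD (the `webrtcvad` package) only accepts 10, 20, or 30 ms frames of 16-bit mono PCM, so buffered audio must be split into fixed-size frames before `vad.is_speech()` can be called. A sketch of the frame math, assuming 16 kHz 16-bit mono audio (the helper is illustrative, not the service's actual code):

```python
def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20) -> list[bytes]:
    """Split 16-bit mono PCM into fixed-size VAD frames, dropping any partial tail."""
    # bytes per frame = samples per frame * 2 bytes per 16-bit sample
    frame_bytes = sample_rate * frame_ms // 1000 * 2
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]


# With webrtcvad (aggressiveness taken from STT_VAD_AGGRESSIVENESS, 0-3):
#   vad = webrtcvad.Vad(2)
#   has_speech = any(vad.is_speech(f, 16000) for f in split_frames(pcm))
```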

## Building

### HTTP Variant
```bash
docker build -t stt-module:latest .
```

### ROCm Variant (AMD GPU)
```bash
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
```

## Testing

```bash
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Start a session
python -c "
import nats
import msgpack
import asyncio

async def test():
    nc = await nats.connect('nats://localhost:4222')
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'start'}))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"

# Subscribe to transcriptions
nats sub "ai.voice.transcription.>"
```

## License

MIT