# Streaming STT Module

A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.

## Overview

This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.

## Features

- **Live Audio Streaming**: Accepts audio chunks via NATS as they're captured
- **Incremental Processing**: Transcribes audio as soon as sufficient data is buffered
- **Session Management**: Handles multiple concurrent streaming sessions
- **Automatic Buffer Management**: Processes audio based on size thresholds or timeout
- **Partial Results**: Publishes transcription results progressively during long streams
- **Voice Activity Detection (VAD)**: Distinguishes speech from silence to optimize processing
- **Interrupt Detection**: Detects when the user speaks during an LLM response and switches back to listening mode
- **Speaker Tracking**: Supports speaker identification in multi-speaker scenarios
- **State Management**: Tracks listening/responding states for proper interrupt handling

## Architecture

```
┌─────────────────┐
│  Audio Source   │  (Frontend, Mobile App, etc.)
│  (Microphone)   │
└────────┬────────┘
         │ Chunks
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.stream.{session_id}
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  STT Streaming  │  (This Service)
│    Service      │  - Buffers chunks
│                 │  - Transcribes when ready
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.transcription.{session_id}
│  Transcription  │
└─────────────────┘
```

## Variants

### stt_streaming.py (HTTP Backend)

Uses an external Whisper service via HTTP. Lightweight container that delegates GPU inference to a separate service.

### stt_streaming_local.py (ROCm Backend)

Runs Whisper locally on an AMD GPU using ROCm/PyTorch. Single container with the model embedded.
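The "Automatic Buffer Management" behavior above (flush when a size threshold is reached or the stream goes quiet) can be sketched as follows. This is a minimal illustration, not the service's actual implementation: `SessionBuffer` and its method names are hypothetical, and the constants mirror the `STT_BUFFER_SIZE_BYTES` and `STT_CHUNK_TIMEOUT` defaults documented below.

```python
import time

# Illustrative defaults mirroring STT_BUFFER_SIZE_BYTES and STT_CHUNK_TIMEOUT
BUFFER_SIZE_BYTES = 512_000  # size threshold before a transcription pass
CHUNK_TIMEOUT = 2.0          # seconds of silence before flushing anyway

class SessionBuffer:
    """Per-session audio buffer (hypothetical sketch)."""

    def __init__(self) -> None:
        self.chunks = bytearray()
        self.last_chunk_at = time.monotonic()

    def add(self, audio: bytes) -> None:
        """Append a decoded audio chunk and note its arrival time."""
        self.chunks.extend(audio)
        self.last_chunk_at = time.monotonic()

    def should_flush(self) -> bool:
        # Flush when enough audio is buffered for a transcription pass,
        # or when a non-empty buffer has been idle for CHUNK_TIMEOUT seconds.
        if len(self.chunks) >= BUFFER_SIZE_BYTES:
            return True
        idle = time.monotonic() - self.last_chunk_at
        return len(self.chunks) > 0 and idle >= CHUNK_TIMEOUT

    def drain(self) -> bytes:
        """Return buffered audio and reset the buffer for the next pass."""
        audio, self.chunks = bytes(self.chunks), bytearray()
        return audio
```

One session holds one `SessionBuffer`; the service's event loop would call `should_flush()` on each incoming chunk (and on a timer, to catch the silence timeout) and hand `drain()`'s output to Whisper.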
## NATS Message Protocol

### Audio Stream Input (ai.voice.stream.{session_id})

All messages use **msgpack** binary encoding.

**Start Stream:**

```python
{
    "type": "start",
    "session_id": "unique-session-id",
    "sample_rate": 16000,
    "channels": 1,
    "state": "listening",      # Optional: "listening" or "responding"
    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
}
```

**Audio Chunk:**

```python
{
    "type": "chunk",
    "audio_b64": "base64-encoded-audio-data",
    "timestamp": 1234567890.123
}
```

**State Change:**

```python
{
    "type": "state_change",
    "state": "responding"  # "listening" or "responding"
}
```

**End Stream:**

```python
{
    "type": "end"
}
```

### Transcription Output (ai.voice.transcription.{session_id})

**Transcription Result:**

```python
{
    "session_id": "unique-session-id",
    "transcript": "transcribed text",
    "sequence": 0,
    "is_partial": False,
    "is_final": True,
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1",  # If provided in the start message
    "has_voice_activity": True,
    "state": "listening"
}
```

**Interrupt Notification:**

```python
{
    "session_id": "unique-session-id",
    "type": "interrupt",
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1"
}
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `WHISPER_URL` | `http://whisper-predictor.ai-ml.svc.cluster.local` | Whisper service URL (HTTP variant) |
| `WHISPER_MODEL_SIZE` | `medium` | Whisper model size (ROCm variant) |
| `WHISPER_DEVICE` | `cuda` | PyTorch device (ROCm variant) |
| `STT_BUFFER_SIZE_BYTES` | `512000` | Buffer size before processing (~5s) |
| `STT_CHUNK_TIMEOUT` | `2.0` | Seconds of silence before processing |
| `STT_ENABLE_VAD` | `true` | Enable voice activity detection |
| `STT_VAD_AGGRESSIVENESS` | `2` | VAD aggressiveness (0-3) |
| `STT_ENABLE_INTERRUPT_DETECTION` | `true` | Enable interrupt detection |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |

## Building

### HTTP Variant

```bash
docker build -t stt-module:latest .
```

### ROCm Variant (AMD GPU)

```bash
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
```

## Testing

```bash
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222
```

Start a session, send one chunk, and end the stream:

```python
import asyncio
import base64
import time

import msgpack
import nats

async def test():
    nc = await nats.connect("nats://localhost:4222")
    subject = "ai.voice.stream.test-session"
    await nc.publish(subject, msgpack.packb({
        "type": "start",
        "session_id": "test-session",
        "sample_rate": 16000,
        "channels": 1,
    }))
    # One second of silence (16 kHz, 16-bit mono), base64-encoded
    silence = base64.b64encode(b"\x00" * 32000).decode()
    await nc.publish(subject, msgpack.packb({
        "type": "chunk",
        "audio_b64": silence,
        "timestamp": time.time(),
    }))
    await nc.publish(subject, msgpack.packb({"type": "end"}))
    await nc.close()

asyncio.run(test())
```

```bash
# Subscribe to transcriptions (payloads are msgpack-encoded binary)
nats sub "ai.voice.transcription.>"
```

## License

MIT