# Streaming STT Module

A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.

## Overview

This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. This significantly reduces latency in voice assistant applications.

## Features

- **Live Audio Streaming**: Accepts audio chunks via NATS as they're captured
- **Incremental Processing**: Transcribes audio as soon as sufficient data is buffered
- **Session Management**: Handles multiple concurrent streaming sessions
- **Automatic Buffer Management**: Processes audio based on size thresholds or a timeout
- **Partial Results**: Publishes transcription results progressively during long streams
- **Voice Activity Detection (VAD)**: Distinguishes speech from silence to optimize processing
- **Interrupt Detection**: Detects when the user speaks during an LLM response and switches back to listening mode
- **Speaker Tracking**: Support for speaker identification in multi-speaker scenarios
- **State Management**: Tracks listening/responding states for proper interrupt handling
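WebRTC-style VAD classifies short, fixed-length frames of 16-bit mono PCM (10, 20, or 30 ms at 8/16/32/48 kHz), so incoming audio must be sliced before classification. A minimal stdlib sketch of that framing step, with an illustrative helper name not taken from this module:

```python
# Sketch: splitting 16-bit mono PCM into the fixed-size frames a WebRTC-style
# VAD expects (10/20/30 ms at 8/16/32/48 kHz). Helper name is illustrative.

def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> list[bytes]:
    """Split raw 16-bit mono PCM into equal frames; drop the trailing remainder."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]

# One second of audio at 16 kHz (32000 bytes) yields 33 full 30 ms frames
# of 960 bytes each; the 320-byte remainder is dropped.
frames = split_frames(b"\x00" * 32000)
```

Each resulting frame is the right size to hand to a VAD's per-frame speech/silence check.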
## Architecture

```
┌─────────────────┐
│  Audio Source   │  (Frontend, Mobile App, etc.)
│  (Microphone)   │
└────────┬────────┘
         │ Chunks
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.stream.{session_id}
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  STT Streaming  │  (This Service)
│     Service     │  - Buffers chunks
│                 │  - Transcribes when ready
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  NATS Subject   │  ai.voice.transcription.{session_id}
│  Transcription  │
└─────────────────┘
```
## Variants

### stt_streaming.py (HTTP Backend)

Uses an external Whisper service via HTTP. Lightweight container; delegates GPU inference to a separate service.

### stt_streaming_local.py (ROCm Backend)

Runs Whisper locally on an AMD GPU using ROCm/PyTorch. Single container with an embedded model.

## NATS Message Protocol

### Audio Stream Input (ai.voice.stream.{session_id})

All messages use **msgpack** binary encoding.
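As a quick orientation, here is a round trip through the `msgpack` package (the field values are the examples from this README, not live session data):

```python
import msgpack

# Sketch: msgpack round trip for a protocol message. The dict below uses the
# example values from this README, not live session data.
start_msg = {"type": "start", "session_id": "unique-session-id",
             "sample_rate": 16000, "channels": 1}

payload = msgpack.packb(start_msg)  # bytes, suitable for publishing to NATS
decoded = msgpack.unpackb(payload)  # back to a dict with str keys

assert decoded == start_msg
```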
**Start Stream:**

```python
{
    "type": "start",
    "session_id": "unique-session-id",
    "sample_rate": 16000,
    "channels": 1,
    "state": "listening",      # Optional: "listening" or "responding"
    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
}
```
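A consumer will typically sanity-check a start message before opening a session. The checks below are a hedged sketch: the function name, the accepted sample rates, and the mono-only rule are assumptions for illustration, not part of the protocol above.

```python
# Sketch: minimal validation a consumer might apply to a "start" message.
# The accepted sample rates and mono-only rule are illustrative assumptions.

def validate_start(msg: dict) -> None:
    if msg.get("type") != "start":
        raise ValueError("expected a start message")
    if not isinstance(msg.get("session_id"), str) or not msg["session_id"]:
        raise ValueError("session_id must be a non-empty string")
    if msg.get("sample_rate") not in (8000, 16000, 32000, 48000):
        raise ValueError("unsupported sample_rate")
    if msg.get("channels") != 1:
        raise ValueError("only mono audio is supported")
    if msg.get("state", "listening") not in ("listening", "responding"):
        raise ValueError("state must be 'listening' or 'responding'")

# Accepts the example message from this README without raising.
validate_start({"type": "start", "session_id": "unique-session-id",
                "sample_rate": 16000, "channels": 1})
```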
**Audio Chunk:**

```python
{
    "type": "chunk",
    "audio_b64": "base64-encoded-audio-data",
    "timestamp": 1234567890.123
}
```
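Building a chunk message from raw PCM is a base64 encode on the producer side and the inverse decode on the service side; a self-contained sketch (the sample PCM bytes are illustrative):

```python
import base64
import time

# Sketch: producing a chunk message from raw PCM bytes. Per the protocol,
# audio is base64-encoded inside the (msgpack-serialized) payload.
pcm = b"\x00\x01" * 480  # 30 ms of 16-bit mono audio at 16 kHz (illustrative)

chunk_msg = {
    "type": "chunk",
    "audio_b64": base64.b64encode(pcm).decode("ascii"),
    "timestamp": time.time(),
}

# The service side recovers the raw bytes with the inverse operation:
recovered = base64.b64decode(chunk_msg["audio_b64"])
assert recovered == pcm
```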
**State Change:**

```python
{
    "type": "state_change",
    "state": "responding"  # "listening" or "responding"
}
```
**End Stream:**

```python
{
    "type": "end"
}
```
### Transcription Output (ai.voice.transcription.{session_id})

**Transcription Result:**

```python
{
    "session_id": "unique-session-id",
    "transcript": "transcribed text",
    "sequence": 0,
    "is_partial": False,
    "is_final": True,
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1",  # If provided in the start message
    "has_voice_activity": True,
    "state": "listening"
}
```
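Because partial results arrive before the final result for the same `sequence`, a consumer needs a policy for folding them together. One plausible policy, sketched here as an assumption rather than the service's actual behavior, is to let later results for a sequence overwrite earlier partials:

```python
# Sketch: folding partial and final transcription results into a running
# transcript, keyed by sequence number. The overwrite policy is an
# assumption, not part of the protocol specification.

def fold_results(messages: list[dict]) -> str:
    segments: dict[int, str] = {}
    for msg in messages:
        # A later result for the same sequence replaces the earlier partial.
        segments[msg["sequence"]] = msg["transcript"]
    return " ".join(segments[seq] for seq in sorted(segments))

messages = [
    {"sequence": 0, "transcript": "hello wor", "is_partial": True, "is_final": False},
    {"sequence": 0, "transcript": "hello world", "is_partial": False, "is_final": True},
    {"sequence": 1, "transcript": "how are you", "is_partial": False, "is_final": True},
]
assert fold_results(messages) == "hello world how are you"
```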
**Interrupt Notification:**

```python
{
    "session_id": "unique-session-id",
    "type": "interrupt",
    "timestamp": 1234567890.123,
    "speaker_id": "speaker-1"
}
```
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
| `WHISPER_URL` | `http://whisper-predictor.ai-ml.svc.cluster.local` | Whisper service URL (HTTP variant) |
| `WHISPER_MODEL_SIZE` | `medium` | Whisper model size (ROCm variant) |
| `WHISPER_DEVICE` | `cuda` | PyTorch device (ROCm variant) |
| `STT_BUFFER_SIZE_BYTES` | `512000` | Buffer size before processing (~5s) |
| `STT_CHUNK_TIMEOUT` | `2.0` | Seconds of silence before processing |
| `STT_ENABLE_VAD` | `true` | Enable voice activity detection |
| `STT_VAD_AGGRESSIVENESS` | `2` | VAD aggressiveness (0-3) |
| `STT_ENABLE_INTERRUPT_DETECTION` | `true` | Enable interrupt detection |
| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
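A sketch of reading this configuration with stdlib `os.getenv`; the variable names and defaults mirror the table, while the loader function itself is illustrative:

```python
import os

# Sketch: loading the configuration above with os.getenv. Variable names
# and defaults mirror the table; the loader itself is illustrative.
def load_config() -> dict:
    return {
        "nats_url": os.getenv("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222"),
        "buffer_size": int(os.getenv("STT_BUFFER_SIZE_BYTES", "512000")),
        "chunk_timeout": float(os.getenv("STT_CHUNK_TIMEOUT", "2.0")),
        "enable_vad": os.getenv("STT_ENABLE_VAD", "true").lower() == "true",
        "vad_aggressiveness": int(os.getenv("STT_VAD_AGGRESSIVENESS", "2")),
    }

cfg = load_config()
```

Booleans arrive as strings from the environment, hence the explicit `== "true"` comparison rather than `bool()` (which would treat any non-empty string, including `"false"`, as truthy).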
## Building

### HTTP Variant

```bash
docker build -t stt-module:latest .
```

### ROCm Variant (AMD GPU)

```bash
docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
```
## Testing

```bash
# Port-forward NATS
kubectl port-forward -n ai-ml svc/nats 4222:4222

# Start a session
python -c "
import nats
import msgpack
import asyncio

async def test():
    nc = await nats.connect('nats://localhost:4222')
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'start'}))
    # Send audio chunks...
    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
    await nc.close()

asyncio.run(test())
"

# Subscribe to transcriptions
nats sub "ai.voice.transcription.>"
```
## License

MIT