feat: add streaming STT service with Whisper backend

- stt_streaming.py: HTTP-based STT using external Whisper service
- stt_streaming_local.py: ROCm-based local Whisper inference
- Voice Activity Detection (VAD) with WebRTC
- Interrupt detection for barge-in support
- Session state management (listening/responding)
- OpenTelemetry instrumentation with HyperDX support
- Dockerfile variants for HTTP and ROCm deployments
2026-02-02 06:23:12 -05:00
parent 680e43fe39
commit 8fc5eb1193
9 changed files with 1473 additions and 1 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,11 @@
+.venv/
+__pycache__/
+*.pyc
+*.pyo
+.pytest_cache/
+.mypy_cache/
+*.egg-info/
+dist/
+build/
+.env
+.env.local
--- a/Dockerfile
+++ b/Dockerfile
@@ -0,0 +1,14 @@
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# Copy requirements and install dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY stt_streaming.py .
+COPY healthcheck.py .
+
+# Run the service
+CMD ["python", "stt_streaming.py"]
--- a/Dockerfile.rocm
+++ b/Dockerfile.rocm
@@ -0,0 +1,52 @@
+# STT Streaming Service with ROCm for AMD GPU Whisper inference
+# Targets AMD Strix Halo (gfx1151 / RDNA 3.5) but includes RDNA 3 compatibility
+#
+# Uses OpenAI Whisper with PyTorch ROCm backend
+#
+FROM docker.io/rocm/pytorch:rocm7.1_ubuntu24.04_py3.12_pytorch_release_2.9.1 AS base
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    ffmpeg \
+    libsndfile1 \
+    && rm -rf /var/lib/apt/lists/*
+
+# WORKAROUND: ROCm/ROCm#5853 - Standard PyTorch ROCm wheels cause segfault in
+# libhsa-runtime64.so during VRAM allocation on gfx1151 (Strix Halo).
+# TheRock nightly builds work correctly. Install BEFORE other deps since
+# openai-whisper depends on torch.
+RUN pip install --no-cache-dir --break-system-packages \
+    --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
+    torch torchaudio torchvision --force-reinstall
+
+# Install Python dependencies for STT streaming
+# Use pip directly (more reliable than uv in this base image)
+COPY requirements-rocm.txt .
+RUN pip install --no-cache-dir --break-system-packages -r requirements-rocm.txt
+
+# Download Whisper model at build time for faster startup
+# Using medium model for good accuracy/speed balance
+ARG WHISPER_MODEL=medium
+ENV WHISPER_MODEL_SIZE=${WHISPER_MODEL}
+
+# Pre-download the model during build (whisper is installed as openai-whisper)
+# Use python3 to ensure correct interpreter
+RUN python3 -c "import whisper; whisper.load_model('${WHISPER_MODEL}')" || echo "Model will be downloaded at runtime"
+
+# Copy application code
+COPY stt_streaming_local.py .
+COPY healthcheck.py .
+
+# Set ROCm environment for AMD Strix Halo (gfx1151 / RDNA 3.5)
+ENV HIP_VISIBLE_DEVICES=0
+ENV HSA_ENABLE_SDMA=0
+# Ensure PyTorch uses ROCm with expandable segments for large models
+ENV PYTORCH_HIP_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
+# Target gfx1151 (Strix Halo) - ROCm 7.1+ has native support
+# Falls back to runtime override if kernels not available
+ENV ROCM_TARGET_LST=gfx1151,gfx1100
+
+# Run the service
+CMD ["python", "stt_streaming_local.py"]
--- a/README.md
+++ b/README.md
@@ -1,2 +1,182 @@
-# stt-module
+# Streaming STT Module

+A dedicated Speech-to-Text (STT) service that processes live audio streams from NATS for faster transcription responses.
+
+## Overview
+
+This module enables real-time speech-to-text processing by accepting audio chunks as they arrive rather than waiting for complete audio files. Because transcription can begin before the speaker has finished, this reduces end-to-end latency in voice assistant applications.
+
+## Features
+
+- **Live Audio Streaming**: Accepts audio chunks via NATS as they're captured
+- **Incremental Processing**: Transcribes audio as soon as sufficient data is buffered
+- **Session Management**: Handles multiple concurrent streaming sessions
+- **Automatic Buffer Management**: Processes audio based on size thresholds or timeout
+- **Partial Results**: Publishes transcription results progressively during long streams
+- **Voice Activity Detection (VAD)**: Detects speech vs silence to optimize processing
+- **Interrupt Detection**: Detects when user speaks during LLM response and switches back to listening mode
+- **Speaker Tracking**: Support for speaker identification in multi-speaker scenarios
+- **State Management**: Tracks listening/responding states for proper interrupt handling
+
+## Architecture
+
+```
+┌─────────────────┐
+│  Audio Source   │ (Frontend, Mobile App, etc.)
+│  (Microphone)   │
+└────────┬────────┘
+         │ Chunks
+         ▼
+┌─────────────────┐
+│  NATS Subject   │ ai.voice.stream.{session_id}
+│  Audio Stream   │
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│  STT Streaming  │ (This Service)
+│     Service     │ - Buffers chunks
+│                 │ - Transcribes when ready
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐
+│  NATS Subject   │ ai.voice.transcription.{session_id}
+│  Transcription  │
+└─────────────────┘
+```
+
+## Variants
+
+### stt_streaming.py (HTTP Backend)
+Uses an external Whisper service via HTTP. Lightweight container, delegates GPU inference to a separate service.
+
+### stt_streaming_local.py (ROCm Backend)
+Runs Whisper locally on AMD GPU using ROCm/PyTorch. Single container with embedded model.
+
+## NATS Message Protocol
+
+### Audio Stream Input (ai.voice.stream.{session_id})
+
+All messages use **msgpack** binary encoding.
+
+**Start Stream:**
+```python
+{
+    "type": "start",
+    "session_id": "unique-session-id",
+    "sample_rate": 16000,
+    "channels": 1,
+    "state": "listening",  # Optional: "listening" or "responding"
+    "speaker_id": "speaker-1"  # Optional: identifier for speaker tracking
+}
+```
+
+**Audio Chunk:**
+```python
+{
+    "type": "chunk",
+    "audio_b64": "base64-encoded-audio-data",
+    "timestamp": 1234567890.123
+}
+```
+
+**State Change:**
+```python
+{
+    "type": "state_change",
+    "state": "responding"  # "listening" or "responding"
+}
+```
+
+**End Stream:**
+```python
+{
+    "type": "end"
+}
+```
+
+### Transcription Output (ai.voice.transcription.{session_id})
+
+**Transcription Result:**
+```python
+{
+    "session_id": "unique-session-id",
+    "transcript": "transcribed text",
+    "sequence": 0,
+    "is_partial": False,
+    "is_final": True,
+    "timestamp": 1234567890.123,
+    "speaker_id": "speaker-1",  # If provided in start message
+    "has_voice_activity": True,
+    "state": "listening"
+}
+```
+
+**Interrupt Notification:**
+```python
+{
+    "session_id": "unique-session-id",
+    "type": "interrupt",
+    "timestamp": 1234567890.123,
+    "speaker_id": "speaker-1"
+}
+```
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `NATS_URL` | `nats://nats.ai-ml.svc.cluster.local:4222` | NATS server URL |
+| `WHISPER_URL` | `http://whisper-predictor.ai-ml.svc.cluster.local` | Whisper service URL (HTTP variant) |
+| `WHISPER_MODEL_SIZE` | `medium` | Whisper model size (ROCm variant) |
+| `WHISPER_DEVICE` | `cuda` | PyTorch device (ROCm variant; ROCm builds of PyTorch expose the GPU as `cuda`) |
+| `STT_BUFFER_SIZE_BYTES` | `512000` | Buffer size before processing (~16 s of 16 kHz 16-bit mono audio) |
+| `STT_CHUNK_TIMEOUT` | `2.0` | Seconds without a new chunk before processing |
+| `STT_ENABLE_VAD` | `true` | Enable voice activity detection |
+| `STT_VAD_AGGRESSIVENESS` | `2` | VAD aggressiveness (0-3) |
+| `STT_ENABLE_INTERRUPT_DETECTION` | `true` | Enable interrupt detection |
+| `OTEL_ENABLED` | `true` | Enable OpenTelemetry |
+| `HYPERDX_ENABLED` | `false` | Enable HyperDX observability |
+
+## Building
+
+### HTTP Variant
+```bash
+docker build -t stt-module:latest .
+```
+
+### ROCm Variant (AMD GPU)
+```bash
+docker build -f Dockerfile.rocm -t stt-module:rocm --build-arg WHISPER_MODEL=medium .
+```
+
+## Testing
+
+```bash
+# Port-forward NATS
+kubectl port-forward -n ai-ml svc/nats 4222:4222
+
+# Start a session
+python -c "
+import nats
+import msgpack
+import asyncio
+
+async def test():
+    nc = await nats.connect('nats://localhost:4222')
+    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'start'}))
+    # Send audio chunks...
+    await nc.publish('ai.voice.stream.test-session', msgpack.packb({'type': 'end'}))
+    await nc.close()
+
+asyncio.run(test())
+"
+
+# Subscribe to transcriptions
+nats sub "ai.voice.transcription.>"
+```
+
+## License
+
+MIT
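
Note on the README's NATS message protocol: the message shapes can be checked offline, without a broker or msgpack, with a minimal sketch like the one below. The helper names (`start_message`, `chunk_message`, `pcm_seconds`) are hypothetical and not part of this commit. The sketch assumes chunks carry raw 16-bit mono PCM, which is what the `detect_voice_activity` docstring in stt_streaming.py expects.

```python
import base64
import time
from typing import Optional


def start_message(session_id: str, sample_rate: int = 16000) -> dict:
    """Build a 'start' control message using the README's field names."""
    return {
        "type": "start",
        "session_id": session_id,
        "sample_rate": sample_rate,
        "channels": 1,
        "state": "listening",
    }


def chunk_message(pcm: bytes, timestamp: Optional[float] = None) -> dict:
    """Wrap raw PCM bytes in a 'chunk' message (base64 text payload)."""
    return {
        "type": "chunk",
        "audio_b64": base64.b64encode(pcm).decode("ascii"),
        "timestamp": time.time() if timestamp is None else timestamp,
    }


def pcm_seconds(n_bytes: int, sample_rate: int = 16000,
                sample_width: int = 2, channels: int = 1) -> float:
    """Duration in seconds of n_bytes of uncompressed PCM audio."""
    return n_bytes / (sample_rate * sample_width * channels)


if __name__ == "__main__":
    pcm = b"\x00\x00" * 16000  # one second of silence at 16 kHz, 16-bit mono
    msg = chunk_message(pcm, timestamp=0.0)
    decoded = base64.b64decode(msg["audio_b64"])
    print(start_message("test-session"))
    print(len(decoded), "bytes =", pcm_seconds(len(decoded)), "s")
```

On the wire, each dict would be passed through `msgpack.packb(...)` and published to `ai.voice.stream.{session_id}`, as in the README's Testing snippet. `pcm_seconds` also gives a quick way to translate byte thresholds such as `STT_BUFFER_SIZE_BYTES` into seconds of audio.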
--- a/healthcheck.py
+++ b/healthcheck.py
@@ -0,0 +1,28 @@
+#!/usr/bin/env python3
+"""
+Health check script for Kubernetes probes
+Verifies NATS connectivity
+"""
+import sys
+import os
+import asyncio
+
+import nats
+
+NATS_URL = os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
+
+
+async def check_health():
+    """Check if service can connect to NATS."""
+    try:
+        nc = await asyncio.wait_for(nats.connect(NATS_URL), timeout=5.0)
+        await nc.close()
+        return True
+    except Exception as e:
+        print(f"Health check failed: {e}", file=sys.stderr)
+        return False
+
+
+if __name__ == "__main__":
+    result = asyncio.run(check_health())
+    sys.exit(0 if result else 1)
--- a/requirements-rocm.txt
+++ b/requirements-rocm.txt
@@ -0,0 +1,24 @@
+# Core dependencies
+nats-py>=2.0.0,<3.0.0
+msgpack
+
+# Whisper for local STT inference (uses PyTorch already in base image)
+openai-whisper>=20231117
+
+# Audio processing
+soundfile
+numpy
+
+# OpenTelemetry core
+opentelemetry-api
+opentelemetry-sdk
+opentelemetry-exporter-otlp-proto-grpc
+opentelemetry-exporter-otlp-proto-http
+opentelemetry-instrumentation-logging
+
+# HyperDX support (uses OTLP protocol)
+# HyperDX is compatible with standard OTEL exporters, just needs API key header
+opentelemetry-sdk-extension-aws  # For additional context propagation
+
+# HTTP health server for KServe compatibility
+aiohttp
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,20 @@
+nats-py>=2.0.0,<3.0.0
+httpx>=0.20.0,<1.0.0
+msgpack
+
+# Audio processing
+numpy>=1.20.0,<2.0.0
+webrtcvad>=2.0.10
+# pyannote.audio>=3.1.0  # Optional: for advanced speaker diarization
+
+# OpenTelemetry core
+opentelemetry-api
+opentelemetry-sdk
+
+# OTEL exporters (gRPC for local collector, HTTP for HyperDX)
+opentelemetry-exporter-otlp-proto-grpc
+opentelemetry-exporter-otlp-proto-http
+
+# OTEL instrumentation
+opentelemetry-instrumentation-httpx
+opentelemetry-instrumentation-logging
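
Note on the VAD framing in stt_streaming.py (next file): WebRTC VAD only accepts 10, 20, or 30 ms frames of 16-bit mono PCM, so each chunk has to be sliced into fixed-size frames before classification, and the chunk is treated as speech when the share of voiced frames exceeds a threshold (`STT_VAD_VOICE_RATIO`, default 0.3). The following is a minimal stdlib-only sketch of that slicing and ratio, using a stand-in classifier instead of `webrtcvad`. The names `frame_bytes`, `iter_frames`, `voice_ratio`, and the `is_speech` callback are hypothetical.

```python
from typing import Callable, Iterator


def frame_bytes(sample_rate: int, frame_ms: int, sample_width: int = 2) -> int:
    """Size in bytes of one VAD frame of 16-bit mono PCM."""
    if frame_ms not in (10, 20, 30):
        raise ValueError("frame_ms must be 10, 20, or 30")
    return sample_rate * frame_ms // 1000 * sample_width


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield every complete frame; a trailing partial frame is dropped."""
    # The "+ 1" keeps a final frame that ends exactly at the end of the data.
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


def voice_ratio(pcm: bytes, sample_rate: int, frame_ms: int,
                is_speech: Callable[[bytes], bool]) -> float:
    """Fraction of complete frames that the classifier marks as speech."""
    frames = list(iter_frames(pcm, frame_bytes(sample_rate, frame_ms)))
    if not frames:
        return 0.0
    voiced = sum(1 for frame in frames if is_speech(frame))
    return voiced / len(frames)


if __name__ == "__main__":
    size = frame_bytes(16000, 30)
    # Stand-in classifier (assumption): any non-zero byte counts as "speech".
    pcm = b"\x00" * size + b"\x10" * size
    ratio = voice_ratio(pcm, 16000, 30, lambda frame: any(frame))
    print(size, ratio, ratio > 0.3)
```

With the real library, the callback would wrap `vad.is_speech(frame, sample_rate)` on a `webrtcvad.Vad` instance, as the service code does.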
--- a/stt_streaming.py
+++ b/stt_streaming.py
@@ -0,0 +1,632 @@
+#!/usr/bin/env python3
+"""
+Streaming STT Service
+
+Real-time Speech-to-Text service that processes live audio streams from NATS:
+1. Subscribe to audio stream subject (ai.voice.stream.{session_id})
+2. Buffer and accumulate audio chunks
+3. Transcribe when buffer reaches threshold or stream ends
+4. Publish transcription results to response channel (ai.voice.transcription.{session_id})
+
+This enables faster response times by processing audio as it arrives rather than
+waiting for complete audio upload.
+"""
+import asyncio
+import base64
+import contextlib
+import logging
+import os
+import signal
+import time
+import struct
+from typing import Dict, Optional, List, Tuple
+from io import BytesIO
+
+import httpx
+import msgpack
+import nats
+import nats.js
+from nats.aio.msg import Msg
+import numpy as np
+import webrtcvad
+
+# OpenTelemetry imports
+from opentelemetry import trace, metrics
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.sdk.metrics import MeterProvider
+from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
+from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter as OTLPSpanExporterHTTP
+from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter as OTLPMetricExporterHTTP
+from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION, SERVICE_NAMESPACE
+from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
+from opentelemetry.instrumentation.logging import LoggingInstrumentor
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger("stt-streaming")
+
+# Initialize OpenTelemetry
+def setup_telemetry():
+    """Initialize OpenTelemetry tracing and metrics with HyperDX support."""
+    # Check if OTEL is enabled
+    otel_enabled = os.environ.get("OTEL_ENABLED", "true").lower() == "true"
+    if not otel_enabled:
+        logger.info("OpenTelemetry disabled")
+        return None, None
+    
+    # OTEL configuration
+    otel_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://opentelemetry-collector.observability.svc.cluster.local:4317")
+    service_name = os.environ.get("OTEL_SERVICE_NAME", "stt-streaming")
+    service_namespace = os.environ.get("OTEL_SERVICE_NAMESPACE", "ai-ml")
+    
+    # HyperDX configuration
+    hyperdx_api_key = os.environ.get("HYPERDX_API_KEY", "")
+    hyperdx_endpoint = os.environ.get("HYPERDX_ENDPOINT", "https://in-otel.hyperdx.io")
+    use_hyperdx = os.environ.get("HYPERDX_ENABLED", "false").lower() == "true" and bool(hyperdx_api_key)
+    
+    # Create resource with service information
+    resource = Resource.create({
+        SERVICE_NAME: service_name,
+        SERVICE_VERSION: os.environ.get("SERVICE_VERSION", "1.0.0"),
+        SERVICE_NAMESPACE: service_namespace,
+        "deployment.environment": os.environ.get("DEPLOYMENT_ENV", "production"),
+        "host.name": os.environ.get("HOSTNAME", "unknown"),
+    })
+    
+    # Setup tracing
+    trace_provider = TracerProvider(resource=resource)
+    
+    if use_hyperdx:
+        # Use HTTP exporter for HyperDX with API key header
+        logger.info(f"Configuring HyperDX exporter at {hyperdx_endpoint}")
+        headers = {"authorization": hyperdx_api_key}
+        otlp_span_exporter = OTLPSpanExporterHTTP(
+            endpoint=f"{hyperdx_endpoint}/v1/traces",
+            headers=headers
+        )
+        otlp_metric_exporter = OTLPMetricExporterHTTP(
+            endpoint=f"{hyperdx_endpoint}/v1/metrics",
+            headers=headers
+        )
+    else:
+        # Use gRPC exporter for standard OTEL collector
+        otlp_span_exporter = OTLPSpanExporter(endpoint=otel_endpoint, insecure=True)
+        otlp_metric_exporter = OTLPMetricExporter(endpoint=otel_endpoint, insecure=True)
+    
+    trace_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
+    trace.set_tracer_provider(trace_provider)
+    
+    # Setup metrics
+    metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=60000)
+    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
+    metrics.set_meter_provider(meter_provider)
+    
+    # Instrument HTTPX
+    HTTPXClientInstrumentor().instrument()
+    
+    # Instrument logging
+    LoggingInstrumentor().instrument(set_logging_format=True)
+    
+    destination = "HyperDX" if use_hyperdx else "OTEL Collector"
+    logger.info(f"OpenTelemetry initialized - destination: {destination}, service: {service_name}")
+    
+    # Return tracer and meter for the service
+    tracer = trace.get_tracer(__name__)
+    meter = metrics.get_meter(__name__)
+    
+    return tracer, meter
+
+# Configuration from environment
+WHISPER_URL = os.environ.get("WHISPER_URL", "http://whisper-predictor.ai-ml.svc.cluster.local")
+NATS_URL = os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
+
+# NATS subjects for streaming
+STREAM_SUBJECT_PREFIX = "ai.voice.stream"  # Full subject: ai.voice.stream.{session_id}
+TRANSCRIPTION_SUBJECT_PREFIX = "ai.voice.transcription"  # Full subject: ai.voice.transcription.{session_id}
+
+# Streaming parameters
+BUFFER_SIZE_BYTES = int(os.environ.get("STT_BUFFER_SIZE_BYTES", "512000"))  # ~16 seconds at 16 kHz, 16-bit mono (32,000 bytes/s)
+CHUNK_TIMEOUT_SECONDS = float(os.environ.get("STT_CHUNK_TIMEOUT", "2.0"))  # Process after 2s without a new chunk
+MAX_BUFFER_SIZE_BYTES = int(os.environ.get("STT_MAX_BUFFER_SIZE", "5120000"))  # ~160 seconds max
+
+# Audio constants
+AUDIO_SAMPLE_MAX_INT16 = 32768.0  # Full-scale magnitude (2**15) of 16-bit signed audio, used for normalization
+VAD_VOICE_RATIO_THRESHOLD = float(os.environ.get("STT_VAD_VOICE_RATIO", "0.3"))  # Min ratio of voice frames
+
+# Voice Activity Detection (VAD) parameters
+ENABLE_VAD = os.environ.get("STT_ENABLE_VAD", "true").lower() == "true"
+VAD_AGGRESSIVENESS = int(os.environ.get("STT_VAD_AGGRESSIVENESS", "2"))  # 0-3, higher = more aggressive
+VAD_FRAME_DURATION_MS = int(os.environ.get("STT_VAD_FRAME_DURATION", "30"))  # 10, 20, or 30 ms
+
+# Audio threshold for interrupt detection (when LLM is responding)
+ENABLE_INTERRUPT_DETECTION = os.environ.get("STT_ENABLE_INTERRUPT_DETECTION", "true").lower() == "true"
+AUDIO_LEVEL_THRESHOLD = float(os.environ.get("STT_AUDIO_LEVEL_THRESHOLD", "0.02"))  # RMS threshold
+INTERRUPT_DURATION_THRESHOLD = float(os.environ.get("STT_INTERRUPT_DURATION", "0.5"))  # Seconds of speech to trigger
+
+# Speaker diarization
+ENABLE_SPEAKER_DIARIZATION = os.environ.get("STT_ENABLE_SPEAKER_DIARIZATION", "false").lower() == "true"
+
+# Session states
+SESSION_STATE_LISTENING = "listening"
+SESSION_STATE_RESPONDING = "responding"
+
+
+def calculate_audio_rms(audio_data: bytes, sample_width: int = 2) -> float:
+    """
+    Calculate RMS (Root Mean Square) audio level.
+    
+    Args:
+        audio_data: Raw audio bytes
+        sample_width: Bytes per sample (2 for 16-bit audio)
+    
+    Returns:
+        RMS level normalized to 0.0-1.0 range
+    """
+    if len(audio_data) < sample_width:
+        return 0.0
+    
+    # Convert bytes to numpy array of int16 samples
+    try:
+        samples = np.frombuffer(audio_data, dtype=np.int16)
+        # Calculate RMS and normalize
+        rms = np.sqrt(np.mean(samples.astype(np.float32) ** 2))
+        # Normalize to 0-1 range using defined constant
+        return float(rms / AUDIO_SAMPLE_MAX_INT16)
+    except Exception as e:
+        logger.warning(f"Error calculating RMS: {e}")
+        return 0.0
+
+
+def detect_voice_activity(audio_data: bytes, sample_rate: int = 16000) -> bool:
+    """
+    Detect if audio contains voice using WebRTC VAD.
+    
+    Args:
+        audio_data: Raw PCM audio bytes (16-bit, mono)
+        sample_rate: Audio sample rate (8000, 16000, 32000, or 48000)
+    
+    Returns:
+        True if voice is detected, False otherwise
+    """
+    if not ENABLE_VAD:
+        return True  # Assume voice present if VAD disabled
+    
+    try:
+        vad = webrtcvad.Vad(VAD_AGGRESSIVENESS)
+        
+        # WebRTC VAD requires specific frame sizes
+        # Frame duration must be 10, 20, or 30 ms
+        frame_size = int(sample_rate * VAD_FRAME_DURATION_MS / 1000) * 2  # *2 for 16-bit samples
+        
+        # Process audio in frames
+        voice_frames = 0
+        total_frames = 0
+        
+        for i in range(0, len(audio_data) - frame_size + 1, frame_size):
+            frame = audio_data[i:i + frame_size]
+            if len(frame) == frame_size:
+                try:
+                    is_speech = vad.is_speech(frame, sample_rate)
+                    if is_speech:
+                        voice_frames += 1
+                    total_frames += 1
+                except Exception as e:
+                    logger.debug(f"VAD frame processing error: {e}")
+                    continue
+        
+        if total_frames == 0:
+            return False
+        
+        # Consider voice detected if voice ratio exceeds threshold
+        voice_ratio = voice_frames / total_frames
+        return voice_ratio > VAD_VOICE_RATIO_THRESHOLD
+        
+    except Exception as e:
+        logger.warning(f"VAD error: {e}")
+        return True  # Default to voice present on error
+
+
+class AudioBuffer:
+    """Manages audio chunks for a streaming session with VAD and speaker tracking."""
+    
+    def __init__(self, session_id: str):
+        self.session_id = session_id
+        self.chunks = []
+        self.total_bytes = 0
+        self.last_chunk_time = time.time()
+        self.is_complete = False
+        self.sequence = 0
+        self.state = SESSION_STATE_LISTENING  # Current session state
+        self.speaker_id = None  # For speaker diarization
+        self.interrupt_start_time = None  # Track when interrupt detection started
+        self.has_voice_activity = False  # Track if voice was detected in recent chunks
+        self._last_chunk_vad_result = None  # Cache VAD result for last chunk
+        
+    def add_chunk(self, audio_data: bytes) -> None:
+        """Add an audio chunk to the buffer and check for voice activity."""
+        self.chunks.append(audio_data)
+        self.total_bytes += len(audio_data)
+        self.last_chunk_time = time.time()
+        
+        # Check for voice activity in this chunk and cache result
+        has_voice = detect_voice_activity(audio_data)
+        self.has_voice_activity = has_voice
+        self._last_chunk_vad_result = has_voice
+        
+        logger.debug(f"Session {self.session_id}: Added chunk, total {self.total_bytes} bytes, voice={has_voice}")
+        
+    def check_interrupt(self, audio_data: bytes) -> bool:
+        """
+        Check if audio indicates an interrupt during responding state.
+        Uses cached VAD result if available.
+        
+        Returns:
+            True if interrupt detected, False otherwise
+        """
+        if not ENABLE_INTERRUPT_DETECTION:
+            return False
+        
+        if self.state != SESSION_STATE_RESPONDING:
+            return False
+        
+        # Calculate audio level
+        rms_level = calculate_audio_rms(audio_data)
+        
+        # Use cached VAD result if available to avoid duplicate processing
+        has_voice = self._last_chunk_vad_result if self._last_chunk_vad_result is not None else detect_voice_activity(audio_data)
+        
+        # Check if audio exceeds threshold and contains voice
+        if rms_level >= AUDIO_LEVEL_THRESHOLD and has_voice:
+            if self.interrupt_start_time is None:
+                self.interrupt_start_time = time.time()
+                logger.info(f"Session {self.session_id}: Potential interrupt detected (RMS={rms_level:.3f})")
+            
+            # Check if interrupt has lasted long enough
+            elapsed = time.time() - self.interrupt_start_time
+            if elapsed >= INTERRUPT_DURATION_THRESHOLD:
+                logger.info(f"Session {self.session_id}: Interrupt confirmed after {elapsed:.1f}s")
+                return True
+        else:
+            # Reset interrupt timer if audio drops below threshold
+            self.interrupt_start_time = None
+        
+        return False
+    
+    def set_state(self, state: str) -> None:
+        """Set the session state (listening or responding)."""
+        if state in (SESSION_STATE_LISTENING, SESSION_STATE_RESPONDING):
+            old_state = self.state
+            self.state = state
+            if old_state != state:
+                logger.info(f"Session {self.session_id}: State changed from {old_state} to {state}")
+                # Reset interrupt tracking when changing states
+                self.interrupt_start_time = None
+        
+    def should_process(self) -> bool:
+        """Determine if buffer should be processed now."""
+        # Don't process if no voice activity detected (unless buffer is full or timed out)
+        if ENABLE_VAD and not self.has_voice_activity:
+            # Still process if buffer is very large or has timed out
+            if self.total_bytes < BUFFER_SIZE_BYTES and time.time() - self.last_chunk_time < CHUNK_TIMEOUT_SECONDS:
+                return False
+        
+        # Process if buffer size threshold reached
+        if self.total_bytes >= BUFFER_SIZE_BYTES:
+            return True
+        # Process if no chunks received for timeout duration
+        if time.time() - self.last_chunk_time > CHUNK_TIMEOUT_SECONDS and self.total_bytes > 0:
+            return True
+        # Process if buffer is too large (safety limit)
+        if self.total_bytes >= MAX_BUFFER_SIZE_BYTES:
+            return True
+        return False
+    
+    def get_audio(self) -> bytes:
+        """Get concatenated audio data."""
+        return b''.join(self.chunks)
+    
+    def clear(self) -> None:
+        """Clear the buffer after processing."""
+        self.chunks = []
+        self.total_bytes = 0
+        self.sequence += 1
+        self._last_chunk_vad_result = None  # Clear cached VAD result
+        
+    def mark_complete(self) -> None:
+        """Mark stream as complete."""
+        self.is_complete = True
+
+
+class StreamingSTT:
+    """Streaming Speech-to-Text service."""
+    
+    def __init__(self):
+        self.nc = None
+        self.js = None
+        self.http_client = None
+        self.sessions: Dict[str, AudioBuffer] = {}
+        self.running = True
+        self.processing_tasks = {}
+        self.is_healthy = False
+        self.tracer = None
+        self.meter = None
+        self.stream_counter = None
+        self.transcription_duration = None
+        
+    async def setup(self):
+        """Initialize connections."""
+        # Initialize OpenTelemetry
+        self.tracer, self.meter = setup_telemetry()
+        
+        # Create metrics if OTEL is enabled
+        if self.meter:
+            self.stream_counter = self.meter.create_counter(
+                name="stt_streams_total",
+                description="Total number of STT streams processed",
+                unit="1"
+            )
+            self.transcription_duration = self.meter.create_histogram(
+                name="stt_transcription_duration_seconds",
+                description="Duration of STT transcription",
+                unit="s"
+            )
+        
+        # NATS connection
+        self.nc = await nats.connect(NATS_URL)
+        logger.info(f"Connected to NATS at {NATS_URL}")
+        
+        # Initialize JetStream context
+        self.js = self.nc.jetstream()
+        
+        # Create or update stream for voice stream messages
+        try:
+            stream_config = nats.js.api.StreamConfig(
+                name="AI_VOICE_STREAM",
+                subjects=["ai.voice.stream.>", "ai.voice.transcription.>"],
+                retention=nats.js.api.RetentionPolicy.LIMITS,
+                max_age=300,  # Keep messages for 5 minutes only (streaming is ephemeral)
+                storage=nats.js.api.StorageType.MEMORY,  # Use memory for streaming data
+            )
+            await self.js.add_stream(stream_config)
+            logger.info("Created/updated JetStream stream: AI_VOICE_STREAM")
+        except Exception as e:
+            # Stream might already exist
+            logger.info(f"JetStream stream setup: {e}")
+        
+        # HTTP client for Whisper service
+        self.http_client = httpx.AsyncClient(timeout=180.0)
+        logger.info("HTTP client initialized")
+        
+        # Mark as healthy once connections are established
+        self.is_healthy = True
+        
+    async def transcribe(self, audio_bytes: bytes) -> Optional[str]:
+        """Transcribe audio using Whisper."""
+        try:
+            files = {"file": ("audio.wav", audio_bytes, "audio/wav")}
+            response = await self.http_client.post(
+                f"{WHISPER_URL}/v1/audio/transcriptions",
+                files=files
+            )
+            response.raise_for_status()
+            result = response.json()
+            transcript = result.get("text", "")
+            logger.info(f"Transcribed: {transcript[:100]}...")
+            return transcript
+        except Exception as e:
+            logger.error(f"Transcription failed: {e}")
+            return None
+    
+    async def process_buffer(self, session_id: str):
+        """Process accumulated audio buffer for a session."""
+        buffer = self.sessions.get(session_id)
+        if not buffer:
+            return
+            
+        audio_data = buffer.get_audio()
+        if not audio_data:
+            return
+        
+        logger.info(f"Processing {len(audio_data)} bytes for session {session_id}, sequence {buffer.sequence}")
+        
+        # Transcribe
+        transcript = await self.transcribe(audio_data)
+        
+        if transcript:
+            # Publish transcription result using msgpack binary format
+            result = {
+                "session_id": session_id,
+                "transcript": transcript,
+                "sequence": buffer.sequence,
+                "is_partial": not buffer.is_complete,
+                "is_final": buffer.is_complete,
+                "timestamp": time.time(),
+                "speaker_id": buffer.speaker_id,
+                "has_voice_activity": buffer.has_voice_activity,
+                "state": buffer.state
+            }
+            
+            await self.nc.publish(
+                f"{TRANSCRIPTION_SUBJECT_PREFIX}.{session_id}",
+                msgpack.packb(result)
+            )
+            logger.info(f"Published transcription for session {session_id} (seq {buffer.sequence}, speaker={buffer.speaker_id})")
+        
+        # Clear buffer after processing
+        buffer.clear()
+        
+        # Clean up completed sessions asynchronously
+        if buffer.is_complete:
+            logger.info(f"Session {session_id} completed")
+            # Schedule cleanup task to avoid blocking
+            asyncio.create_task(self._cleanup_session(session_id))
+    
+    async def _cleanup_session(self, session_id: str):
+        """Clean up a completed session after a delay."""
+        # Keep session for a bit in case of late messages
+        await asyncio.sleep(5)
+        if session_id in self.sessions:
+            del self.sessions[session_id]
+            logger.info(f"Cleaned up session: {session_id}")
+        if session_id in self.processing_tasks:
+            del self.processing_tasks[session_id]
+    
+    async def monitor_buffer(self, session_id: str):
+        """Monitor buffer and trigger processing when needed."""
+        while self.running and session_id in self.sessions:
+            buffer = self.sessions.get(session_id)
+            if not buffer:
+                break
+                
+            if buffer.should_process():
+                await self.process_buffer(session_id)
+            
+            # Don't spin too fast
+            await asyncio.sleep(0.1)
+    
+    async def handle_stream_message(self, msg: Msg):
+        """Handle incoming audio stream message."""
+        try:
+            # Extract session_id from subject: ai.voice.stream.{session_id}
+            subject_parts = msg.subject.split('.')
+            if len(subject_parts) < 4:
+                logger.warning(f"Invalid subject format: {msg.subject}")
+                return
+            
+            session_id = subject_parts[3]
+            
+            # Parse message using msgpack binary format
+            data = msgpack.unpackb(msg.data, raw=False)
+            
+            # Handle control messages
+            if data.get("type") == "start":
+                logger.info(f"Starting stream session: {session_id}")
+                self.sessions[session_id] = AudioBuffer(session_id)
+                # Set initial state if provided
+                initial_state = data.get("state", SESSION_STATE_LISTENING)
+                self.sessions[session_id].set_state(initial_state)
+                # Store speaker_id if provided
+                speaker_id = data.get("speaker_id")
+                if speaker_id:
+                    self.sessions[session_id].speaker_id = speaker_id
+                    logger.info(f"Session {session_id}: Speaker ID set to {speaker_id}")
+                # Start monitoring task for this session
+                task = asyncio.create_task(self.monitor_buffer(session_id))
+                self.processing_tasks[session_id] = task
+                return
+            
+            if data.get("type") == "state_change":
+                logger.info(f"State change for session {session_id}")
+                buffer = self.sessions.get(session_id)
+                if buffer:
+                    new_state = data.get("state", SESSION_STATE_LISTENING)
+                    buffer.set_state(new_state)
+                    
+                    # If switching to listening mode, reset any interrupt tracking
+                    if new_state == SESSION_STATE_LISTENING:
+                        buffer.interrupt_start_time = None
+                return
+            
+            if data.get("type") == "end":
+                logger.info(f"Ending stream session: {session_id}")
+                buffer = self.sessions.get(session_id)
+                if buffer:
+                    buffer.mark_complete()
+                    # Process any remaining audio
+                    if buffer.total_bytes > 0:
+                        await self.process_buffer(session_id)
+                return
+            
+            # Handle audio chunk
+            if data.get("type") == "chunk":
+                audio_b64 = data.get("audio_b64", "")
+                if not audio_b64:
+                    return
+                
+                audio_bytes = base64.b64decode(audio_b64)
+                
+                # Create session if it doesn't exist (handle missing start message)
+                # Check both sessions and processing_tasks to avoid race conditions
+                if session_id not in self.sessions:
+                    logger.info(f"Auto-creating session: {session_id}")
+                    self.sessions[session_id] = AudioBuffer(session_id)
+                    # Only create monitoring task if not already exists
+                    if session_id not in self.processing_tasks:
+                        task = asyncio.create_task(self.monitor_buffer(session_id))
+                        self.processing_tasks[session_id] = task
+                
+                buffer = self.sessions[session_id]
+                
+                # Check for interrupt if in responding state
+                if buffer.check_interrupt(audio_bytes):
+                    # Publish interrupt notification
+                    interrupt_msg = {
+                        "session_id": session_id,
+                        "type": "interrupt",
+                        "timestamp": time.time(),
+                        "speaker_id": buffer.speaker_id
+                    }
+                    await self.nc.publish(
+                        f"{TRANSCRIPTION_SUBJECT_PREFIX}.{session_id}",
+                        msgpack.packb(interrupt_msg)
+                    )
+                    logger.info(f"Published interrupt notification for session {session_id}")
+                    
+                    # Automatically switch back to listening mode
+                    buffer.set_state(SESSION_STATE_LISTENING)
+                
+                # Add chunk to buffer
+                buffer.add_chunk(audio_bytes)
+                
+        except Exception as e:
+            logger.error(f"Error handling stream message: {e}", exc_info=True)
+    
+    async def run(self):
+        """Main run loop."""
+        await self.setup()
+        
+        # Note: STT streaming uses regular NATS subscribe (not pull-based JetStream consumer)
+        # because it handles real-time ephemeral audio streams with wildcard subscriptions.
+        # The stream audio chunks are not meant to be persisted long-term or replayed.
+        # However, the transcription RESULTS are published to JetStream for persistence.
+        sub = await self.nc.subscribe(f"{STREAM_SUBJECT_PREFIX}.>", cb=self.handle_stream_message)
+        logger.info(f"Subscribed to {STREAM_SUBJECT_PREFIX}.>")
+        
+        # Handle shutdown
+        def signal_handler():
+            self.running = False
+        
+        loop = asyncio.get_running_loop()
+        for sig in (signal.SIGTERM, signal.SIGINT):
+            loop.add_signal_handler(sig, signal_handler)
+        
+        # Keep running
+        while self.running:
+            await asyncio.sleep(1)
+        
+        # Cleanup
+        logger.info("Shutting down...")
+        
+        # Cancel all monitoring tasks and wait for them to complete
+        for task in self.processing_tasks.values():
+            task.cancel()
+        
+        # Wait for all tasks to complete or be cancelled
+        if self.processing_tasks:
+            await asyncio.gather(*self.processing_tasks.values(), return_exceptions=True)
+        
+        await sub.unsubscribe()
+        await self.nc.close()
+        await self.http_client.aclose()
+        logger.info("Shutdown complete")
+
+
+if __name__ == "__main__":
+    service = StreamingSTT()
+    asyncio.run(service.run())
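
Review note: `handle_stream_message` above implies a small wire protocol on `ai.voice.stream.{session_id}`: a `start` message (optionally carrying `speaker_id`), `chunk` messages with base64-encoded audio in `audio_b64`, and an `end` message. The sketch below shows what a publisher might build under that reading. The helper names are hypothetical and not part of this commit. Each dict would still need to be packed with `msgpack.packb` before publishing; that step is left out so the sketch depends only on the standard library.

```python
import base64
from typing import Optional

STREAM_SUBJECT_PREFIX = "ai.voice.stream"  # same prefix the service subscribes to


def stream_subject(session_id: str) -> str:
    """Build the subject for one session (hypothetical helper).

    The handler takes the 4th dot-separated token as the session id,
    so an id containing '.' would be truncated on the service side.
    """
    if "." in session_id:
        raise ValueError("session_id must not contain '.'")
    return f"{STREAM_SUBJECT_PREFIX}.{session_id}"


def session_id_from_subject(subject: str) -> Optional[str]:
    """Mirror of the handler's parsing: None when the subject is too short."""
    parts = subject.split(".")
    return parts[3] if len(parts) >= 4 else None


def start_message(speaker_id: Optional[str] = None) -> dict:
    """Control message that opens a session; speaker_id is optional."""
    msg = {"type": "start"}
    if speaker_id:
        msg["speaker_id"] = speaker_id
    return msg


def chunk_message(audio: bytes) -> dict:
    """Audio message; the handler base64-decodes the audio_b64 field."""
    return {"type": "chunk", "audio_b64": base64.b64encode(audio).decode("ascii")}


def end_message() -> dict:
    """Control message that marks the stream complete."""
    return {"type": "end"}
```

A publisher would send `start_message()`, then any number of `chunk_message(...)` payloads, then `end_message()`, all on `stream_subject(session_id)`. The handler also auto-creates a session when a chunk arrives without a preceding `start`.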
--- a/stt_streaming_local.py
+++ b/stt_streaming_local.py
@@ -0,0 +1,511 @@
+#!/usr/bin/env python3
+"""
+Streaming STT Service with Local Whisper on ROCm
+
+Real-time Speech-to-Text service that processes live audio streams from NATS
+using local Whisper model running on AMD GPU via ROCm:
+
+1. Subscribe to audio stream subject (ai.voice.stream.{session_id})
+2. Buffer and accumulate audio chunks
+3. Transcribe locally using Whisper on AMD GPU
+4. Publish transcription results to response channel (ai.voice.transcription.{session_id})
+
+This version runs Whisper directly on the AMD GPU using ROCm/PyTorch backend
+instead of calling an external Whisper service.
+
+Supports HyperDX for observability via OpenTelemetry.
+"""
+import asyncio
+import base64
+import contextlib
+import io
+import logging
+import os
+import signal
+import tempfile
+import time
+from typing import Dict, Optional
+
+from aiohttp import web
+import msgpack
+import nats
+import nats.js
+import numpy as np
+import soundfile as sf
+import torch
+import whisper
+from nats.aio.msg import Msg
+
+# OpenTelemetry imports
+from opentelemetry import trace, metrics
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.sdk.metrics import MeterProvider
+from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
+from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter as OTLPSpanExporterHTTP
+from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter as OTLPMetricExporterHTTP
+from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION, SERVICE_NAMESPACE
+from opentelemetry.instrumentation.logging import LoggingInstrumentor
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger("stt-streaming-rocm")
+
+
+def setup_telemetry():
+    """Initialize OpenTelemetry tracing and metrics with HyperDX support."""
+    # Check if OTEL is enabled
+    otel_enabled = os.environ.get("OTEL_ENABLED", "true").lower() == "true"
+    if not otel_enabled:
+        logger.info("OpenTelemetry disabled")
+        return None, None
+    
+    # OTEL configuration
+    otel_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://opentelemetry-collector.observability.svc.cluster.local:4317")
+    service_name = os.environ.get("OTEL_SERVICE_NAME", "stt-streaming-rocm")
+    service_namespace = os.environ.get("OTEL_SERVICE_NAMESPACE", "ai-ml")
+    
+    # HyperDX configuration
+    hyperdx_api_key = os.environ.get("HYPERDX_API_KEY", "")
+    hyperdx_endpoint = os.environ.get("HYPERDX_ENDPOINT", "https://in-otel.hyperdx.io")
+    use_hyperdx = os.environ.get("HYPERDX_ENABLED", "false").lower() == "true" and bool(hyperdx_api_key)
+    
+    # Create resource with service information
+    resource = Resource.create({
+        SERVICE_NAME: service_name,
+        SERVICE_VERSION: os.environ.get("SERVICE_VERSION", "1.0.0"),
+        SERVICE_NAMESPACE: service_namespace,
+        "deployment.environment": os.environ.get("DEPLOYMENT_ENV", "production"),
+        "host.name": os.environ.get("HOSTNAME", "unknown"),
+    })
+    
+    # Setup tracing
+    trace_provider = TracerProvider(resource=resource)
+    
+    if use_hyperdx:
+        # Use HTTP exporter for HyperDX with API key header
+        logger.info(f"Configuring HyperDX exporter at {hyperdx_endpoint}")
+        headers = {"authorization": hyperdx_api_key}
+        otlp_span_exporter = OTLPSpanExporterHTTP(
+            endpoint=f"{hyperdx_endpoint}/v1/traces",
+            headers=headers
+        )
+        otlp_metric_exporter = OTLPMetricExporterHTTP(
+            endpoint=f"{hyperdx_endpoint}/v1/metrics",
+            headers=headers
+        )
+    else:
+        # Use gRPC exporter for standard OTEL collector
+        logger.info(f"Configuring OTEL gRPC exporter at {otel_endpoint}")
+        otlp_span_exporter = OTLPSpanExporter(endpoint=otel_endpoint, insecure=True)
+        otlp_metric_exporter = OTLPMetricExporter(endpoint=otel_endpoint, insecure=True)
+    
+    trace_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
+    trace.set_tracer_provider(trace_provider)
+    
+    # Setup metrics
+    metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=60000)
+    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
+    metrics.set_meter_provider(meter_provider)
+    
+    # Instrument logging
+    LoggingInstrumentor().instrument(set_logging_format=True)
+    
+    destination = "HyperDX" if use_hyperdx else "OTEL Collector"
+    logger.info(f"OpenTelemetry initialized - destination: {destination}, service: {service_name}")
+    
+    # Return tracer and meter for the service
+    tracer = trace.get_tracer(__name__)
+    meter = metrics.get_meter(__name__)
+    
+    return tracer, meter
+
+
+# Configuration from environment
+NATS_URL = os.environ.get("NATS_URL", "nats://nats.ai-ml.svc.cluster.local:4222")
+WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "medium")
+WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cuda")  # cuda uses ROCm on AMD
+WHISPER_FP16 = os.environ.get("WHISPER_FP16", "true").lower() == "true"
+
+# NATS subjects for streaming
+STREAM_SUBJECT_PREFIX = "ai.voice.stream"  # Full subject: ai.voice.stream.{session_id}
+TRANSCRIPTION_SUBJECT_PREFIX = "ai.voice.transcription"  # Full subject: ai.voice.transcription.{session_id}
+
+# Streaming parameters
+BUFFER_SIZE_BYTES = int(os.environ.get("STT_BUFFER_SIZE_BYTES", "512000"))  # ~16 seconds at 16kHz 16-bit mono (32,000 bytes/s)
+CHUNK_TIMEOUT_SECONDS = float(os.environ.get("STT_CHUNK_TIMEOUT", "2.0"))  # Process after 2s without a new chunk
+MAX_BUFFER_SIZE_BYTES = int(os.environ.get("STT_MAX_BUFFER_SIZE", "5120000"))  # ~160 seconds max at 16kHz 16-bit mono
+
+# Health server port for kserve compatibility
+HEALTH_PORT = int(os.environ.get("HEALTH_PORT", "8000"))
+
+
+class AudioBuffer:
+    """Manages audio chunks for a streaming session."""
+    
+    def __init__(self, session_id: str):
+        self.session_id = session_id
+        self.chunks = []
+        self.total_bytes = 0
+        self.last_chunk_time = time.time()
+        self.is_complete = False
+        self.sequence = 0
+        
+    def add_chunk(self, audio_data: bytes) -> None:
+        """Add an audio chunk to the buffer."""
+        self.chunks.append(audio_data)
+        self.total_bytes += len(audio_data)
+        self.last_chunk_time = time.time()
+        logger.debug(f"Session {self.session_id}: Added chunk, total {self.total_bytes} bytes")
+        
+    def should_process(self) -> bool:
+        """Determine if buffer should be processed now."""
+        # Process if buffer size threshold reached
+        if self.total_bytes >= BUFFER_SIZE_BYTES:
+            return True
+        # Process if no chunks received for timeout duration
+        if time.time() - self.last_chunk_time > CHUNK_TIMEOUT_SECONDS and self.total_bytes > 0:
+            return True
+        # Process if buffer is too large (safety limit)
+        if self.total_bytes >= MAX_BUFFER_SIZE_BYTES:
+            return True
+        return False
+    
+    def get_audio(self) -> bytes:
+        """Get concatenated audio data."""
+        return b''.join(self.chunks)
+    
+    def clear(self) -> None:
+        """Clear the buffer after processing."""
+        self.chunks = []
+        self.total_bytes = 0
+        self.sequence += 1
+        
+    def mark_complete(self) -> None:
+        """Mark stream as complete."""
+        self.is_complete = True
+
+
+class StreamingSTTLocal:
+    """Streaming Speech-to-Text service with local Whisper on ROCm."""
+    
+    def __init__(self):
+        self.nc = None
+        self.js = None
+        self.whisper_model = None
+        self.sessions: Dict[str, AudioBuffer] = {}
+        self.running = True
+        self.processing_tasks = {}
+        self.is_healthy = False
+        self.tracer = None
+        self.meter = None
+        self.stream_counter = None
+        self.transcription_duration = None
+        self.gpu_memory_gauge = None
+        
+    async def setup(self):
+        """Initialize connections and load model."""
+        # Initialize OpenTelemetry
+        self.tracer, self.meter = setup_telemetry()
+        
+        # Create metrics if OTEL is enabled
+        if self.meter:
+            self.stream_counter = self.meter.create_counter(
+                name="stt_streams_total",
+                description="Total number of STT streams processed",
+                unit="1"
+            )
+            self.transcription_duration = self.meter.create_histogram(
+                name="stt_transcription_duration_seconds",
+                description="Duration of STT transcription",
+                unit="s"
+            )
+            self.gpu_memory_gauge = self.meter.create_observable_gauge(
+                name="stt_gpu_memory_bytes",
+                description="GPU memory usage in bytes",
+                callbacks=[self._get_gpu_memory]
+            )
+        
+        # Check GPU availability
+        if torch.cuda.is_available():
+            gpu_name = torch.cuda.get_device_name(0)
+            gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
+            logger.info(f"ROCm GPU available: {gpu_name} ({gpu_memory:.1f}GB)")
+        else:
+            logger.warning("No GPU available, falling back to CPU")
+        
+        # Load Whisper model
+        logger.info(f"Loading Whisper model: {WHISPER_MODEL_SIZE} on {WHISPER_DEVICE}")
+        start_time = time.time()
+        self.whisper_model = whisper.load_model(WHISPER_MODEL_SIZE, device=WHISPER_DEVICE)
+        load_time = time.time() - start_time
+        logger.info(f"Whisper model loaded in {load_time:.2f}s")
+        
+        # NATS connection
+        self.nc = await nats.connect(NATS_URL)
+        logger.info(f"Connected to NATS at {NATS_URL}")
+        
+        # Initialize JetStream context
+        self.js = self.nc.jetstream()
+        
+        # Create or update stream for voice stream messages
+        try:
+            stream_config = nats.js.api.StreamConfig(
+                name="AI_VOICE_STREAM",
+                subjects=["ai.voice.stream.>", "ai.voice.transcription.>"],
+                retention=nats.js.api.RetentionPolicy.LIMITS,
+                max_age=300,  # Keep messages for 5 minutes only (streaming is ephemeral)
+                storage=nats.js.api.StorageType.MEMORY,  # Use memory for streaming data
+            )
+            await self.js.add_stream(stream_config)
+            logger.info("Created/updated JetStream stream: AI_VOICE_STREAM")
+        except Exception as e:
+            # Stream might already exist
+            logger.info(f"JetStream stream setup: {e}")
+        
+        # Mark as healthy once connections are established
+        self.is_healthy = True
+    
+    async def health_handler(self, request: web.Request) -> web.Response:
+        """Handle health check requests for kserve compatibility."""
+        if self.is_healthy:
+            return web.json_response({
+                "status": "healthy",
+                "model": WHISPER_MODEL_SIZE,
+                "device": WHISPER_DEVICE,
+                "nats_connected": self.nc is not None and self.nc.is_connected,
+            })
+        else:
+            return web.json_response(
+                {"status": "unhealthy", "model": WHISPER_MODEL_SIZE},
+                status=503
+            )
+    
+    async def start_health_server(self) -> web.AppRunner:
+        """Start HTTP health server for kserve agent sidecar."""
+        app = web.Application()
+        app.router.add_get("/health", self.health_handler)
+        app.router.add_get("/ready", self.health_handler)
+        app.router.add_get("/", self.health_handler)
+        
+        runner = web.AppRunner(app)
+        await runner.setup()
+        site = web.TCPSite(runner, "0.0.0.0", HEALTH_PORT)
+        await site.start()
+        logger.info(f"Health server started on port {HEALTH_PORT}")
+        return runner
+    
+    def _get_gpu_memory(self, options):
+        """Callback for GPU memory gauge."""
+        if torch.cuda.is_available():
+            memory_used = torch.cuda.memory_allocated(0)
+            yield metrics.Observation(memory_used, {"device": "0"})
+        
+    def transcribe(self, audio_bytes: bytes) -> Optional[str]:
+        """Transcribe audio using local Whisper model."""
+        start_time = time.time()
+        
+        try:
+            # Write audio to temp file (Whisper needs file path or numpy array)
+            with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as tmp:
+                tmp.write(audio_bytes)
+                tmp.flush()
+                
+                # Transcribe with Whisper
+                result = self.whisper_model.transcribe(
+                    tmp.name,
+                    fp16=WHISPER_FP16 and WHISPER_DEVICE.startswith("cuda"),
+                    language="en",  # Can be made configurable
+                )
+                
+            transcript = result.get("text", "").strip()
+            
+            duration = time.time() - start_time
+            audio_duration = len(audio_bytes) / (16000 * 2)  # Assuming 16kHz 16-bit
+            rtf = duration / audio_duration if audio_duration > 0 else 0
+            
+            logger.info(f"Transcribed in {duration:.2f}s (RTF: {rtf:.2f}): {transcript[:100]}...")
+            
+            # Record metrics
+            if self.transcription_duration:
+                self.transcription_duration.record(duration, {"model": WHISPER_MODEL_SIZE})
+            
+            return transcript
+            
+        except Exception as e:
+            logger.error(f"Transcription failed: {e}", exc_info=True)
+            return None
+    
+    async def process_buffer(self, session_id: str):
+        """Process accumulated audio buffer for a session."""
+        buffer = self.sessions.get(session_id)
+        if not buffer:
+            return
+            
+        audio_data = buffer.get_audio()
+        if not audio_data:
+            return
+        
+        # Snapshot the sequence and clear the buffer before awaiting, so chunks
+        # that arrive while transcription is running are kept for the next pass
+        # instead of being discarded.
+        sequence = buffer.sequence
+        buffer.clear()
+        
+        logger.info(f"Processing {len(audio_data)} bytes for session {session_id}, sequence {sequence}")
+        
+        # Record stream counter
+        if self.stream_counter:
+            self.stream_counter.add(1, {"session_id": session_id})
+        
+        # Transcribe in thread pool to avoid blocking event loop
+        loop = asyncio.get_running_loop()
+        transcript = await loop.run_in_executor(None, self.transcribe, audio_data)
+        
+        if transcript:
+            # Publish transcription result using msgpack binary format
+            result = {
+                "session_id": session_id,
+                "transcript": transcript,
+                "sequence": sequence,
+                "is_partial": not buffer.is_complete,
+                "is_final": buffer.is_complete,
+                "timestamp": time.time(),
+                "model": WHISPER_MODEL_SIZE,
+                "device": WHISPER_DEVICE,
+            }
+            
+            await self.nc.publish(
+                f"{TRANSCRIPTION_SUBJECT_PREFIX}.{session_id}",
+                msgpack.packb(result)
+            )
+            logger.info(f"Published transcription for session {session_id} (seq {sequence})")
+        
+        # Clean up completed sessions asynchronously
+        if buffer.is_complete:
+            logger.info(f"Session {session_id} completed")
+            # Schedule cleanup task to avoid blocking
+            asyncio.create_task(self._cleanup_session(session_id))
+    
+    async def _cleanup_session(self, session_id: str):
+        """Clean up a completed session after a delay."""
+        # Keep session for a bit in case of late messages
+        await asyncio.sleep(5)
+        if session_id in self.sessions:
+            del self.sessions[session_id]
+            logger.info(f"Cleaned up session: {session_id}")
+        if session_id in self.processing_tasks:
+            del self.processing_tasks[session_id]
+    
+    async def monitor_buffer(self, session_id: str):
+        """Monitor buffer and trigger processing when needed."""
+        while self.running and session_id in self.sessions:
+            buffer = self.sessions.get(session_id)
+            if not buffer:
+                break
+                
+            if buffer.should_process():
+                await self.process_buffer(session_id)
+            
+            # Don't spin too fast
+            await asyncio.sleep(0.1)
+    
+    async def handle_stream_message(self, msg: Msg):
+        """Handle incoming audio stream message."""
+        try:
+            # Extract session_id from subject: ai.voice.stream.{session_id}
+            subject_parts = msg.subject.split('.')
+            if len(subject_parts) < 4:
+                logger.warning(f"Invalid subject format: {msg.subject}")
+                return
+            
+            session_id = subject_parts[3]
+            
+            # Parse message using msgpack binary format
+            data = msgpack.unpackb(msg.data, raw=False)
+            
+            # Handle control messages
+            if data.get("type") == "start":
+                logger.info(f"Starting stream session: {session_id}")
+                self.sessions[session_id] = AudioBuffer(session_id)
+                # Start monitoring task for this session
+                task = asyncio.create_task(self.monitor_buffer(session_id))
+                self.processing_tasks[session_id] = task
+                return
+            
+            if data.get("type") == "end":
+                logger.info(f"Ending stream session: {session_id}")
+                buffer = self.sessions.get(session_id)
+                if buffer:
+                    buffer.mark_complete()
+                    # Process any remaining audio
+                    if buffer.total_bytes > 0:
+                        await self.process_buffer(session_id)
+                return
+            
+            # Handle audio chunk
+            if data.get("type") == "chunk":
+                audio_b64 = data.get("audio_b64", "")
+                if not audio_b64:
+                    return
+                
+                audio_bytes = base64.b64decode(audio_b64)
+                
+                # Create session if it doesn't exist (handle missing start message)
+                if session_id not in self.sessions:
+                    logger.info(f"Auto-creating session: {session_id}")
+                    self.sessions[session_id] = AudioBuffer(session_id)
+                    if session_id not in self.processing_tasks:
+                        task = asyncio.create_task(self.monitor_buffer(session_id))
+                        self.processing_tasks[session_id] = task
+                
+                # Add chunk to buffer
+                self.sessions[session_id].add_chunk(audio_bytes)
+                
+        except Exception as e:
+            logger.error(f"Error handling stream message: {e}", exc_info=True)
+    
+    async def run(self):
+        """Main run loop."""
+        await self.setup()
+        
+        # Start health server for kserve compatibility
+        health_runner = await self.start_health_server()
+        
+        # Subscribe to voice stream
+        sub = await self.nc.subscribe(f"{STREAM_SUBJECT_PREFIX}.>", cb=self.handle_stream_message)
+        logger.info(f"Subscribed to {STREAM_SUBJECT_PREFIX}.>")
+        
+        # Handle shutdown
+        def signal_handler():
+            self.running = False
+        
+        loop = asyncio.get_running_loop()
+        for sig in (signal.SIGTERM, signal.SIGINT):
+            loop.add_signal_handler(sig, signal_handler)
+        
+        # Keep running
+        while self.running:
+            await asyncio.sleep(1)
+        
+        # Cleanup
+        logger.info("Shutting down...")
+        
+        # Cancel all monitoring tasks and wait for them to complete
+        for task in self.processing_tasks.values():
+            task.cancel()
+        
+        if self.processing_tasks:
+            await asyncio.gather(*self.processing_tasks.values(), return_exceptions=True)
+        
+        await sub.unsubscribe()
+        await self.nc.close()
+        await health_runner.cleanup()
+        logger.info("Shutdown complete")
+
+
+if __name__ == "__main__":
+    service = StreamingSTTLocal()
+    asyncio.run(service.run())
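
Review note: the buffer thresholds in `stt_streaming_local.py` are byte counts. Under the audio format that `transcribe()` assumes for its real-time-factor calculation (16 kHz, 16-bit; mono is an additional assumption made here), they can be converted to seconds as sketched below. The helper names are hypothetical and not part of this commit.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int = 16000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio, given its size in bytes.

    Defaults assume 16 kHz, 16-bit (2 bytes per sample), mono.
    """
    bytes_per_second = sample_rate * sample_width * channels
    return num_bytes / bytes_per_second


def pcm_bytes_for(seconds: float, sample_rate: int = 16000,
                  sample_width: int = 2, channels: int = 1) -> int:
    """Inverse helper: byte count needed to hold `seconds` of raw PCM."""
    return int(seconds * sample_rate * sample_width * channels)


# Defaults used by the service for STT_BUFFER_SIZE_BYTES and STT_MAX_BUFFER_SIZE
print(pcm_duration_seconds(512_000))    # 16.0
print(pcm_duration_seconds(5_120_000))  # 160.0
print(pcm_bytes_for(5.0))               # 160000
```

If a processing threshold of about 5 seconds is the intent, `STT_BUFFER_SIZE_BYTES=160000` would be the matching value under these assumptions.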