# Realtime_mlx_STT

[![PyPI version](https://badge.fury.io/py/realtime-mlx-stt.svg)](https://badge.fury.io/py/realtime-mlx-stt)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Platform](https://img.shields.io/badge/platform-macOS%20(Apple%20Silicon)-lightgrey.svg)](https://support.apple.com/en-us/HT211814)

High-performance speech-to-text library optimized exclusively for Apple Silicon, leveraging the MLX framework for low-latency, real-time, on-device transcription.

> ⚠️ **IMPORTANT: This library is designed for LOCAL USE ONLY on macOS with Apple Silicon.** The included server is a development tool and should NOT be exposed to the internet or used in production environments without implementing proper security measures.

## Features

- **Real-time transcription** with low latency using MLX Whisper
- **Multiple APIs** - Python API, REST API, and WebSocket for different use cases  
- **Apple Silicon optimization** using MLX with Neural Engine acceleration
- **Voice activity detection** with WebRTC and Silero (configurable thresholds)
- **Wake word detection** using Porcupine ("Jarvis", "Alexa", etc.)
- **OpenAI integration** for cloud-based transcription alternative
- **Interactive CLI** for easy exploration of features
- **Web UI** with modern interface and real-time updates
- **Profile system** for quick configuration switching
- **Event-driven architecture** with command pattern
- **Thread-safe** and production-ready

## Language Selection

The Whisper large-v3-turbo model supports 99 languages with intelligent language detection:

- **Language-specific mode**: When you select a specific language (e.g., Norwegian, French, Spanish), the model uses language-specific tokens that significantly improve transcription accuracy for that language
- **Multi-language capability**: Even with a language selected, Whisper can still transcribe other languages if spoken - it's not restricted to only the selected language
- **Accuracy benefit**: Selecting the primary language you'll be speaking provides much more accurate transcription compared to auto-detect mode
- **Auto-detect mode**: When no language is specified, the model attempts to detect the language automatically, though with potentially lower accuracy

For example, if you select Norwegian (`no`) as your language:
- Norwegian speech will be transcribed with high accuracy
- English speech will still be transcribed correctly if spoken
- The model uses the Norwegian language token (50288) to optimize for Norwegian

This behavior matches OpenAI's Whisper API - the language parameter guides but doesn't restrict the model.
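
In code, this looks like the following minimal sketch using the `Transcriber` API documented later in this README (`no` is the ISO 639-1 code for Norwegian):

```python
from realtime_mlx_stt import Transcriber

# Optimized for Norwegian via the language token, but other spoken
# languages are still transcribed if they occur.
transcriber = Transcriber(language="no")
text = transcriber.transcribe_from_mic(duration=5)
print(text)
```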

## Requirements

- **macOS** with Apple Silicon (M1/M2/M3) - Required, not optional
- **Python 3.9+** (3.11+ recommended for best performance)
- **MLX** for Apple Silicon optimization
- **PyAudio** for audio capture
- **WebRTC VAD** and **Silero VAD** for voice activity detection
- **Porcupine** for wake word detection (optional)
- **Torch** and **NumPy** for audio processing

> **Important Note**: This library is specifically optimized for Apple Silicon and will not work on Intel-based Macs or other platforms. It requires the Neural Engine found in Apple Silicon chips to achieve optimal performance.

## Installation

### Install from PyPI (Recommended)

```bash
# Basic installation
pip install realtime-mlx-stt

# With OpenAI support for cloud transcription
pip install "realtime-mlx-stt[openai]"

# With development tools
pip install "realtime-mlx-stt[dev]"

# With server support for REST/WebSocket APIs
pip install "realtime-mlx-stt[server]"

# Install everything
pip install "realtime-mlx-stt[openai,server,dev]"
```

### Install from Source

```bash
# Clone the repository
git clone https://github.com/kristofferv98/Realtime_mlx_STT.git
cd Realtime_mlx_STT

# Set up a Python environment (3.9+ required, 3.11+ recommended)
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e .
```

## 📚 Documentation

- **[Usage Guide](USAGE_GUIDE.md)** - Common patterns and troubleshooting
- **[API Reference](realtime_mlx_stt/README.md)** - Detailed API documentation
- **[Examples](examples/)** - Working code examples

## Quick Start

### Interactive CLI (Recommended)

The easiest way to explore all features:

```bash
python examples/cli.py
```

This provides a menu-driven interface for:
- Quick 10-second transcription
- Continuous streaming mode
- OpenAI cloud transcription
- Wake word detection
- Audio device selection
- Language configuration

### Python API

```python
from realtime_mlx_stt import STTClient

# Simple transcription
client = STTClient()
for result in client.transcribe(duration=10):
    print(result.text)

# With OpenAI
client = STTClient(openai_api_key="sk-...")
for result in client.transcribe(engine="openai"):
    print(result.text)

# Wake word mode
client.start_wake_word("jarvis")
```

### Server Mode

> **Security Note**: The server is for local development only and binds to localhost by default. Do NOT expose it to the internet without proper authentication and security measures.

```bash
# Start server (localhost only - safe)
cd example_server
python server_example.py

# Opens web UI at http://localhost:8000
```

## Architecture

The library provides two specialized interfaces built on a common Features layer:

```
┌──────────────────────────────────────────────────┐
│          User Interfaces                         │
│  • CLI (examples/cli.py)                         │
│  • Web UI (example_server/)                      │
├──────────────────────────────────────────────────┤
│          API Layers                              │
│  • Python API (realtime_mlx_stt/)                │
│  • REST/WebSocket (src/Application/Server/)      │
├──────────────────────────────────────────────────┤
│          Features Layer                          │
│  • AudioCapture                                  │
│  • VoiceActivityDetection                        │
│  • Transcription (MLX/OpenAI)                    │
│  • WakeWordDetection                             │
├──────────────────────────────────────────────────┤
│          Core & Infrastructure                   │
│  • Command/Event System                          │
│  • Logging & Configuration                       │
└──────────────────────────────────────────────────┘
```

### Key Design Principles

- **Vertical Slice Architecture**: Each feature is self-contained with Commands, Events, Handlers, and Models
- **Dual API Design**: Python API optimized for direct use, Server API optimized for multi-client scenarios
- **Event-Driven**: Features communicate via commands and events rather than direct dependencies (see the sketch after this list)
- **Production Ready**: Thread-safe, lazy initialization, comprehensive error handling
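
The event-driven principle can be illustrated with a toy publish/subscribe bus. This is an illustrative sketch only, not the library's internal API; the class and event names here are hypothetical:

```python
from collections import defaultdict
from typing import Any, Callable


class EventBus:
    """Toy pub/sub bus: features publish events, handlers subscribe by name."""

    def __init__(self) -> None:
        self._handlers: defaultdict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[Any], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: Any) -> None:
        for handler in self._handlers[event]:
            handler(payload)


bus = EventBus()
# e.g. a VAD feature announces a finished speech segment...
bus.subscribe("speech_segment", lambda audio: print(f"transcribe {audio!r}"))
# ...and a transcription feature reacts, with no direct dependency between them.
bus.publish("speech_segment", "segment-001.wav")
```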

## API Documentation

### Python API (realtime_mlx_stt)

```python
from realtime_mlx_stt import STTClient

# Method 1: Modern Client API
client = STTClient(
    openai_api_key="sk-...",     # Optional
    default_engine="mlx_whisper", # or "openai"
    default_language="en"         # or None for auto-detect
)

# Transcribe for fixed duration
for result in client.transcribe(duration=10):
    print(f"{result.text} (confidence: {result.confidence})")

# Streaming with stop word
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Method 2: Session-based API
import time

from realtime_mlx_stt import TranscriptionSession, ModelConfig, VADConfig

session = TranscriptionSession(
    model=ModelConfig(engine="mlx_whisper", language="no"),
    vad=VADConfig(sensitivity=0.8),
    on_transcription=lambda r: print(r.text)
)

with session:
    time.sleep(30)  # Listen for 30 seconds

# Method 3: Simple Transcriber
from realtime_mlx_stt import Transcriber
transcriber = Transcriber(language="es")
text = transcriber.transcribe_from_mic(duration=5)
print(f"You said: {text}")
```

### REST API

```bash
# Start system with profile
curl -X POST http://localhost:8000/api/v1/system/start \
  -H "Content-Type: application/json" \
  -d '{
    "profile": "vad-triggered",
    "custom_config": {
      "transcription": {"language": "fr"},
      "vad": {"sensitivity": 0.7}
    }
  }'

# Get system status
curl http://localhost:8000/api/v1/system/status

# Transcribe audio file
curl -X POST http://localhost:8000/api/v1/transcription/audio \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "base64_encoded_audio_data"}'
```
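
The same endpoints can be driven from Python. Below is a sketch using the third-party `requests` package, assuming the example server is running locally; `sample.wav` is a placeholder file name:

```python
import base64

import requests

BASE = "http://localhost:8000/api/v1"

# Start the system with a profile and per-feature overrides
resp = requests.post(f"{BASE}/system/start", json={
    "profile": "vad-triggered",
    "custom_config": {
        "transcription": {"language": "fr"},
        "vad": {"sensitivity": 0.7},
    },
})
print(resp.json())

# Check system status
print(requests.get(f"{BASE}/system/status").json())

# Transcribe a local audio file (payload is base64-encoded audio)
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")
resp = requests.post(f"{BASE}/transcription/audio", json={"audio_data": audio_b64})
print(resp.json())
```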

### WebSocket Events

```javascript
const ws = new WebSocket('ws://localhost:8000/events');

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    
    switch(data.type) {
        case 'transcription':
            if (data.is_final) {
                console.log(`Final: ${data.text}`);
            } else {
                console.log(`Transcribing: ${data.text}`);
            }
            break;
        case 'wake_word':
            console.log(`Wake word: ${data.wake_word}`);
            break;
    }
};
```
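
A Python counterpart is sketched below using the third-party `websockets` package (`pip install websockets`); the event fields mirror the JavaScript example above:

```python
import asyncio
import json

import websockets


async def listen():
    # Same endpoint the JavaScript client connects to
    async with websockets.connect("ws://localhost:8000/events") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcription" and data.get("is_final"):
                print(f"Final: {data['text']}")
            elif data.get("type") == "wake_word":
                print(f"Wake word: {data['wake_word']}")


asyncio.run(listen())
```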

## Configuration

### Environment Variables

```bash
# API Keys
export OPENAI_API_KEY="sk-..."        # For OpenAI transcription
export PORCUPINE_ACCESS_KEY="..."     # For wake word detection
# Alternative names for Picovoice universal key (same as PORCUPINE_ACCESS_KEY):
# export PICOVOICE_ACCESS_KEY="..."
# export PICOVOICE_API_KEY="..."

# Logging
export LOG_LEVEL="INFO"               # DEBUG, INFO, WARNING, ERROR
export LOG_FORMAT="human"             # human, json, detailed
```
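
From Python, the same variables can be read with `os.environ` and passed to the client explicitly; a minimal sketch using the `STTClient` constructor shown earlier:

```python
import os

from realtime_mlx_stt import STTClient

# Falls back to None if the variable is unset, in which case the
# local MLX engine remains the only available option.
client = STTClient(openai_api_key=os.environ.get("OPENAI_API_KEY"))
```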

### Python Configuration

```python
from realtime_mlx_stt import ModelConfig, VADConfig, WakeWordConfig

# Model configuration
model = ModelConfig(
    engine="mlx_whisper",        # or "openai"
    model="whisper-large-v3-turbo",
    language="en"                # or None for auto-detect
)

# VAD configuration
vad = VADConfig(
    enabled=True,
    sensitivity=0.6,             # 0.0-1.0
    min_speech_duration=0.25,    # seconds
    min_silence_duration=0.1     # seconds
)

# Wake word configuration
# Note: Requires PORCUPINE_ACCESS_KEY environment variable
wake_word = WakeWordConfig(
    words=["jarvis", "computer"],
    sensitivity=0.7,
    timeout=30                   # seconds
)
```

## Testing

The project includes comprehensive tests for each feature and component:

```bash
# Run all tests
python tests/run_tests.py

# Run tests for a specific feature or component
python tests/run_tests.py -f VoiceActivityDetection
python tests/run_tests.py -f Infrastructure
python tests/run_tests.py -f Application  # Server/Client tests

# Run a specific test with verbose output
python tests/run_tests.py -t webrtc_vad_test -v
python tests/run_tests.py -t test_server_module -v

# Test with PYTHONPATH (if imports fail)
PYTHONPATH=/path/to/Realtime_mlx_STT python tests/run_tests.py
```

The Server implementation includes tests for:
- API Controllers (Transcription and System)
- WebSocket connections and event broadcasting
- Configuration and profile management
- Command/Event integration

## Performance

On Apple Silicon (M1/M2/M3), the MLX-optimized Whisper-large-v3-turbo model typically achieves:

- **Batch mode**: ~0.3-0.5x realtime (processes 60 seconds of audio in 20-30 seconds)
- **Streaming mode**: ~0.5-0.7x realtime (processes audio with ~2-3 second latency)

The MLX implementation takes full advantage of the Neural Engine in Apple Silicon chips, providing significantly better performance than CPU-based implementations.
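
As a sanity check on these figures: the real-time factor (RTF) is processing time divided by audio duration, so values below 1.0 are faster than realtime.

```python
# Back-of-envelope check of the batch-mode figure above
audio_seconds = 60.0
processing_seconds = 25.0  # e.g. 60 s of audio processed in 25 s
rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.2f}x realtime")  # 0.42x, within the ~0.3-0.5x range
```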

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Recent Updates

- **New Python API**: Added high-level `realtime_mlx_stt` package with STTClient, TranscriptionSession, and Transcriber
- **Interactive CLI**: New user-friendly CLI at `examples/cli.py` for exploring all features
- **Dual API Architecture**: Python API optimized for direct use, Server API for multi-client scenarios
- **Improved Examples**: Consolidated examples with clear documentation
- **Architecture Documentation**: Added comprehensive architecture documentation
- **OpenAI Integration**: Support for OpenAI's transcription API as alternative to local MLX

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for the base Whisper large-v3-turbo model
- [MLX](https://github.com/ml-explore/mlx) for Apple Silicon optimization
- [RealtimeSTT](https://github.com/KoljaB/RealtimeSTT) for the original audio processing concepts
- [Picovoice Porcupine](https://picovoice.ai/platform/porcupine/) for wake word detection
- [Hugging Face](https://huggingface.co) for model distribution infrastructure
