RealTimeSTT


- **Name**: RealTimeSTT
- **Version**: 0.1.16
- **Home page**: https://github.com/KoljaB/RealTimeSTT
- **Summary**: A fast Voice Activity Detection and Transcription System
- **Upload time**: 2024-06-02 09:46:05
- **Author**: Kolja Beigel
- **Requires Python**: >=3.6
- **Keywords**: real-time, audio, transcription, speech-to-text, voice-activity-detection, vad, real-time-transcription, ambient-noise-detection, microphone-input, faster_whisper, speech-recognition, voice-assistants, audio-processing, buffered-transcription, pyaudio, ambient-noise-level, voice-deactivity
- **Requirements**: PyAudio, faster-whisper, pvporcupine, webrtcvad, halo, torch, torchaudio, scipy, websockets
            
# RealtimeSTT

*Easy-to-use, low-latency speech-to-text library for realtime applications*

## About the Project

RealtimeSTT listens to the microphone and transcribes voice into text.  

> **Hint:** *<strong>Check out [Linguflex](https://github.com/KoljaB/Linguflex)</strong>, the original project from which RealtimeSTT is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.*

It's ideal for:

- **Voice Assistants**
- Applications requiring **fast and precise** speech-to-text conversion

https://github.com/KoljaB/RealtimeSTT/assets/7604638/207cb9a2-4482-48e7-9d2b-0722c3ee6d14

### Updates

Latest Version: v0.1.15

See [release history](https://github.com/KoljaB/RealtimeSTT/releases).

> **Hint:** *Since the library now uses the `multiprocessing` module, make sure to wrap your entry point in the `if __name__ == '__main__':` protection to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation of why this is important, see the [official Python documentation on `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming).*

### Features

- **Voice Activity Detection**: Automatically detects when you start and stop speaking.
- **Realtime Transcription**: Transforms speech to text in real-time.
- **Wake Word Activation**: Can activate upon detecting a designated wake word.

> **Hint**: *Check out [RealtimeTTS](https://github.com/KoljaB/RealtimeTTS), the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.*

## Tech Stack

This library uses:

- **Voice Activity Detection**
  - [WebRTCVAD](https://github.com/wiseman/py-webrtcvad) for initial voice activity detection.
  - [SileroVAD](https://github.com/snakers4/silero-vad) for more accurate verification.
- **Speech-To-Text**
  - [Faster_Whisper](https://github.com/guillaumekln/faster-whisper) for instant (GPU-accelerated) transcription.
- **Wake Word Detection**
  - [Porcupine](https://github.com/Picovoice/porcupine) for wake word detection.

*These components represent the "industry standard" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.*


## Installation

```bash
pip install RealtimeSTT
```

This will install all the necessary dependencies, including a **CPU-only** version of PyTorch.

Although RealtimeSTT can run on a CPU-only installation (use a small model such as "tiny" or "base" in that case), you will get a much better experience with:

### GPU Support with CUDA (recommended)

Additional steps are needed for a **GPU-optimized** installation. These steps are recommended for those who require **better performance** and have a compatible NVIDIA GPU.

> **Note**: *To check if your NVIDIA GPU supports CUDA, visit the [official CUDA GPUs list](https://developer.nvidia.com/cuda-gpus).*

To use RealtimeSTT with GPU support via CUDA, follow these steps:

1. **Install NVIDIA CUDA Toolkit 11.8**:
    - Visit [NVIDIA CUDA Toolkit Archive](https://developer.nvidia.com/cuda-11-8-0-download-archive).
    - Select operating system and version.
    - Download and install the software.

2. **Install NVIDIA cuDNN 8.7.0 for CUDA 11.x**:
    - Visit [NVIDIA cuDNN Archive](https://developer.nvidia.com/rdp/cudnn-archive).
    - Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
    - Download and install the software.

3. **Install ffmpeg**:

    > **Note**: *Installation of ffmpeg might not actually be needed to operate RealtimeSTT.* <sup>*Thanks to jgilbert2017 for pointing this out.*</sup>

    You can download an installer for your OS from the [ffmpeg Website](https://ffmpeg.org/download.html).  
    
    Or use a package manager:

    - **On Ubuntu or Debian**:
        ```bash
        sudo apt update && sudo apt install ffmpeg
        ```

    - **On Arch Linux**:
        ```bash
        sudo pacman -S ffmpeg
        ```

    - **On MacOS using Homebrew** ([https://brew.sh/](https://brew.sh/)):
        ```bash
        brew install ffmpeg
        ```

    - **On Windows using Winget** ([official documentation](https://learn.microsoft.com/en-us/windows/package-manager/winget/)):
        ```bash
        winget install Gyan.FFmpeg
        ```
        
    - **On Windows using Chocolatey** ([https://chocolatey.org/](https://chocolatey.org/)):
        ```bash
        choco install ffmpeg
        ```

    - **On Windows using Scoop** ([https://scoop.sh/](https://scoop.sh/)):
        ```bash
        scoop install ffmpeg
        ```    

4. **Install PyTorch with CUDA support**:
    ```bash
    pip uninstall torch
    pip install torch==2.2.2+cu118 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
    ```
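
After step 4, it can help to confirm that the installed PyTorch build actually sees the GPU. The following is a minimal sketch and not part of RealtimeSTT: `describe_torch_device` is a hypothetical helper name, and it relies only on PyTorch's `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_torch_device() -> str:
    """Report whether the installed PyTorch build can use a CUDA GPU."""
    try:
        import torch  # imported lazily so the check also works without torch
    except ImportError:
        return "torch is not installed"

    if torch.cuda.is_available():
        # Device 0 is the first GPU visible to PyTorch.
        name = torch.cuda.get_device_name(0)
        return f"CUDA available: {name} (torch {torch.__version__})"
    return f"CPU only (torch {torch.__version__})"


print(describe_torch_device())
```

If this reports "CPU only" after the steps above, the CPU build of PyTorch is probably still the one being imported.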

## Quick Start

Basic usage:

### Manual Recording

Recording is started and stopped manually.

```python
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder()
recorder.start()
recorder.stop()
print(recorder.text())
```

### Automatic Recording

Recording based on voice activity detection.

```python
from RealtimeSTT import AudioToTextRecorder

with AudioToTextRecorder() as recorder:
    print(recorder.text())
```

When running `recorder.text` in a loop, it is recommended to use a callback, which allows the transcription to run asynchronously:

```python
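from RealtimeSTT import AudioToTextRecorder

def process_text(text):
    print(text)

if __name__ == '__main__':
    recorder = AudioToTextRecorder()
    while True:
        recorder.text(process_text)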
```

### Wakewords

Keyword activation before detecting voice. Write the comma-separated list of your desired activation keywords into the `wake_words` parameter. You can choose wake words from this list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.

```python
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(wake_words="jarvis")

print('Say "Jarvis" then speak.')
print(recorder.text())
```
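
Because `wake_words` is a plain comma-separated string, a typo in a keyword is easy to miss. The sketch below checks a list of words against the keywords listed above and joins them into that string. `build_wake_words` is a hypothetical helper for illustration, not part of RealtimeSTT.

```python
# Wake words listed in this README.
SUPPORTED_WAKE_WORDS = {
    "alexa", "americano", "blueberry", "bumblebee", "computer",
    "grapefruits", "grasshopper", "hey google", "hey siri", "jarvis",
    "ok google", "picovoice", "porcupine", "terminator",
}


def build_wake_words(words):
    """Validate wake words and join them into a comma-separated string."""
    cleaned = [word.strip().lower() for word in words]
    unknown = [word for word in cleaned if word not in SUPPORTED_WAKE_WORDS]
    if unknown:
        raise ValueError(f"Unsupported wake words: {', '.join(unknown)}")
    return ",".join(cleaned)


print(build_wake_words(["Jarvis", "computer"]))  # prints: jarvis,computer
```

The returned string can then be passed as `AudioToTextRecorder(wake_words=...)`.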

### Callbacks

You can set callback functions to be executed on different events (see [Configuration](#configuration)):

```python
from RealtimeSTT import AudioToTextRecorder

def my_start_callback():
    print("Recording started!")

def my_stop_callback():
    print("Recording stopped!")

recorder = AudioToTextRecorder(on_recording_start=my_start_callback,
                               on_recording_stop=my_stop_callback)
```

### Feed chunks

If you don't want to use the local microphone, set the `use_microphone` parameter to `False` and provide raw PCM audio chunks in 16-bit mono (sample rate 16000 Hz) with this method:

```python
recorder.feed_audio(audio_chunk)
```
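
As an illustration, the chunks could come from a WAV file read with Python's standard `wave` module. This is a sketch under assumptions: `iter_pcm_chunks` and `feed_wav` are hypothetical helper names, the chunk size of 1024 frames is arbitrary, and `recorder` is expected to be an `AudioToTextRecorder` created with `use_microphone=False`.

```python
import wave


def iter_pcm_chunks(wav_path, frames_per_chunk=1024):
    """Yield raw 16-bit mono PCM chunks (16000 Hz) from a WAV file."""
    with wave.open(wav_path, "rb") as wav_file:
        # Reject files that do not match the format described above.
        if wav_file.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav_file.getframerate() != 16000:
            raise ValueError("expected a 16000 Hz sample rate")

        while True:
            chunk = wav_file.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


def feed_wav(recorder, wav_path):
    """Feed every chunk of the file to the recorder via feed_audio."""
    for chunk in iter_pcm_chunks(wav_path):
        recorder.feed_audio(chunk)
```

Validating the format up front avoids feeding audio with the wrong sample rate or channel count, which would otherwise be interpreted as 16 kHz mono data.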

### Shutdown

You can shut down the recorder safely by using the context manager protocol:

```python
with AudioToTextRecorder() as recorder:
    [...]
```

Or you can call the shutdown method manually (if using "with" is not feasible):

```python
recorder.shutdown()
```
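
When calling `shutdown()` manually, a `try`/`finally` block gives a similar guarantee to the `with` statement: the recorder is shut down even if transcription raises. A minimal sketch, where `transcribe_once` is a hypothetical helper name:

```python
def transcribe_once(recorder):
    """Return one transcription and always shut the recorder down afterwards."""
    try:
        return recorder.text()
    finally:
        # Runs on success and on errors (including KeyboardInterrupt).
        recorder.shutdown()
```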

## Testing the Library

The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeSTT library.

Test scripts that depend on the RealtimeTTS library may require you to enter your Azure service region within the script.
When using OpenAI-, Azure- or Elevenlabs-related demo scripts, provide the API keys in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY and ELEVENLABS_API_KEY (see [RealtimeTTS](https://github.com/KoljaB/RealtimeTTS)).

- **simple_test.py**
    - **Description**: A "hello world" styled demonstration of the library's simplest usage.

- **realtimestt_test.py**
    - **Description**: Showcases live transcription.

- **wakeword_test.py**
    - **Description**: A demonstration of the wakeword activation.

- **translator.py**
    - **Dependencies**: Run `pip install openai realtimetts`.
    - **Description**: Real-time translations into six different languages.

- **openai_voice_interface.py**
    - **Dependencies**: Run `pip install openai realtimetts`.
    - **Description**: Wake-word-activated, voice-based user interface to the OpenAI API.

- **advanced_talk.py**
    - **Dependencies**: Run `pip install openai keyboard realtimetts`.
    - **Description**: Choose TTS engine and voice before starting AI conversation.

- **minimalistic_talkbot.py**
    - **Dependencies**: Run `pip install openai realtimetts`.
    - **Description**: A basic talkbot in 20 lines of code.
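
Before launching one of the demo scripts, the environment variables named above can be checked with a few lines of Python. This is a sketch only: `missing_api_keys` is a hypothetical helper, and since not every script needs all three keys, you may want to pass only the names relevant to the script you run.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "AZURE_SPEECH_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(required=REQUIRED_KEYS, environ=None):
    """Return the names of required environment variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```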

The example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.

## Configuration

### Initialization Parameters for `AudioToTextRecorder`

When you initialize the `AudioToTextRecorder` class, you have various options to customize its behavior.

#### General Parameters

- **model** (str, default="tiny"): Model size or path for transcription.
    - Options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
    - Note: If a size is provided, the model will be downloaded from the Hugging Face Hub.

- **language** (str, default=""): Language code for transcription. If left empty, the model will try to auto-detect the language. Supported language codes are listed in the [Whisper Tokenizer library](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py).

- **compute_type** (str, default="default"): Specifies the type of computation to be used for transcription. See [Whisper Quantization](https://opennmt.net/CTranslate2/quantization.html).

- **input_device_index** (int, default=0): Audio Input Device Index to use.

- **gpu_device_index** (int, default=0): GPU Device Index to use. The model can also be loaded on multiple GPUs by passing a list of IDs (e.g. [0, 1, 2, 3]).

- **on_recording_start**: A callable function triggered when recording starts.

- **on_recording_stop**: A callable function triggered when recording ends.

- **on_transcription_start**: A callable function triggered when transcription starts.

- **ensure_sentence_starting_uppercase** (bool, default=True): Ensures that every sentence detected by the algorithm starts with an uppercase letter.

- **ensure_sentence_ends_with_period** (bool, default=True): Ensures that every sentence that doesn't end with punctuation such as "?" or "!" ends with a period.

- **use_microphone** (bool, default=True): Uses the local microphone for transcription. Set to False if you want to provide audio chunks with the `feed_audio` method.

- **spinner** (bool, default=True): Provides a spinner animation text with information about the current recorder state.

- **level** (int, default=logging.WARNING): Logging level.

- **handle_buffer_overflow** (bool, default=True): If set, the system will log a warning when an input overflow occurs during recording and remove the data from the buffer.

- **beam_size** (int, default=5): The beam size to use for beam search decoding.

- **initial_prompt** (str or iterable of int, default=None): Initial prompt to be fed to the transcription models.

- **suppress_tokens** (list of int, default=[-1]): Tokens to be suppressed from the transcription output.

- **on_recorded_chunk**: A callback function that is triggered when a chunk of audio is recorded. The chunk data is passed as its argument.

- **debug_mode** (bool, default=False): If set, the system prints additional debug information to the console.
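
As a sketch of how these parameters fit together, the dictionary below collects a few of them as keyword arguments. The chosen values are illustrative assumptions (for example an English-only setup on a CPU), not recommendations from the library.

```python
import logging

# Keyword arguments for AudioToTextRecorder, using only parameters documented above.
general_options = {
    "model": "base.en",       # small English-only model, suited to CPU-only setups
    "language": "en",         # skip language auto-detection
    "compute_type": "default",
    "spinner": False,         # no spinner animation, e.g. when output is logged
    "level": logging.INFO,    # more verbose than the default logging.WARNING
    "beam_size": 5,
    "ensure_sentence_starting_uppercase": True,
    "ensure_sentence_ends_with_period": True,
}
```

The dictionary can then be unpacked into the constructor with `AudioToTextRecorder(**general_options)`.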

#### Real-time Transcription Parameters

> **Note**: *When enabling realtime transcription, a GPU installation is strongly advised. Using realtime transcription may create high GPU loads.*

- **enable_realtime_transcription** (bool, default=False): Enables or disables real-time transcription of audio. When set to True, the audio will be transcribed continuously as it is being recorded.

- **realtime_model_type** (str, default="tiny"): Specifies the size or path of the machine learning model to be used for real-time transcription.
    - Valid options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.

- **realtime_processing_pause** (float, default=0.2): Specifies the pause in seconds after each chunk of audio is transcribed. Lower values will result in more "real-time" (frequent) transcription updates but may increase computational load.

- **on_realtime_transcription_update**: A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with the newly transcribed text as its argument.

- **on_realtime_transcription_stabilized**: A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with a higher-quality, stabilized text as its argument.

- **beam_size_realtime** (int, default=3): The beam size to use for real-time transcription beam search decoding.
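
The two real-time callbacks can share state through a small object. The following is a sketch under assumptions: `RealtimeTranscript` is a hypothetical class, and the option values simply repeat the defaults listed above, except for `enable_realtime_transcription`.

```python
class RealtimeTranscript:
    """Collects real-time transcription updates delivered through callbacks."""

    def __init__(self):
        self.latest = ""      # most recent (possibly unstable) text
        self.stabilized = ""  # most recent stabilized text

    def on_update(self, text):
        self.latest = text

    def on_stabilized(self, text):
        self.stabilized = text

    def recorder_options(self):
        """Keyword arguments enabling real-time transcription with these callbacks."""
        return {
            "enable_realtime_transcription": True,
            "realtime_model_type": "tiny",
            "realtime_processing_pause": 0.2,
            "beam_size_realtime": 3,
            "on_realtime_transcription_update": self.on_update,
            "on_realtime_transcription_stabilized": self.on_stabilized,
        }
```

A recorder could then be created with `AudioToTextRecorder(**transcript.recorder_options())`, after which `transcript.latest` and `transcript.stabilized` hold the most recent texts.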

#### Voice Activation Parameters

- **silero_sensitivity** (float, default=0.6): Sensitivity for Silero's voice activity detection, ranging from 0 (least sensitive) to 1 (most sensitive).

- **silero_use_onnx** (bool, default=False): Uses Silero's pre-trained model in the ONNX (Open Neural Network Exchange) format instead of the PyTorch format. Recommended for faster performance.

- **webrtc_sensitivity** (int, default=3): Sensitivity for the WebRTC voice activity detection engine, ranging from 0 (least aggressive, most sensitive) to 3 (most aggressive, least sensitive).

- **post_speech_silence_duration** (float, default=0.2): Duration in seconds of silence that must follow speech before the recording is considered to be completed. This ensures that any brief pauses during speech don't prematurely end the recording.

- **min_gap_between_recordings** (float, default=1.0): Specifies the minimum time interval in seconds that should exist between the end of one recording session and the beginning of another to prevent rapid consecutive recordings.

- **min_length_of_recording** (float, default=1.0): Specifies the minimum duration in seconds that a recording session should last to ensure meaningful audio capture, preventing excessively short or fragmented recordings.

- **pre_recording_buffer_duration** (float, default=0.2): The time span, in seconds, during which audio is buffered prior to formal recording. This helps counterbalance the latency inherent in speech activity detection, ensuring no initial audio is missed.

- **on_vad_detect_start**: A callable function triggered when the system starts to listen for voice activity.

- **on_vad_detect_stop**: A callable function triggered when the system stops listening for voice activity.
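
To get a feel for the durations above, they can be converted into amounts of raw audio. The sketch below assumes the 16-bit mono, 16000 Hz format described under "Feed chunks"; `buffer_size_bytes` is a hypothetical helper, and the library's internal buffering may differ.

```python
def buffer_size_bytes(duration_s, sample_rate=16000, sample_width=2, channels=1):
    """Bytes of raw PCM audio covering `duration_s` seconds."""
    frames = round(duration_s * sample_rate)
    return frames * sample_width * channels


print(buffer_size_bytes(0.2))  # pre_recording_buffer_duration default: 6400
print(buffer_size_bytes(1.0))  # min_length_of_recording default: 32000
```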

#### Wake Word Parameters

- **wake_words** (str, default=""): Wake words for initiating the recording. Multiple wake words can be provided as a comma-separated string. Supported wake words are: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.

- **wake_words_sensitivity** (float, default=0.6): Sensitivity level for wake word detection (0 for least sensitive, 1 for most sensitive).

- **wake_word_activation_delay** (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.

- **wake_word_timeout** (float, default=5): Duration in seconds after a wake word is recognized. If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.

- **on_wakeword_detected**: A callable function triggered when a wake word is detected.

- **on_wakeword_timeout**: A callable function triggered when the system goes back to an inactive state because no speech was detected after wake word activation.

- **on_wakeword_detection_start**: A callable function triggered when the system starts listening for wake words.

- **on_wakeword_detection_end**: A callable function triggered when the system stops listening for wake words (e.g. because of a timeout or a detected wake word).
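
The wake word parameters and callbacks can be bundled in one place, for instance to log each state change. This is a sketch: `wake_word_options` is a hypothetical helper, and the callbacks accept arbitrary positional arguments because this README does not state what, if anything, is passed to them.

```python
def wake_word_options(wake_words="jarvis", timeout_s=5.0, log=print):
    """Keyword arguments for wake word activation with simple event logging."""
    return {
        "wake_words": wake_words,
        "wake_words_sensitivity": 0.6,
        "wake_word_timeout": timeout_s,
        "on_wakeword_detected": lambda *args: log("wake word detected"),
        "on_wakeword_timeout": lambda *args: log("no speech after wake word"),
        "on_wakeword_detection_start": lambda *args: log("listening for wake words"),
        "on_wakeword_detection_end": lambda *args: log("stopped listening for wake words"),
    }
```

Passing a different `log` callable (for example `logging.getLogger(__name__).info`) redirects the messages without touching the callbacks themselves.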

## Contribution

Contributions are always welcome! 

Shoutout to [Steven Linn](https://github.com/stevenlafl) for providing Docker support.

## License

MIT

## Author

Kolja Beigel  
Email: kolja.beigel@web.de  
[GitHub](https://github.com/KoljaB/RealtimeSTT)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/KoljaB/RealTimeSTT",
    "name": "RealTimeSTT",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "real-time, audio, transcription, speech-to-text, voice-activity-detection, VAD, real-time-transcription, ambient-noise-detection, microphone-input, faster_whisper, speech-recognition, voice-assistants, audio-processing, buffered-transcription, pyaudio, ambient-noise-level, voice-deactivity",
    "author": "Kolja Beigel",
    "author_email": "kolja.beigel@web.de",
    "download_url": "https://files.pythonhosted.org/packages/1c/de/af1b596d35bcf654e27d5467b9d3ba92d075e74ffab67f7e619e7b9e5988/RealTimeSTT-0.1.16.tar.gz",
    "platform": null,
    "description": "\r\n# RealtimeSTT\r\n\r\n*Easy-to-use, low-latency speech-to-text library for realtime applications*\r\n\r\n## About the Project\r\n\r\nRealtimeSTT listens to the microphone and transcribes voice into text.  \r\n\r\n> **Hint:** *<strong>Check out [Linguflex](https://github.com/KoljaB/Linguflex)</strong>, the original project from which RealtimeSTT is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.*\r\n\r\nIt's ideal for:\r\n\r\n- **Voice Assistants**\r\n- Applications requiring **fast and precise** speech-to-text conversion\r\n\r\nhttps://github.com/KoljaB/RealtimeSTT/assets/7604638/207cb9a2-4482-48e7-9d2b-0722c3ee6d14\r\n\r\n### Updates\r\n\r\nLatest Version: v0.1.15\r\n\r\nSee [release history](https://github.com/KoljaB/RealtimeSTT/releases).\r\n\r\n> **Hint:** *Since we use the `multiprocessing` module now, ensure to include the `if __name__ == '__main__':` protection in your code to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation on why this is important, visit the [official Python documentation on `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming).*\r\n\r\n### Features\r\n\r\n- **Voice Activity Detection**: Automatically detects when you start and stop speaking.\r\n- **Realtime Transcription**: Transforms speech to text in real-time.\r\n- **Wake Word Activation**: Can activate upon detecting a designated wake word.\r\n\r\n> **Hint**: *Check out [RealtimeTTS](https://github.com/KoljaB/RealtimeTTS), the output counterpart of this library, for text-to-voice capabilities. 
Together, they form a powerful realtime audio wrapper around large language models.*\r\n\r\n## Tech Stack\r\n\r\nThis library uses:\r\n\r\n- **Voice Activity Detection**\r\n  - [WebRTCVAD](https://github.com/wiseman/py-webrtcvad) for initial voice activity detection.\r\n  - [SileroVAD](https://github.com/snakers4/silero-vad) for more accurate verification.\r\n- **Speech-To-Text**\r\n  - [Faster_Whisper](https://github.com/guillaumekln/faster-whisper) for instant (GPU-accelerated) transcription.\r\n- **Wake Word Detection**\r\n  - [Porcupine](https://github.com/Picovoice/porcupine) for wake word detection.\r\n\r\n*These components represent the \"industry standard\" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.*\r\n\r\n\r\n## Installation\r\n\r\n```bash\r\npip install RealtimeSTT\r\n```\r\n\r\nThis will install all the necessary dependencies, including a **CPU support only** version of PyTorch.\r\n\r\nAlthough it is possible to run RealtimeSTT with a CPU installation only (use a small model like \"tiny\" or \"base\" in this case) you will get way better experience using:\r\n\r\n### GPU Support with CUDA (recommended)\r\n\r\nAdditional steps are needed for a **GPU-optimized** installation. These steps are recommended for those who require **better performance** and have a compatible NVIDIA GPU.\r\n\r\n> **Note**: *To check if your NVIDIA GPU supports CUDA, visit the [official CUDA GPUs list](https://developer.nvidia.com/cuda-gpus).*\r\n\r\nTo use RealtimeSTT with GPU support via CUDA please follow these steps:\r\n\r\n1. **Install NVIDIA CUDA Toolkit 11.8**:\r\n    - Visit [NVIDIA CUDA Toolkit Archive](https://developer.nvidia.com/cuda-11-8-0-download-archive).\r\n    - Select operating system and version.\r\n    - Download and install the software.\r\n\r\n2. 
**Install NVIDIA cuDNN 8.7.0 for CUDA 11.x**:\r\n    - Visit [NVIDIA cuDNN Archive](https://developer.nvidia.com/rdp/cudnn-archive).\r\n    - Click on \"Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x\".\r\n    - Download and install the software.\r\n\r\n3. **Install ffmpeg**:\r\n\r\n    > **Note**: *Installation of ffmpeg might not actually be needed to operate RealtimeSTT* <sup> *thanks to jgilbert2017 for pointing this out</sup>\r\n\r\n    You can download an installer for your OS from the [ffmpeg Website](https://ffmpeg.org/download.html).  \r\n    \r\n    Or use a package manager:\r\n\r\n    - **On Ubuntu or Debian**:\r\n        ```bash\r\n        sudo apt update && sudo apt install ffmpeg\r\n        ```\r\n\r\n    - **On Arch Linux**:\r\n        ```bash\r\n        sudo pacman -S ffmpeg\r\n        ```\r\n\r\n    - **On MacOS using Homebrew** ([https://brew.sh/](https://brew.sh/)):\r\n        ```bash\r\n        brew install ffmpeg\r\n        ```\r\n\r\n    - **On Windows using Winget** [official documentation](https://learn.microsoft.com/en-us/windows/package-manager/winget/) :\r\n        ```bash\r\n        winget install Gyan.FFmpeg\r\n        ```\r\n        \r\n    - **On Windows using Chocolatey** ([https://chocolatey.org/](https://chocolatey.org/)):\r\n        ```bash\r\n        choco install ffmpeg\r\n        ```\r\n\r\n    - **On Windows using Scoop** ([https://scoop.sh/](https://scoop.sh/)):\r\n        ```bash\r\n        scoop install ffmpeg\r\n        ```    \r\n\r\n4. 
**Install PyTorch with CUDA support**:\r\n    ```bash\r\n    pip uninstall torch\r\n    pip install torch==2.2.2+cu118 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118\r\n    ```\r\n\r\n## Quick Start\r\n\r\nBasic usage:\r\n\r\n### Manual Recording\r\n\r\nStart and stop of recording are manually triggered.\r\n\r\n```python\r\nrecorder.start()\r\nrecorder.stop()\r\nprint(recorder.text())\r\n```\r\n\r\n### Automatic Recording\r\n\r\nRecording based on voice activity detection.\r\n\r\n```python\r\nwith AudioToTextRecorder() as recorder:\r\n    print(recorder.text())\r\n```\r\n\r\nWhen running recorder.text in a loop it is recommended to use a callback, allowing the transcription to be run asynchronously:\r\n\r\n```python\r\ndef process_text(text):\r\n    print (text)\r\n    \r\nwhile True:\r\n    recorder.text(process_text)\r\n```\r\n\r\n### Wakewords\r\n\r\nKeyword activation before detecting voice. Write the comma-separated list of your desired activation keywords into the wake_words parameter. You can choose wake words from these list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator. 
\r\n\r\n```python\r\nrecorder = AudioToTextRecorder(wake_words=\"jarvis\")\r\n\r\nprint('Say \"Jarvis\" then speak.')\r\nprint(recorder.text())\r\n```\r\n\r\n### Callbacks\r\n\r\nYou can set callback functions to be executed on different events (see [Configuration](#configuration)) :\r\n\r\n```python\r\ndef my_start_callback():\r\n    print(\"Recording started!\")\r\n\r\ndef my_stop_callback():\r\n    print(\"Recording stopped!\")\r\n\r\nrecorder = AudioToTextRecorder(on_recording_start=my_start_callback,\r\n                               on_recording_stop=my_stop_callback)\r\n```\r\n\r\n### Feed chunks\r\n\r\nIf you don't want to use the local microphone set use_microphone parameter to false and provide raw PCM audiochunks in 16-bit mono (samplerate 16000) with this method:\r\n\r\n```python\r\nrecorder.feed_audio(audio_chunk)\r\n```\r\n\r\n### Shutdown\r\n\r\nYou can shutdown the recorder safely by using the context manager protocol:\r\n\r\n```python\r\nwith AudioToTextRecorder() as recorder:\r\n    [...]\r\n```\r\n\r\nOr you can call the shutdown method manually (if using \"with\" is not feasible):\r\n\r\n```python\r\nrecorder.shutdown()\r\n```\r\n\r\n## Testing the Library\r\n\r\nThe test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeTTS library.\r\n\r\nTest scripts depending on RealtimeTTS library may require you to enter your azure service region within the script. 
\r\nWhen using OpenAI-, Azure- or Elevenlabs-related demo scripts the API Keys should be provided in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY and ELEVENLABS_API_KEY (see [RealtimeTTS](https://github.com/KoljaB/RealtimeTTS))\r\n\r\n- **simple_test.py**\r\n    - **Description**: A \"hello world\" styled demonstration of the library's simplest usage.\r\n\r\n- **realtimestt_test.py**\r\n    - **Description**: Showcasing live-transcription.\r\n\r\n- **wakeword_test.py**\r\n    - **Description**: A demonstration of the wakeword activation.\r\n\r\n- **translator.py**\r\n    - **Dependencies**: Run `pip install openai realtimetts`.\r\n    - **Description**: Real-time translations into six different languages.\r\n\r\n- **openai_voice_interface.py**\r\n    - **Dependencies**: Run `pip install openai realtimetts`.\r\n    - **Description**: Wake word activated and voice based user interface to the OpenAI API.\r\n\r\n- **advanced_talk.py**\r\n    - **Dependencies**: Run `pip install openai keyboard realtimetts`.\r\n    - **Description**: Choose TTS engine and voice before starting AI conversation.\r\n\r\n- **minimalistic_talkbot.py**\r\n    - **Dependencies**: Run `pip install openai realtimetts`.\r\n    - **Description**: A basic talkbot in 20 lines of code.\r\n\r\nThe example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.\r\n\r\n## Configuration\r\n\r\n### Initialization Parameters for `AudioToTextRecorder`\r\n\r\nWhen you initialize the `AudioToTextRecorder` class, you have various options to customize its behavior.\r\n\r\n#### General Parameters\r\n\r\n- **model** (str, default=\"tiny\"): Model size or path for transcription.\r\n    - Options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.\r\n    - Note: If a size is provided, the model will be downloaded from the Hugging Face Hub.\r\n\r\n- **language** (str, default=\"\"): Language code for 
transcription. If left empty, the model will try to auto-detect the language. Supported language codes are listed in [Whisper Tokenizer library](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py).\r\n\r\n- **compute_type** (str, default=\"default\"): Specifies the type of computation to be used for transcription. See [Whisper Quantization](https://opennmt.net/CTranslate2/quantization.html)\r\n\r\n- **input_device_index** (int, default=0): Audio Input Device Index to use.\r\n\r\n- **gpu_device_index** (int, default=0): GPU Device Index to use. The model can also be loaded on multiple GPUs by passing a list of IDs (e.g. [0, 1, 2, 3]).\r\n\r\n- **on_recording_start**: A callable function triggered when recording starts.\r\n\r\n- **on_recording_stop**: A callable function triggered when recording ends.\r\n\r\n- **on_transcription_start**: A callable function triggered when transcription starts.\r\n\r\n- **ensure_sentence_starting_uppercase** (bool, default=True): Ensures that every sentence detected by the algorithm starts with an uppercase letter.\r\n\r\n- **ensure_sentence_ends_with_period** (bool, default=True): Ensures that every sentence that doesn't end with punctuation such as \"?\", \"!\" ends with a period\r\n\r\n- **use_microphone** (bool, default=True): Usage of local microphone for transcription. 
Set to False if you want to provide chunks with feed_audio method.\r\n\r\n- **spinner** (bool, default=True): Provides a spinner animation text with information about the current recorder state.\r\n\r\n- **level** (int, default=logging.WARNING): Logging level.\r\n\r\n- **handle_buffer_overflow** (bool, default=True): If set, the system will log a warning when an input overflow occurs during recording and remove the data from the buffer.\r\n\r\n- **beam_size** (int, default=5): The beam size to use for beam search decoding.\r\n\r\n- **initial_prompt** (str or iterable of int, default=None): Initial prompt to be fed to the transcription models.\r\n\r\n- **suppress_tokens** (list of int, default=[-1]): Tokens to be suppressed from the transcription output.\r\n\r\n- **on_recorded_chunk**: A callback function that is triggered when a chunk of audio is recorded. Submits the chunk data as parameter.\r\n\r\n- **debug_mode** (bool, default=False): If set, the system prints additional debug information to the console.\r\n\r\n#### Real-time Transcription Parameters\r\n\r\n> **Note**: *When enabling realtime description a GPU installation is strongly advised. Using realtime transcription may create high GPU loads.*\r\n\r\n- **enable_realtime_transcription** (bool, default=False): Enables or disables real-time transcription of audio. When set to True, the audio will be transcribed continuously as it is being recorded.\r\n\r\n- **realtime_model_type** (str, default=\"tiny\"): Specifies the size or path of the machine learning model to be used for real-time transcription.\r\n    - Valid options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.\r\n\r\n- **realtime_processing_pause** (float, default=0.2): Specifies the time interval in seconds after a chunk of audio gets transcribed. 
Lower values will result in more \"real-time\" (frequent) transcription updates but may increase computational load.\r\n\r\n- **on_realtime_transcription_update**: A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with the newly transcribed text as its argument.\r\n\r\n- **on_realtime_transcription_stabilized**: A callback function that is triggered whenever there's an update in the real-time transcription and returns a higher quality, stabilized text as its argument.\r\n\r\n- **beam_size_realtime** (int, default=3): The beam size to use for real-time transcription beam search decoding.\r\n\r\n#### Voice Activation Parameters\r\n\r\n- **silero_sensitivity** (float, default=0.6): Sensitivity for Silero's voice activity detection ranging from 0 (least sensitive) to 1 (most sensitive). Default is 0.6.\r\n\r\n- **silero_use_onnx** (bool, default=False): Enables usage of the pre-trained model from Silero in the ONNX (Open Neural Network Exchange) format instead of the PyTorch format. Default is False. Recommended for faster performance.\r\n\r\n- **webrtc_sensitivity** (int, default=3): Sensitivity for the WebRTC Voice Activity Detection engine ranging from 0 (least aggressive / most sensitive) to 3 (most aggressive, least sensitive). Default is 3.\r\n\r\n- **post_speech_silence_duration** (float, default=0.2): Duration in seconds of silence that must follow speech before the recording is considered to be completed. 
This ensures that any brief pauses during speech don't prematurely end the recording.\r\n\r\n- **min_gap_between_recordings** (float, default=1.0): Specifies the minimum time interval in seconds that should exist between the end of one recording session and the beginning of another to prevent rapid consecutive recordings.\r\n\r\n- **min_length_of_recording** (float, default=1.0): Specifies the minimum duration in seconds that a recording session should last to ensure meaningful audio capture, preventing excessively short or fragmented recordings.\r\n\r\n- **pre_recording_buffer_duration** (float, default=0.2): The time span, in seconds, during which audio is buffered prior to formal recording. This helps counterbalance the latency inherent in speech activity detection, ensuring no initial audio is missed.\r\n\r\n- **on_vad_detect_start**: A callable function triggered when the system starts listening for voice activity.\r\n\r\n- **on_vad_detect_stop**: A callable function triggered when the system stops listening for voice activity.\r\n\r\n#### Wake Word Parameters\r\n\r\n- **wake_words** (str, default=\"\"): Wake words for initiating the recording. Multiple wake words can be provided as a comma-separated string. Supported wake words are: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.\r\n\r\n- **wake_words_sensitivity** (float, default=0.6): Sensitivity level for wake word detection (0 for least sensitive, 1 for most sensitive).\r\n\r\n- **wake_word_activation_delay** (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.\r\n\r\n- **wake_word_timeout** (float, default=5): Duration in seconds after a wake word is recognized. 
If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.\r\n\r\n- **on_wakeword_detected**: A callable function triggered when a wake word is detected.\r\n\r\n- **on_wakeword_timeout**: A callable function triggered when the system returns to an inactive state because no speech was detected after wake word activation.\r\n\r\n- **on_wakeword_detection_start**: A callable function triggered when the system starts listening for wake words.\r\n\r\n- **on_wakeword_detection_end**: A callable function triggered when the system stops listening for wake words (e.g. because of a timeout or because a wake word was detected).\r\n\r\n## Contribution\r\n\r\nContributions are always welcome! \r\n\r\nShoutout to [Steven Linn](https://github.com/stevenlafl) for providing Docker support. \r\n\r\n## License\r\n\r\nMIT\r\n\r\n## Author\r\n\r\nKolja Beigel  \r\nEmail: kolja.beigel@web.de  \r\n[GitHub](https://github.com/KoljaB/RealtimeSTT)\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A fast Voice Activity Detection and Transcription System",
    "version": "0.1.16",
    "project_urls": {
        "Homepage": "https://github.com/KoljaB/RealTimeSTT"
    },
    "split_keywords": [
        "real-time",
        " audio",
        " transcription",
        " speech-to-text",
        " voice-activity-detection",
        " vad",
        " real-time-transcription",
        " ambient-noise-detection",
        " microphone-input",
        " faster_whisper",
        " speech-recognition",
        " voice-assistants",
        " audio-processing",
        " buffered-transcription",
        " pyaudio",
        " ambient-noise-level",
        " voice-deactivity"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "63536d612045e4fccbd7e0887deeb59537313a9a902b292d18823d64d459e02d",
                "md5": "01a3c11d915cf27818f8879e2dc602d4",
                "sha256": "8c0cf6b0db0b38b7cb6ba13fb68e8bade54b4636382758e80a4c7c3691f2cf89"
            },
            "downloads": -1,
            "filename": "RealTimeSTT-0.1.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "01a3c11d915cf27818f8879e2dc602d4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 23942,
            "upload_time": "2024-06-02T09:46:03",
            "upload_time_iso_8601": "2024-06-02T09:46:03.699050Z",
            "url": "https://files.pythonhosted.org/packages/63/53/6d612045e4fccbd7e0887deeb59537313a9a902b292d18823d64d459e02d/RealTimeSTT-0.1.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1cdeaf1b596d35bcf654e27d5467b9d3ba92d075e74ffab67f7e619e7b9e5988",
                "md5": "5a90392fe363fd3364447df0ec29c346",
                "sha256": "efd6e3292ed28916456292cfeb9cf4956e896c3762b2abb89057e304db7a00e6"
            },
            "downloads": -1,
            "filename": "RealTimeSTT-0.1.16.tar.gz",
            "has_sig": false,
            "md5_digest": "5a90392fe363fd3364447df0ec29c346",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 23373,
            "upload_time": "2024-06-02T09:46:05",
            "upload_time_iso_8601": "2024-06-02T09:46:05.796236Z",
            "url": "https://files.pythonhosted.org/packages/1c/de/af1b596d35bcf654e27d5467b9d3ba92d075e74ffab67f7e619e7b9e5988/RealTimeSTT-0.1.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-02 09:46:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "KoljaB",
    "github_project": "RealTimeSTT",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "PyAudio",
            "specs": [
                [
                    "==",
                    "0.2.14"
                ]
            ]
        },
        {
            "name": "faster-whisper",
            "specs": [
                [
                    "==",
                    "1.0.2"
                ]
            ]
        },
        {
            "name": "pvporcupine",
            "specs": [
                [
                    "==",
                    "1.9.5"
                ]
            ]
        },
        {
            "name": "webrtcvad",
            "specs": [
                [
                    "==",
                    "2.0.10"
                ]
            ]
        },
        {
            "name": "halo",
            "specs": [
                [
                    "==",
                    "0.0.31"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.3.0"
                ]
            ]
        },
        {
            "name": "torchaudio",
            "specs": [
                [
                    "==",
                    "2.3.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "websockets",
            "specs": [
                [
                    "==",
                    "v12.0"
                ]
            ]
        }
    ],
    "lcname": "realtimestt"
}
        