# listening_tool

- **Version:** 0.0.39
- **Author:** C. Thomas Brittain
- **License:** MIT
- **Requires Python:** >=3.10
- **Uploaded:** 2025-03-03 12:38:20
            <!-- start setup -->
## Setup
A simple toolset for using [Whisper](https://openai.com/index/whisper/) models to transcribe audio in real-time.

`listening_tool` is a wrapper around the Whisper library that provides a simple interface for transcribing audio in real time. The module is designed to be versatile, piping the transcribed data to local or remote endpoints for further processing. All aspects of the transcription can be configured via a config file (see the Config section below).

## Other Agent Tools
- [Thinking Tool](https://github.com/Ladvien/thinking_tool) - an Ollama-based LLM server for distributed agentic operations.
- [Speaking Tool](https://github.com/Ladvien/speech_tool) - a simple text-to-speech server using Kokoro models.

### Prerequisites

#### MacOS
1. Install PortAudio: `brew install portaudio`

#### Linux

##### Ubuntu
```sh
sudo apt install portaudio19-dev -y
```

<!-- end setup -->

<!-- start quick_start -->
## Quick Start

Install the package and create a config file.
```sh
pip install listening_tool
```

Create a `config.yaml` file according to the configuration options described in the Config section below.

The following is a basic example of using the listening tool to transcribe audio in real time:
```python
from listening_tool import Config, RecordingDevice, ListeningTool, TranscriptionResult

def transcription_callback(text: str, result: TranscriptionResult) -> None:
    print("Here's what I heard: ")
    print(result)

config = Config.load("config.yaml")

recording_device = RecordingDevice(config.mic_config)
listening_tool = ListeningTool(
    config.listening_tool,
    recording_device,
)

listening_tool.listen(transcription_callback)
```

The `transcription_callback` function is called each time a transcription completes, receiving both the raw text and the full `TranscriptionResult`.
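
For reference, here is a sketch of inspecting the result inside the callback. The field names used below are assumptions based on the `to_dict()` output shown later under "Send Text to Web API"; they are not guaranteed by the library:

```python
from listening_tool import TranscriptionResult

def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Sketch only: "language", "segments", "start", "end", and "text"
    # mirror the to_dict() example payload shown in Advanced Usage.
    data = result.to_dict()
    print(f"Heard ({data['language']}): {text}")
    for segment in data["segments"]:
        print(f"  [{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text'].strip()}")
```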

<!-- end quick_start -->

## Documentation
- [Documentation](https://listening_tool.readthedocs.io/en/latest/)

## Attribution
The core of this code was heavily influenced by, and includes some code from:
- https://github.com/davabase/whisper_real_time/tree/master
- https://github.com/openai/whisper/discussions/608

Huge thanks to [davabase](https://github.com/davabase) for the initial code!  All I've done is wrap it up in a nice package.

<!-- start advanced_usage -->
### Send Text to Web API
```python
import requests
from listening_tool import Config, RecordingDevice, ListeningTool, TranscriptionResult

def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Send the transcription to a REST API
    requests.post(
        "http://localhost:5000/transcribe",
        json={"text": text, "result": result.to_dict()}
    )

config = Config.load("config.yaml")
recording_device = RecordingDevice(config.mic_config)
listening_tool = ListeningTool(
    config.listening_tool,
    recording_device,
)
listening_tool.listen(transcription_callback)
```

The `TranscriptionResult` object has a `.to_dict()` method that converts the object to a dictionary, which can be serialized to JSON.

```json
{
    "text": "This is only a test of words.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 1.8,
            "text": " This is only a test of words.",
            "tokens": [50363, 770, 318, 691, 257, 1332, 286, 2456, 13, 50463],
            "temperature": 0.0,
            "avg_logprob": -0.43947878750887787,
            "compression_ratio": 0.8285714285714286,
            "no_speech_prob": 0.0012085052439942956,
            "words": [
                {"word": " This", "start": 0.0, "end": 0.36, "probability": 0.750191330909729},
                {"word": " is", "start": 0.36, "end": 0.54, "probability": 0.997636079788208},
                {"word": " only", "start": 0.54, "end": 0.78, "probability": 0.998072624206543},
                {"word": " a", "start": 0.78, "end": 1.02, "probability": 0.9984667897224426},
                {"word": " test", "start": 1.02, "end": 1.28, "probability": 0.9980781078338623},
                {"word": " of", "start": 1.28, "end": 1.48, "probability": 0.99817955493927},
                {"word": " words.", "start": 1.48, "end": 1.8, "probability": 0.9987621307373047}
            ]
        }
    ],
    "language": "en",
    "processing_secs": 5.410359,
    "local_starttime": "2025-01-31T06:19:03.322642-06:00",
    "processing_rolling_avg_secs": 22.098183908976
}
```
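
On the receiving side, any framework that accepts JSON will work. Below is a minimal sketch of a matching endpoint using Flask (Flask is not a dependency of `listening_tool`; the `/transcribe` route and port 5000 simply mirror the example URL above):

```python
# Minimal Flask receiver for the POST request shown above.
# Assumes: pip install flask
from flask import Flask, request

app = Flask(__name__)

@app.post("/transcribe")
def transcribe():
    payload = request.get_json()
    print("text:", payload["text"])
    for segment in payload["result"]["segments"]:
        print(f'  [{segment["start"]:.2f}s - {segment["end"]:.2f}s]{segment["text"]}')
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(port=5000)
```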
<!-- end advanced_usage -->

<!-- start config -->
## Config
The config is a `yaml` file controlling all aspects of the audio recording, model configuration, and transcription formatting. Below is an example config file.

```yaml
mic_config:
  mic_name: "Jabra SPEAK 410 USB: Audio (hw:3,0)" # Linux only
  sample_rate: 16000
  energy_threshold: 3000 # 0-4000

listening_tool:
  record_timeout: 2 # 0-10
  phrase_timeout: 3 # 0-10
  in_memory: True
  transcribe_config:
    # Available models: 'tiny.en', 'tiny', 'base.en', 'base',
    # 'small.en', 'small', 'medium.en', 'medium', 'large-v1',
    # 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo'
    model: medium.en

    # Whether to display the text being decoded to the console.
    # If True, displays all the details; if False, displays
    # minimal details; if None, does not display anything.
    verbose: True

    # Temperature for sampling. It can be a tuple of temperatures,
    # which will be successively used upon failures according to
    # either compression_ratio_threshold or logprob_threshold.
    temperature: "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)" # "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)"

    # If the gzip compression ratio is above this value,
    # treat as failed
    compression_ratio_threshold: 2.4 # 2.4

    # If the average log probability over sampled tokens is below this value, treat as failed
    logprob_threshold: -1.0 # -1.0

    # If the no_speech probability is higher than this value AND
    # the average log probability over sampled tokens is below
    # logprob_threshold, consider the segment as silent
    no_speech_threshold: 0.6 # 0.6

    # if True, the previous output of the model is provided as a
    # prompt for the next window; disabling may make the text
    # inconsistent across windows, but the model becomes less
    # prone to getting stuck in a failure loop, such as repetition
    # looping or timestamps going out of sync.
    condition_on_previous_text: True # True

    # Extract word-level timestamps using the cross-attention
    # pattern and dynamic time warping, and include the timestamps
    # for each word in each segment.
    # NOTE: Setting this to true also adds word level data to the
    # output, which can be useful for downstream processing.  E.g.,
    # {
    #   'word': 'test',
    #   'start': np.float64(1.0),
    #   'end': np.float64(1.6),
    #   'probability': np.float64(0.8470910787582397)
    # }
    word_timestamps: True # False

    # If word_timestamps is True, merge these punctuation symbols
    # with the next word
    prepend_punctuations: '"''“¿([{-'

    # If word_timestamps is True, merge these punctuation symbols with the previous word
    append_punctuations: '"''.。,,!!??::”)]}、'

    # Optional text to provide as a prompt for the first window.
    # This can be used to provide, or "prompt-engineer" a context
    # for transcription, e.g. custom vocabularies or proper nouns
    # to make it more likely to predict those words correctly.
    initial_prompt: "" # ""

    # Comma-separated list start,end,start,end,... timestamps
    # (in seconds) of clips to process. The last end timestamp
    # defaults to the end of the file.
    clip_timestamps: "0" # "0"

    # When word_timestamps is True, skip silent periods **longer**
    # than this threshold (in seconds) when a possible
    # hallucination is detected
    hallucination_silence_threshold: None # float | None

    # Keyword arguments to construct DecodingOptions instances
    # TODO: How can DecodingOptions work?

logging_config:
  level: INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
  filepath: "talking.log"
  log_entry_format: "%(asctime)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"
```
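
Before handing the file to `Config.load`, a quick way to sanity-check the YAML syntax is to parse it directly. This sketch uses PyYAML, which is not necessarily a dependency of `listening_tool`:

```python
# Sanity-check config.yaml: confirm it parses and contains the
# top-level sections shown above. Assumes: pip install pyyaml
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

for section in ("mic_config", "listening_tool", "logging_config"):
    assert section in cfg, f"missing section: {section}"

print("model:", cfg["listening_tool"]["transcribe_config"]["model"])
```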
<!-- end config -->
            
