# listening-node

- Version: 0.0.28
- Author: C. Thomas Brittain
- License: MIT
- Requires Python: <4.0,>=3.10
- Uploaded: 2025-02-11 13:34:27
## Overview
A simple toolset for using [Whisper](https://openai.com/index/whisper/) models to transcribe audio in real-time.

`listening_node` is a wrapper around the Whisper library that provides a simple interface for transcribing audio in real-time. The module is designed to be versatile, piping transcription data to local or remote endpoints for further processing. All aspects of transcription can be configured via a config file (see the Config section at the bottom).

## Install
```bash
pip install listening-node
```

## Prerequisites

### macOS
1. Install PortAudio: `brew install portaudio`
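
On Linux, the config file below selects the microphone by its exact device name (`mic_config.mic_name`). To see which input devices PortAudio exposes and copy that name, a quick sketch using PyAudio directly (PyAudio is an assumption here and is installed separately; it is not part of listening-node) works:

```python
import pyaudio

# Print every input-capable device PortAudio exposes, with its index.
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(f"{i}: {info['name']}")
p.terminate()
```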

## Attribution
The core of this code was heavily influenced and includes some code from:
- https://github.com/davabase/whisper_real_time/tree/master
- https://github.com/openai/whisper/discussions/608

Huge thanks to [davabase](https://github.com/davabase) for the initial code!  All I've done is wrap it up in a nice package.

## Examples

Below is a basic example of using the listening node to transcribe audio in real-time.
```python
from rich import print

from listening_node import Config, RecordingDevice, ListeningNode, TranscriptionResult

def transcription_callback(text: str, result: TranscriptionResult) -> None:
    print("Here's what I heard: ")
    print(result)

# Load settings from the YAML config file (see the Config section below).
config = Config.load("config.yaml")

# Wire the microphone into the listening node.
recording_device = RecordingDevice(config.mic_config)
listening_node = ListeningNode(
    config.listening_node,
    recording_device,
)

# Start listening; the callback fires each time a transcription completes.
listening_node.listen(transcription_callback)
```

The `transcription_callback` function is called whenever a transcription completes, receiving both the plain text and the full `TranscriptionResult`.
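
Because the callback receives the full result object, it is a natural hook for persistence. For example, a callback that appends every transcription to a JSON Lines file might look like this (a sketch; the filename is arbitrary):

```python
import json

from listening_node import TranscriptionResult


def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Append one JSON object per completed transcription.
    with open("transcripts.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(result.to_dict()) + "\n")
```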

### Sending Transcription to REST API
```python
import requests
from listening_node import Config, RecordingDevice, ListeningNode, TranscriptionResult


def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Send the transcription to a REST API
    requests.post(
        "http://localhost:5000/transcribe",
        json={"text": text, "result": result.to_dict()}
    )

config = Config.load("config.yaml")
recording_device = RecordingDevice(config.mic_config)
listening_node = ListeningNode(
    config.listening_node,
    recording_device,
)
listening_node.listen(transcription_callback)
```
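
For local testing, the receiving end can be a single route. Below is a minimal sketch of a matching endpoint using Flask (Flask and the port are assumptions; any HTTP framework that accepts JSON will do):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/transcribe", methods=["POST"])
def transcribe():
    payload = request.get_json()
    # payload["text"] is the plain transcription; payload["result"]
    # is the full TranscriptionResult dictionary shown below.
    print(payload["text"])
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=5000)
```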

The `TranscriptionResult` object has a `.to_dict()` method that converts the object to a dictionary, which can be serialized to JSON.

```json
{
    "text": "This is only a test of words.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 1.8,
            "text": " This is only a test of words.",
            "tokens": [50363, 770, 318, 691, 257, 1332, 286, 2456, 13, 50463],
            "temperature": 0.0,
            "avg_logprob": -0.43947878750887787,
            "compression_ratio": 0.8285714285714286,
            "no_speech_prob": 0.0012085052439942956,
            "words": [
                {"word": " This", "start": 0.0, "end": 0.36, "probability": 0.750191330909729},
                {"word": " is", "start": 0.36, "end": 0.54, "probability": 0.997636079788208},
                {"word": " only", "start": 0.54, "end": 0.78, "probability": 0.998072624206543},
                {"word": " a", "start": 0.78, "end": 1.02, "probability": 0.9984667897224426},
                {"word": " test", "start": 1.02, "end": 1.28, "probability": 0.9980781078338623},
                {"word": " of", "start": 1.28, "end": 1.48, "probability": 0.99817955493927},
                {"word": " words.", "start": 1.48, "end": 1.8, "probability": 0.9987621307373047}
            ]
        }
    ],
    "language": "en",
    "processing_secs": 5.410359,
    "local_starttime": "2025-01-31T06:19:03.322642-06:00",
    "processing_rolling_avg_secs": 22.098183908976
}
```
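
Because each segment carries `start` and `end` times, the dictionary is straightforward to post-process. As one example, here is a sketch that converts the segments above into SRT-style subtitle entries (the `to_srt` helper is illustrative, not part of the package):

```python
def to_srt(result_dict: dict) -> str:
    """Render Whisper segments as SRT subtitle entries."""

    def ts(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm.
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    entries = []
    for i, seg in enumerate(result_dict["segments"], start=1):
        entries.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(entries)
```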

## Config
The config is a YAML file controlling all aspects of the audio recording, model configuration, and transcription formatting. Below is an example config file.

```yaml
mic_config:
  mic_name: "Jabra SPEAK 410 USB: Audio (hw:3,0)" # Linux only
  sample_rate: 16000
  energy_threshold: 3000 # 0-4000

listening_node:
  record_timeout: 2 # 0-10
  phrase_timeout: 3 # 0-10
  in_memory: True
  transcribe_config:
    #  'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo'
    model: medium.en

    # Whether to display the text being decoded to the console.
    # If True, displays all details; if False, displays
    # minimal details; if None, displays nothing.
    verbose: True

    # Temperature for sampling. It can be a tuple of temperatures,
    # which will be successively used upon failures according to
    # either compression_ratio_threshold or logprob_threshold.
    temperature: "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)" # "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)"

    # If the gzip compression ratio is above this value,
    # treat as failed
    compression_ratio_threshold: 2.4 # 2.4

    # If the average log probability over sampled tokens is below this value, treat as failed
    logprob_threshold: -1.0 # -1.0

    # If the no_speech probability is higher than this value AND
    # the average log probability over sampled tokens is below
    # logprob_threshold, consider the segment as silent
    no_speech_threshold: 0.6 # 0.6

    # if True, the previous output of the model is provided as a
    # prompt for the next window; disabling may make the text
    # inconsistent across windows, but the model becomes less
    # prone to getting stuck in a failure loop, such as repetition
    # looping or timestamps going out of sync.
    condition_on_previous_text: True # True

    # Extract word-level timestamps using the cross-attention
    # pattern and dynamic time warping, and include the timestamps
    # for each word in each segment.
    # NOTE: Setting this to true also adds word level data to the
    # output, which can be useful for downstream processing.  E.g.,
    # {
    #   'word': 'test',
    #   'start': np.float64(1.0),
    #   'end': np.float64(1.6),
    #   'probability': np.float64(0.8470910787582397)
    # }
    word_timestamps: True # False

    # If word_timestamps is True, merge these punctuation symbols
    # with the next word
    prepend_punctuations: '"''“¿([{-'

    # If word_timestamps is True, merge these punctuation symbols with the previous word
    append_punctuations: '"''.。,,!!??::”)]}、'

    # Optional text to provide as a prompt for the first window.
    # This can be used to provide, or "prompt-engineer" a context
    # for transcription, e.g. custom vocabularies or proper nouns
    # to make it more likely to predict those words correctly.
    initial_prompt: "" # ""

    # Comma-separated list start,end,start,end,... timestamps
    # (in seconds) of clips to process. The last end timestamp
    # defaults to the end of the file.
    clip_timestamps: "0" # "0"

    # When word_timestamps is True, skip silent periods **longer**
    # than this threshold (in seconds) when a possible
    # hallucination is detected
    hallucination_silence_threshold: None # float | None

    # Keyword arguments to construct DecodingOptions instances
    # TODO: How can DecodingOptions work?

logging_config:
  level: INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
  filepath: "talking.log"
  log_entry_format: "%(asctime)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"
```
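
Because the config is plain YAML, it can be sanity-checked before starting the node. A small sketch using PyYAML (an assumption; `listening_node` itself loads the file via `Config.load`):

```python
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Confirm the three top-level sections from the example above exist.
for section in ("mic_config", "listening_node", "logging_config"):
    assert section in cfg, f"missing section: {section}"

print(cfg["listening_node"]["transcribe_config"]["model"])
```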

            
