| Field | Value |
| --- | --- |
| Name | listening-node |
| Version | 0.0.28 |
| home_page | None |
| Summary | None |
| upload_time | 2025-02-11 13:34:27 |
| maintainer | None |
| docs_url | None |
| author | C. Thomas Brittain |
| requires_python | <4.0,>=3.10 |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
## Setup
A simple toolset for using [Whisper](https://openai.com/index/whisper/) models to transcribe audio in real-time.
`listening_node` is a wrapper around the Whisper library that provides a simple interface for transcribing audio in real-time. The module is designed to be versatile, piping transcription data to local or remote endpoints for further processing. Every aspect of the transcription can be configured via a config file (see the Config section below).
## Install
```
pip install listening-node
```
## Prerequisites
### macOS
1. Install PortAudio: `brew install portaudio`
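### Linux
1. The PortAudio headers are typically provided by the `portaudio19-dev` package on Debian/Ubuntu (e.g. `sudo apt install portaudio19-dev`); other distributions package PortAudio similarly.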
## Attribution
The core of this code was heavily influenced by, and includes some code from:
- https://github.com/davabase/whisper_real_time/tree/master
- https://github.com/openai/whisper/discussions/608
Huge thanks to [davabase](https://github.com/davabase) for the initial code! All I've done is wrap it up in a nice package.
## Examples
Below is a basic example of how to use the listening node to transcribe audio in real-time.
```python
from rich import print

from listening_node import Config, RecordingDevice, ListeningNode, TranscriptionResult


def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Called each time a phrase has been transcribed.
    print("Here's what I heard: ")
    print(result)


# Load all settings from the YAML config file (see the Config section below).
config = Config.load("config.yaml")

# Wire the microphone described by mic_config into the listening node.
recording_device = RecordingDevice(config.mic_config)
listening_node = ListeningNode(
    config.listening_node,
    recording_device,
)

# Start listening; the callback fires for every completed transcription.
listening_node.listen(transcription_callback)
```
The `transcription_callback` function is called when a transcription is completed.
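The callback can do anything with the finished phrase. As a minimal sketch (not part of the package itself), here is one that appends each phrase to a plain-text transcript file:

```python
from datetime import datetime

from listening_node import TranscriptionResult


def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Append each finished phrase to a local transcript file,
    # stamped with the wall-clock time it was received.
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open("transcript.txt", "a", encoding="utf-8") as f:
        f.write(f"[{timestamp}] {text}\n")
```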
### Sending Transcription to REST API
```python
import requests

from listening_node import Config, RecordingDevice, ListeningNode, TranscriptionResult


def transcription_callback(text: str, result: TranscriptionResult) -> None:
    # Forward each finished transcription to a REST API as JSON.
    requests.post(
        "http://localhost:5000/transcribe",
        json={"text": text, "result": result.to_dict()},
        timeout=5,
    )


config = Config.load("config.yaml")
recording_device = RecordingDevice(config.mic_config)
listening_node = ListeningNode(
    config.listening_node,
    recording_device,
)
listening_node.listen(transcription_callback)
```
The `TranscriptionResult` object has a `.to_dict()` method that converts the object to a dictionary, which can be serialized to JSON.
```json
{
"text": "This is only a test of words.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 1.8,
"text": " This is only a test of words.",
"tokens": [50363, 770, 318, 691, 257, 1332, 286, 2456, 13, 50463],
"temperature": 0.0,
"avg_logprob": -0.43947878750887787,
"compression_ratio": 0.8285714285714286,
"no_speech_prob": 0.0012085052439942956,
"words": [
{"word": " This", "start": 0.0, "end": 0.36, "probability": 0.750191330909729},
{"word": " is", "start": 0.36, "end": 0.54, "probability": 0.997636079788208},
{"word": " only", "start": 0.54, "end": 0.78, "probability": 0.998072624206543},
{"word": " a", "start": 0.78, "end": 1.02, "probability": 0.9984667897224426},
{"word": " test", "start": 1.02, "end": 1.28, "probability": 0.9980781078338623},
{"word": " of", "start": 1.28, "end": 1.48, "probability": 0.99817955493927},
{"word": " words.", "start": 1.48, "end": 1.8, "probability": 0.9987621307373047}
]
}
],
"language": "en",
"processing_secs": 5.410359,
"local_starttime": "2025-01-31T06:19:03.322642-06:00",
"processing_rolling_avg_secs": 22.098183908976
}
```
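For completeness, here is a minimal sketch of a server that could receive the POST from the example above. It assumes Flask is installed; the `/transcribe` route and port 5000 simply mirror the URL used in the callback and are not part of this package.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/transcribe", methods=["POST"])
def transcribe():
    payload = request.get_json()
    # "text" is the full phrase; "result" is the dict shown above.
    print(payload["text"])
    for segment in payload["result"].get("segments", []):
        print(f"{segment['start']:.2f}-{segment['end']:.2f}:{segment['text']}")
    return jsonify({"status": "ok"})


if __name__ == "__main__":
    app.run(port=5000)
```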
## Config
The config is a YAML file that controls all aspects of audio recording, model configuration, and transcription formatting. Below is an example config file.
```yaml
mic_config:
  mic_name: "Jabra SPEAK 410 USB: Audio (hw:3,0)" # Linux only
  sample_rate: 16000
  energy_threshold: 3000 # 0-4000

listening_node:
  record_timeout: 2 # 0-10
  phrase_timeout: 3 # 0-10
  in_memory: True
  transcribe_config:
    # 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en',
    # 'medium', 'large-v1', 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo'
    model: medium.en

    # Whether to display the text being decoded to the console.
    # If True, displays all the details. If False, displays
    # minimal details. If None, does not display anything.
    verbose: True

    # Temperature for sampling. It can be a tuple of temperatures,
    # which will be used successively upon failures according to
    # either compression_ratio_threshold or logprob_threshold.
    temperature: "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)" # "(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)"

    # If the gzip compression ratio is above this value,
    # treat the decoding as failed.
    compression_ratio_threshold: 2.4 # 2.4

    # If the average log probability over sampled tokens is below
    # this value, treat the decoding as failed.
    logprob_threshold: -1.0 # -1.0

    # If the no_speech probability is higher than this value AND
    # the average log probability over sampled tokens is below
    # logprob_threshold, consider the segment silent.
    no_speech_threshold: 0.6 # 0.6

    # If True, the previous output of the model is provided as a
    # prompt for the next window; disabling may make the text
    # inconsistent across windows, but the model becomes less
    # prone to getting stuck in a failure loop, such as repetition
    # looping or timestamps going out of sync.
    condition_on_previous_text: True # True

    # Extract word-level timestamps using the cross-attention
    # pattern and dynamic time warping, and include the timestamps
    # for each word in each segment.
    # NOTE: Setting this to True also adds word-level data to the
    # output, which can be useful for downstream processing. E.g.,
    # {
    #   'word': 'test',
    #   'start': np.float64(1.0),
    #   'end': np.float64(1.6),
    #   'probability': np.float64(0.8470910787582397)
    # }
    word_timestamps: True # False

    # If word_timestamps is True, merge these punctuation symbols
    # with the next word.
    prepend_punctuations: '"''“¿([{-'

    # If word_timestamps is True, merge these punctuation symbols
    # with the previous word.
    append_punctuations: '"''.。,,!!??::”)]}、'

    # Optional text to provide as a prompt for the first window.
    # This can be used to provide, or "prompt-engineer", a context
    # for transcription, e.g. custom vocabularies or proper nouns,
    # to make it more likely to predict those words correctly.
    initial_prompt: "" # ""

    # Comma-separated start,end,start,end,... timestamps
    # (in seconds) of clips to process. The last end timestamp
    # defaults to the end of the file.
    clip_timestamps: "0" # "0"

    # When word_timestamps is True, skip silent periods **longer**
    # than this threshold (in seconds) when a possible
    # hallucination is detected.
    hallucination_silence_threshold: None # float | None

    # Keyword arguments to construct DecodingOptions instances.
    # TODO: How can DecodingOptions work?

logging_config:
  level: INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
  filepath: "talking.log"
  log_entry_format: "%(asctime)s - %(levelname)s - %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"
```
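Because the config is plain YAML, it can also be inspected or tweaked programmatically before being handed to `Config.load`. A minimal sketch, assuming PyYAML is installed (the key names match the example above):

```python
import yaml

# Read the config as a plain dictionary to check or adjust values
# before the listening node loads it.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["mic_config"]["sample_rate"])                     # 16000
print(cfg["listening_node"]["transcribe_config"]["model"])  # medium.en
```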