armspeech


Name: armspeech
Version: 0.1.4
Home page: https://github.com/Varuzhan97/armspeech
Summary: ArmSpeech is an offline Armenian speech recognition library (speech-to-text) and CLI tool based on Coqui STT (🐸STT) and trained on the ArmSpeech dataset.
Upload time: 2023-06-06 16:25:24
Author: Varuzhan Baghdasaryan
Requires Python: >=3.6,<3.11
License: MIT
Keywords: speech recognition, speech-to-text, Armenian language
Requirements: no requirements were recorded
# ArmSpeech: Armenian Speech Recognition Library

ArmSpeech is an offline Armenian speech recognition library (speech-to-text) and CLI tool based on [Coqui STT (🐸STT)](https://stt.readthedocs.io/en/latest/) and trained on the [ArmSpeech](https://www.ijscia.com/full-text-volume-3-issue-3-may-jun-2022-454-459/) dataset. [Coqui STT (🐸STT)](https://stt.readthedocs.io/en/latest/) is an open-source implementation of Baidu’s Deep Speech deep neural network. The engine is based on a recurrent neural network (RNN) and consists of 5 layers of hidden units.

The acoustic model and the language model work together to improve prediction accuracy. The acoustic model uses a sequence-to-sequence algorithm to learn which acoustic signals correspond to which letters of the alphabet: it outputs probabilities for each character class, not for whole words. To distinguish homophones (words that sound the same but are spelled differently), a language model is used, which predicts which words follow each other in a sequence (n-gram modeling).
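
For context, a CTC beam-search decoder of this kind typically scores each candidate transcription by combining the two models. The following is only an illustrative sketch (the function and argument names are not part of the armspeech API); `alpha` and `beta` are the scorer weights described in the API table below:

```
# Illustrative only: how a CTC beam-search decoder typically combines the
# acoustic model and the n-gram language model when ranking candidates.
def decoder_score(acoustic_log_prob, lm_log_prob, word_count, alpha, beta):
    # acoustic_log_prob: summed log-probabilities of the characters (CTC output)
    # lm_log_prob: n-gram language model log-probability of the word sequence
    # word_count: number of words, rewarded by the word insertion weight `beta`
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count
```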

The acoustic model was trained and validated on the ArmSpeech Armenian spoken language corpus, totaling 15.7 hours of audio. Language model training is based on the [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm/). The data for language model training was scraped from Armenian news website articles about medicine, sport, culture, lifestyle, and politics.

If you want to help me increase the accuracy of transcriptions, then <a href="https://www.buymeacoffee.com/U2jtXgrwj4"><img src="https://img.buymeacoffee.com/button-api/?text=Buy me a coffee&emoji=&slug=U2jtXgrwj4&button_colour=FFDD00&font_colour=000000&font_family=Lato&outline_colour=000000&coffee_colour=ffffff" /></a>

## API

ArmSpeech can be used both as a Python module and as a CLI tool. The library can be used in two ways:
* transcribe a WAV audio file,
* transcribe an audio stream from the microphone.

In both cases the audio must have the same parameters (a quick way to check an input file against these requirements is sketched below):
* WAV audio format,
* mono channel,
* 16000 Hz sample rate.
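
A minimal check of these requirements, using only the Python standard library; this helper is not part of armspeech, and the 16-bit sample width checked here is an assumption rather than something stated above:

```
import wave

def check_wav(path: str) -> bool:
    # Open the WAV file and verify channel count, sample rate and sample width
    with wave.open(path, 'rb') as wav:
        return (wav.getnchannels() == 1          # mono
                and wav.getframerate() == 16000  # 16000 Hz sample rate
                and wav.getsampwidth() == 2)     # 16-bit PCM (assumed)
```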

### Python

| Function name | Description                    |
| ------------- | ------------------------------ |
| `set_beam_width(self, beam_width: int) -> int`      | Set the beam width value of the model (beam width used in the CTC decoder when building candidate transcriptions). A larger beam width value generates better results but increases decoding time. The function takes an integer (`beam_width`) and returns zero on success, and non-zero on failure. The default value is 1024.       |
| `set_scorer_alpha_beta(alpha: float, beta: float) -> int`   | Set the hyperparameters alpha and beta of the external scorer (the language model weight (`alpha`) and the word insertion weight (`beta`) of the decoder). The function takes two floats (`alpha`, `beta`) and returns zero on success, and non-zero on failure. The default values are 0.931289039105002 for `alpha` and 1.1834137581510284 for `beta`.     |
| `from_wav(self, wav_path: str, get_metadata: bool = False) -> str`   | Transcribe a WAV audio file. The function takes two parameters: the absolute path of the audio file (`wav_path`) and a boolean parameter (`get_metadata`) for enabling metadata generation. The `get_metadata` parameter is optional and defaults to false. The function returns either the transcript or a tuple of metadata, which includes the transcript too.     |
| `from_mic(self, vad_aggresivness: int = 3, spinner: bool = False, wav_save_path: str = None, get_metadata = False)`   | Transcribe an audio stream taken from the microphone. The generator function takes four parameters: an integer (`vad_aggresivness`) in the range [0, 3] for voice activity detection aggressiveness, a boolean (`spinner`) for showing a spinner in the console while voice activity is detected, an absolute path (`wav_save_path`) for saving the transcribed speech, and a boolean parameter (`get_metadata`) for enabling metadata generation. All the parameters are optional (3 for `vad_aggresivness`, false for `get_metadata` and `spinner`, and empty for `wav_save_path`). The generator yields either the transcript or a tuple of metadata, which includes the transcript too.     |
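
For example, assuming both setters are methods on the same `ArmSpeech_STT` object used in the usage examples further below (the table above only shows `self` for one of them), the decoder could be tuned before transcribing:

```
from armspeech import ArmSpeech_STT

armspeech_stt = ArmSpeech_STT()

# Widen the beam (potentially better transcripts, slower decoding) and adjust
# the scorer weights; both setters are documented to return zero on success.
armspeech_stt.set_beam_width(2048)
armspeech_stt.set_scorer_alpha_beta(0.93, 1.18)
```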

The `from_mic()` generator function uses voice activity detection to detect speech by distinguishing between silence and speech. This is done with the free Python “webrtcvad” module, which is a Python interface to the WebRTC Voice Activity Detector (VAD) developed by Google. The application determines voice activity from the ratio of voiced to non-voiced frames in a 300-millisecond window: the portion of voiced frames in that window must be equal to or greater than 75%.
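
A rough sketch of this ratio-based trigger, using the same “webrtcvad” module; the frame size and the helper function are illustrative and not armspeech internals:

```
import collections
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30          # webrtcvad accepts 10, 20 or 30 ms frames of 16-bit mono PCM
WINDOW_MS = 300        # decision window from the description above

vad = webrtcvad.Vad(3) # aggressiveness in the range [0, 3]
window = collections.deque(maxlen=WINDOW_MS // FRAME_MS)

def speech_started(frame_bytes: bytes) -> bool:
    """Return True once at least 75% of the frames in the last 300 ms are voiced."""
    window.append(vad.is_speech(frame_bytes, SAMPLE_RATE))
    return len(window) == window.maxlen and sum(window) / window.maxlen >= 0.75
```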

In the `from_mic()` and `from_wav()` functions, setting the `get_metadata` parameter to true returns metadata for the audio file or stream, which includes the transcript, a confidence score, and the position of each token in seconds. An example of the returned metadata is below:

`('հայերն աշխարհի հնագույն ազգերից մեկն են', -7.672598838806152, ('հ', 0.29999998211860657), ('ա', 0.41999998688697815), ('յ', 0.4399999976158142), ('ե', 0.5), ('ր', 0.5199999809265137), ('ն', 0.5399999618530273), (' ', 0.6800000071525574), ('ա', 0.699999988079071), ('շ', 0.7400000095367432), ('խ', 0.8999999761581421), ('ա', 0.9399999976158142), ('ր', 0.9599999785423279), ('հ', 1.0), ('ի', 1.0399999618530273), (' ', 1.1799999475479126), ('հ', 1.1999999284744263), ('ն', 1.2400000095367432), ('ա', 1.399999976158142), ('գ', 1.5), ('ո', 1.5199999809265137), ('ւ', 1.5799999237060547), ('յ', 1.6799999475479126), ('ն', 1.7799999713897705), (' ', 1.7999999523162842), ('ա', 2.0799999237060547), ('զ', 2.0999999046325684), ('գ', 2.2200000286102295), ('ե', 2.3399999141693115), ('ր', 2.379999876022339), ('ի', 2.4600000381469727), ('ց', 2.4800000190734863), (' ', 2.5), ('մ', 2.679999828338623), ('ե', 2.700000047683716), ('կ', 2.8399999141693115), ('ն', 2.93999981880188), (' ', 2.9600000381469727), ('ե', 2.9800000190734863), ('ն', 3.319999933242798))`
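
Judging from this example, the metadata is a flat tuple containing the transcript, the confidence score, and one `(character, start time)` pair per token, so it could be unpacked along the following lines (a sketch based only on the example shown, not on the library internals):

```
# `metadata` is assumed to hold a value of the shape shown above
transcript, confidence, *tokens = metadata

print(transcript)   # the full transcript string
print(confidence)   # the confidence score

# Print every character together with its start time in seconds
for char, start in tokens:
    print(f'{start:6.2f}s  {char}')
```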

### CLI

The CLI takes 7 optional parameters: `wav_path`, `beam_width`, `alpha_beta`, `get_metadata`, `spinner`, `vad_aggresivness`, and `wav_save_path`. Their descriptions and return values are the same as for the Python API. If the `wav_path` parameter is not empty, the audio file will be transcribed; otherwise microphone streaming will start.

## Install

```
pip install armspeech
```

## Usage examples

### Python

```
# Import the library
from armspeech import ArmSpeech_STT

# Create the recognizer object
armspeech_stt = ArmSpeech_STT()

# Transcribe a WAV audio file
result = armspeech_stt.from_wav(wav_path='path/to/wav/audio', get_metadata=True)
print(result)

# Start microphone streaming
for result in armspeech_stt.from_mic(vad_aggresivness=2, spinner=True, wav_save_path='path/to/transcribed/speeches', get_metadata=False):
    print(result)
```

### CLI

```
armspeech_stt_cli --wav_path path/to/wav/audio --beam_width 2048 --alpha_beta 0.7 1.3 --get_metadata True
```
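
If `wav_path` is omitted, the tool switches to microphone streaming. The flags below simply mirror the parameter names listed in the CLI section above, so the exact invocation should be treated as an assumption:

```
armspeech_stt_cli --vad_aggresivness 2 --spinner True --wav_save_path path/to/transcribed/speeches
```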

## Author's profiles

- [GitHub](https://github.com/Varuzhan97)
- [LinkedIn](https://www.linkedin.com/in/varuzhan-baghdasaryan-74b064147)
- [Email](mailto:varuzh2014@gmail.com)

## Acknowledgements

 - [ArmSpeech: Armenian Spoken Language Corpus](https://www.ijscia.com/full-text-volume-3-issue-3-may-jun-2022-454-459/)
 - [Extended ArmSpeech: Armenian Spoken Language Corpus](https://www.ijscia.com/full-text-volume-3-issue-4-jul-aug-2022-573-576/)
 - [Armenian Speech Recognition System: Acoustic and Language Models](https://www.ijscia.com/full-text-volume-3-issue-5-sep-oct-2022-719-724/)
            
