pronunciation-dictionary

Name	pronunciation-dictionary
Version	0.0.6
Summary	Library to save and load pronunciation dictionaries (language-independent).
upload_time	2024-01-22 16:13:37
requires_python	<3.13,>=3.8
license	MIT
keywords	arpabet ipa x-sampa cmu tts text-to-speech speech synthesis language linguistics
requirements	No requirements were recorded.

# pronunciation-dictionary

[![PyPI](https://img.shields.io/pypi/v/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary)
[![PyPI](https://img.shields.io/pypi/pyversions/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary)
[![MIT](https://img.shields.io/github/license/stefantaubert/pronunciation-dictionary.svg)](LICENSE)
[![PyPI](https://img.shields.io/pypi/wheel/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary/#files)
![PyPI](https://img.shields.io/pypi/implementation/pronunciation-dictionary.svg)
[![PyPI](https://img.shields.io/github/commits-since/stefantaubert/pronunciation-dictionary/latest/master.svg)](https://github.com/stefantaubert/pronunciation-dictionary/compare/v0.0.6...master)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10552058.svg)](https://doi.org/10.5281/zenodo.10552058)

Library to save and load pronunciation dictionaries (language-independent).

## Features

- Load dictionary from file or URL
  - Parsing of
    - line comments
    - pronunciation comments
    - numbers indicating alternative pronunciations for words
    - weights
  - Multiprocessing for faster deserialization
- Save dictionary to file
  - include numbers for alternative pronunciations
  - include weights
  - set word/weight/pronunciation separator
- Select pronunciation via
  - first/last
  - longest/shortest
  - highest/lowest weight
  - random
  - weight
- Get phoneme set
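
The selection strategies and the phoneme set can be illustrated on plain Python data. The following is a minimal sketch, not the library's implementation: it assumes a word's pronunciations are held as a mapping from phoneme tuples to weights, and the helper names `select_pronunciation` and `phoneme_set_of`, the mode strings, and the weights are hypothetical.

```python
from typing import Dict, Set, Tuple

Pronunciation = Tuple[str, ...]
Pronunciations = Dict[Pronunciation, float]


def select_pronunciation(pronunciations: Pronunciations, mode: str) -> Pronunciation:
  """Pick one pronunciation (hypothetical helper, not the library's API)."""
  candidates = list(pronunciations)  # dicts keep insertion order
  if mode == "first":
    return candidates[0]
  if mode == "last":
    return candidates[-1]
  if mode == "longest":
    return max(candidates, key=len)
  if mode == "shortest":
    return min(candidates, key=len)
  if mode == "highest-weight":
    return max(candidates, key=pronunciations.get)
  if mode == "lowest-weight":
    return min(candidates, key=pronunciations.get)
  raise ValueError(f"unknown mode: {mode}")


def phoneme_set_of(dictionary: Dict[str, Pronunciations]) -> Set[str]:
  """Collect every phoneme that occurs in any pronunciation."""
  return {
    phoneme
    for pronunciations in dictionary.values()
    for pronunciation in pronunciations
    for phoneme in pronunciation
  }


# Illustrative weights; they are assumptions, not values from a real dictionary.
dismantle: Pronunciations = {
  ("D", "IH0", "S", "M", "AE1", "N", "T", "AH0", "L"): 0.7,
  ("D", "IH0", "S", "M", "AE1", "N", "AH0", "L"): 0.3,
}

print(select_pronunciation(dismantle, "shortest"))
print(sorted(phoneme_set_of({"dismantle": dismantle})))
```

On ties, `max` and `min` return the first matching candidate, so the insertion order of the pronunciations decides.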

## Example dictionaries and deserialization arguments

- [Montreal Forced Aligner dictionaries](https://github.com/MontrealCorpusTools/mfa-models/tree/main/dictionary)
  - `encoding: "UTF-8"`
- [CMU](https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict)
  - `encoding: "ISO-8859-1"`
  - `consider_numbers: True`
  - `consider_pronunciation_comments: True`
- [LibriSpeech](https://www.openslr.org/resources/11/librispeech-lexicon.txt)
  - `encoding: "UTF-8"`
- [Prosodylab](https://raw.githubusercontent.com/prosodylab/Prosodylab-Aligner/master/eng.dict)
- Old: [CMU 0.7b](http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b)
  - `encoding: "ISO-8859-1"`
  - `consider_comments: True`
  - `consider_numbers: True`

### Excerpt from CMU (as example)

```dict
a.d. EY2 D IY1
a.m. EY2 EH1 M
a.s EY1 Z
aaa T R IH2 P AH0 L EY1
aaberg AA1 B ER0 G
aachen AA1 K AH0 N
aachener AA1 K AH0 N ER0
aaker AA1 K ER0
aalborg AO1 L B AO0 R G # place, danish
aalborg(2) AA1 L B AO0 R G
```
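
To make the line format in the excerpt concrete, here is a minimal parsing sketch. It only illustrates the structure (word, optional number for an alternative pronunciation, phonemes, optional pronunciation comment) and is not how the library parses files; the function `parse_line` and the regular expression are hypothetical, and weights are not handled.

```python
import re
from typing import Optional, Tuple

# Word, optional "(n)" suffix, whitespace, then the phonemes.
# Hypothetical pattern for illustration only.
LINE_PATTERN = re.compile(
  r"^(?P<word>\S+?)(?:\((?P<number>\d+)\))?\s+(?P<phonemes>.+)$"
)

ParsedLine = Tuple[str, Optional[int], Tuple[str, ...], Optional[str]]


def parse_line(line: str) -> ParsedLine:
  """Split one dictionary line into word, number, phonemes and comment."""
  # A pronunciation comment starts with " #" after the phonemes.
  body, separator, comment = line.partition(" #")
  match = LINE_PATTERN.match(body.strip())
  if match is None:
    raise ValueError(f"cannot parse line: {line!r}")
  number = match.group("number")
  return (
    match.group("word"),
    int(number) if number else None,
    tuple(match.group("phonemes").split()),
    comment.strip() if separator else None,
  )


print(parse_line("aalborg AO1 L B AO0 R G # place, danish"))
print(parse_line("aalborg(2) AA1 L B AO0 R G"))
```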

## Installation

```sh
pip install pronunciation-dictionary --user
```

## Usage

```py
from pronunciation_dictionary import load_dict, save_dict, MultiprocessingOptions, DeserializationOptions, SerializationOptions
```

### Example

```py
from pathlib import Path

from pronunciation_dictionary import (DeserializationOptions, 
  MultiprocessingOptions, SerializationOptions, 
  get_phoneme_set, load_dict_from_url, save_dict)

dictionary = load_dict_from_url(
  "https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict",
  "ISO-8859-1",
  DeserializationOptions(False, True, True, False),
  MultiprocessingOptions(4, None, 10000)
)

phoneme_set = get_phoneme_set(dictionary)

print(phoneme_set)
# {'Z', 'EY1', 'AH0', 'F', 'AE0', 'UW0', 'CH', 'G', 'V', 'AY1', 'AO2', 'ZH', 'AA1', 'IY1', 'AW0', 'T', 'TH', 'AY2', 'DH', 'S', 'W', 'ER1', 'AA2', 'AE2', 'AE1', 'AW1', 'UW1', 'AH1', 'Y', 'EY2', 'AO0', 'OW2', 'OY2', 'IY2', 'JH', 'N', 'NG', 'P', 'IH2', 'M', 'OW0', 'L', 'UH1', 'IY0', 'EY0', 'HH', 'IH0', 'SH', 'AH2', 'AW2', 'EH2', 'OW1', 'D', 'R', 'IH1', 'AO1', 'B', 'UH2', 'UH0', 'ER0', 'UW2', 'ER2', 'EH0', 'AY0', 'AA0', 'EH1', 'OY1', 'OY0', 'K'}

pronunciations_dismantle = dictionary.get("dismantle")

for pronunciation, weight in pronunciations_dismantle.items():
  print(pronunciation, weight)
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'T', 'AH0', 'L') 1.0
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'AH0', 'L') 1.0

save_dict(dictionary, Path("/tmp/cmu.dict"), "UTF-8",
          SerializationOptions("DOUBLE-SPACE", False, False))
```

```sh
head /tmp/cmu.dict
# 'bout  B AW1 T
# 'cause  K AH0 Z
# 'course  K AO1 R S
# 'cuse  K Y UW1 Z
# 'em  AH0 M
# 'frisco  F R IH1 S K OW0
# 'gain  G EH1 N
# 'kay  K EY1
# 'm  AH0 M
# 'n  AH0 N
```

## Roadmap

- replace `SerializationOptions`, `DeserializationOptions` and `MultiprocessingOptions` with parameters
- add default parameter values
- add more tests

## Development setup

```sh
# update
sudo apt update
# install Python 3.8, 3.9, 3.10, 3.11 & 3.12 to ensure that the tests can be run
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv \
  python3.12 python3.12-dev python3.12-distutils python3.12-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/pronunciation-dictionary.git
cd pronunciation-dictionary
# create virtual environment
python3.8 -m pipenv install --dev
```

## Running the tests

```sh
# first install the tool as described in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd pronunciation-dictionary
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```

Final lines of test result output:

```log
  py38: commands succeeded
  py39: commands succeeded
  py310: commands succeeded
  py311: commands succeeded
  py312: commands succeeded
  congratulations :)
```

## License

MIT License

## Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

## Citation

If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see *About => Cite this repository*) or the plain-text reference below.

```txt
Taubert, S. (2024). pronunciation-dictionary (Version 0.0.6) [Computer software]. https://doi.org/10.5281/zenodo.7386813
```

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "pronunciation-dictionary",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<3.13,>=3.8",
    "maintainer_email": "Stefan Taubert <pypi@stefantaubert.com>",
    "keywords": "ARPAbet,IPA,X-SAMPA,CMU,TTS,Text-to-speech,Speech synthesis,Language,Linguistics",
    "author": "",
    "author_email": "Stefan Taubert <pypi@stefantaubert.com>",
    "download_url": "https://files.pythonhosted.org/packages/70/f5/6c5aa437d6ddcb47d688e7bda4d735d17fbf278c37772ffe4f6e1ec20284/pronunciation-dictionary-0.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Library to save and load pronunciation dictionaries (language-independent).",
    "version": "0.0.6",
    "project_urls": {
        "Homepage": "https://github.com/stefantaubert/pronunciation-dictionary",
        "Issues": "https://github.com/stefantaubert/pronunciation-dictionary/issues"
    },
    "split_keywords": [
        "arpabet",
        "ipa",
        "x-sampa",
        "cmu",
        "tts",
        "text-to-speech",
        "speech synthesis",
        "language",
        "linguistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "931d58e186e9abd0bbac690deb2be96b21f1a156807830df1a6bec471dfbea87",
                "md5": "778d48ec37a17dc32af566d49abc8c18",
                "sha256": "6f6e70176e8f4c707f67a1cf913e0ca33d5d44ce0fe4d0ce7a35b7329e7577cd"
            },
            "downloads": -1,
            "filename": "pronunciation_dictionary-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "778d48ec37a17dc32af566d49abc8c18",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.8",
            "size": 11946,
            "upload_time": "2024-01-22T16:13:33",
            "upload_time_iso_8601": "2024-01-22T16:13:33.257235Z",
            "url": "https://files.pythonhosted.org/packages/93/1d/58e186e9abd0bbac690deb2be96b21f1a156807830df1a6bec471dfbea87/pronunciation_dictionary-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "70f56c5aa437d6ddcb47d688e7bda4d735d17fbf278c37772ffe4f6e1ec20284",
                "md5": "e7ffabed477db0933355e20a3d8860d9",
                "sha256": "f087ccd375dfb80f37cf5905594b3cba7d36071d31a6c08b4f4245a0d799efad"
            },
            "downloads": -1,
            "filename": "pronunciation-dictionary-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e7ffabed477db0933355e20a3d8860d9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.8",
            "size": 12094,
            "upload_time": "2024-01-22T16:13:37",
            "upload_time_iso_8601": "2024-01-22T16:13:37.678051Z",
            "url": "https://files.pythonhosted.org/packages/70/f5/6c5aa437d6ddcb47d688e7bda4d735d17fbf278c37772ffe4f6e1ec20284/pronunciation-dictionary-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-22 16:13:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "stefantaubert",
    "github_project": "pronunciation-dictionary",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pronunciation-dictionary"
}