# pronunciation-dictionary
[![PyPI](https://img.shields.io/pypi/v/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary)
[![PyPI](https://img.shields.io/pypi/pyversions/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary)
[![MIT](https://img.shields.io/github/license/stefantaubert/pronunciation-dictionary.svg)](LICENSE)
[![PyPI](https://img.shields.io/pypi/wheel/pronunciation-dictionary.svg)](https://pypi.python.org/pypi/pronunciation-dictionary/#files)
![PyPI](https://img.shields.io/pypi/implementation/pronunciation-dictionary.svg)
[![PyPI](https://img.shields.io/github/commits-since/stefantaubert/pronunciation-dictionary/latest/master.svg)](https://github.com/stefantaubert/pronunciation-dictionary/compare/v0.0.6...master)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10552058.svg)](https://doi.org/10.5281/zenodo.10552058)
Library to save and load pronunciation dictionaries (language-independent).
## Features
- Load dictionary from file or URL
  - Parsing of
    - line comments
    - pronunciation comments
    - numbers indicating alternative pronunciations for words
    - weights
  - Multiprocessing for faster deserialization
- Save dictionary to file
  - including numbers for alternative pronunciations
  - include weights
  - set word/weight/pronunciation separator
- Select pronunciation via
  - first/last
  - longest/shortest
  - highest/lowest weight
  - random
  - weight
- Get phoneme set
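
The selection strategies listed above can be illustrated without the library. The following is only a minimal sketch in plain Python: the function `select_pronunciation` and its mode names are hypothetical and are not part of this package's API, and the weights in the toy entry are made up. It assumes an entry is an insertion-ordered mapping from pronunciation (tuple of phonemes) to weight.

```python
from collections import OrderedDict
from typing import Dict, Tuple

Pronunciation = Tuple[str, ...]


def select_pronunciation(pronunciations: Dict[Pronunciation, float], mode: str) -> Pronunciation:
  """Pick one pronunciation from an insertion-ordered mapping pronunciation -> weight."""
  candidates = list(pronunciations)
  if mode == "first":
    return candidates[0]
  if mode == "last":
    return candidates[-1]
  if mode == "longest":
    return max(candidates, key=len)
  if mode == "shortest":
    return min(candidates, key=len)
  if mode == "highest-weight":
    return max(candidates, key=pronunciations.get)
  if mode == "lowest-weight":
    return min(candidates, key=pronunciations.get)
  raise ValueError(f"unknown mode: {mode}")


# Toy entry for "dismantle"; the weights are invented for illustration
entry = OrderedDict([
  (("D", "IH0", "S", "M", "AE1", "N", "T", "AH0", "L"), 0.7),
  (("D", "IH0", "S", "M", "AE1", "N", "AH0", "L"), 0.3),
])

print(select_pronunciation(entry, "shortest"))
```

When several candidates tie, `max` and `min` return the first one in insertion order, so the result stays deterministic.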
## Example dictionaries and deserialization arguments
- [Montreal Forced Aligner dictionaries](https://github.com/MontrealCorpusTools/mfa-models/tree/main/dictionary)
  - `encoding: "UTF-8"`
- [CMU](https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict)
  - `encoding: "ISO-8859-1"`
  - `consider_numbers: True`
  - `consider_pronunciation_comments: True`
- [LibriSpeech](https://www.openslr.org/resources/11/librispeech-lexicon.txt)
  - `encoding: "UTF-8"`
- [Prosodylab](https://raw.githubusercontent.com/prosodylab/Prosodylab-Aligner/master/eng.dict)
- Old: [CMU 0.7b](http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b)
  - `encoding: "ISO-8859-1"`
  - `consider_comments: True`
  - `consider_numbers: True`
### Excerpt from CMU (as example)
```dict
a.d. EY2 D IY1
a.m. EY2 EH1 M
a.s EY1 Z
aaa T R IH2 P AH0 L EY1
aaberg AA1 B ER0 G
aachen AA1 K AH0 N
aachener AA1 K AH0 N ER0
aaker AA1 K ER0
aalborg AO1 L B AO0 R G # place, danish
aalborg(2) AA1 L B AO0 R G
```
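
To make the excerpt concrete, here is a rough sketch of how a single line could be split into word, alternative-pronunciation number, phonemes and pronunciation comment. It is not the library's parser: `parse_line` and `NUMBER_SUFFIX` are hypothetical names, and the sketch assumes that a comment starts at the first ` #` and that empty lines are filtered out beforehand.

```python
import re
from typing import Optional, Tuple

# Matches a trailing alternative-pronunciation counter such as "(2)"
NUMBER_SUFFIX = re.compile(r"\((\d+)\)$")


def parse_line(line: str) -> Tuple[str, Optional[int], Tuple[str, ...], Optional[str]]:
  """Split one non-empty line into (word, number, phonemes, comment)."""
  # Everything after the first " #" is treated as a pronunciation comment
  content, separator, raw_comment = line.partition(" #")
  comment = raw_comment.strip() if separator else None
  word, *phonemes = content.split()
  number = None
  match = NUMBER_SUFFIX.search(word)
  if match:
    number = int(match.group(1))
    word = word[:match.start()]
  return word, number, tuple(phonemes), comment


print(parse_line("aalborg AO1 L B AO0 R G # place, danish"))
print(parse_line("aalborg(2) AA1 L B AO0 R G"))
```

Both example lines resolve to the word `aalborg`, which is why such lines end up as alternative pronunciations of one entry when numbers are considered.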
## Installation
```sh
pip install pronunciation-dictionary --user
```
## Usage
```py
from pronunciation_dictionary import load_dict, save_dict, MultiprocessingOptions, DeserializationOptions, SerializationOptions
```
### Example
```py
from pathlib import Path

from pronunciation_dictionary import (DeserializationOptions,
                                      MultiprocessingOptions, SerializationOptions,
                                      get_phoneme_set, load_dict_from_url, save_dict)

# Download and parse the CMU dictionary (see the CMU settings listed above)
dictionary = load_dict_from_url(
  "https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict",
  "ISO-8859-1",
  DeserializationOptions(False, True, True, False),
  MultiprocessingOptions(4, None, 10000)
)

# Collect every phoneme that occurs in any pronunciation
phoneme_set = get_phoneme_set(dictionary)
print(phoneme_set)
# {'Z', 'EY1', 'AH0', 'F', 'AE0', 'UW0', 'CH', 'G', 'V', 'AY1', 'AO2', 'ZH', 'AA1', 'IY1', 'AW0', 'T', 'TH', 'AY2', 'DH', 'S', 'W', 'ER1', 'AA2', 'AE2', 'AE1', 'AW1', 'UW1', 'AH1', 'Y', 'EY2', 'AO0', 'OW2', 'OY2', 'IY2', 'JH', 'N', 'NG', 'P', 'IH2', 'M', 'OW0', 'L', 'UH1', 'IY0', 'EY0', 'HH', 'IH0', 'SH', 'AH2', 'AW2', 'EH2', 'OW1', 'D', 'R', 'IH1', 'AO1', 'B', 'UH2', 'UH0', 'ER0', 'UW2', 'ER2', 'EH0', 'AY0', 'AA0', 'EH1', 'OY1', 'OY0', 'K'}

# Each entry maps pronunciations (tuples of phonemes) to their weights
pronunciations_dismantle = dictionary.get("dismantle")

for pronunciation, weight in pronunciations_dismantle.items():
  print(pronunciation, weight)
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'T', 'AH0', 'L') 1.0
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'AH0', 'L') 1.0

save_dict(dictionary, Path("/tmp/cmu.dict"), "UTF-8",
          SerializationOptions("DOUBLE-SPACE", False, False))
```
```sh
head /tmp/cmu.dict
# 'bout  B AW1 T
# 'cause  K AH0 Z
# 'course  K AO1 R S
# 'cuse  K Y UW1 Z
# 'em  AH0 M
# 'frisco  F R IH1 S K OW0
# 'gain  G EH1 N
# 'kay  K EY1
# 'm  AH0 M
# 'n  AH0 N
```
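
The example output suggests that a loaded dictionary behaves like a mapping from word to an inner mapping of pronunciation (tuple of phonemes) to weight. Under that assumption, the idea behind collecting a phoneme set can be sketched in plain Python. The helper `collect_phonemes` and the toy data below are hypothetical illustrations, not the library's implementation of `get_phoneme_set`.

```python
from typing import Dict, Set, Tuple

Pronunciations = Dict[Tuple[str, ...], float]


def collect_phonemes(dictionary: Dict[str, Pronunciations]) -> Set[str]:
  """Return the set of all phonemes used by any pronunciation of any word."""
  return {
    phoneme
    for pronunciations in dictionary.values()
    for pronunciation in pronunciations
    for phoneme in pronunciation
  }


# Toy dictionary with the same shape as the entries printed in the example above
toy = {
  "dismantle": {
    ("D", "IH0", "S", "M", "AE1", "N", "T", "AH0", "L"): 1.0,
    ("D", "IH0", "S", "M", "AE1", "N", "AH0", "L"): 1.0,
  },
  "a.s": {
    ("EY1", "Z"): 1.0,
  },
}

print(sorted(collect_phonemes(toy)))
# ['AE1', 'AH0', 'D', 'EY1', 'IH0', 'L', 'M', 'N', 'S', 'T', 'Z']
```

Sorting before printing gives a stable order, whereas printing a set directly (as in the example above) may show the phonemes in a different order on each run.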
## Roadmap
- replace `SerializationOptions`, `DeserializationOptions` and `MultiprocessingOptions` with parameters
- add default parameter values
- add more tests
## Development setup
```sh
# update
sudo apt update
# install Python 3.8, 3.9, 3.10, 3.11 & 3.12 so that the tests can be run for every supported version
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv \
  python3.12 python3.12-dev python3.12-distutils python3.12-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user
# check out repo
git clone https://github.com/stefantaubert/pronunciation-dictionary.git
cd pronunciation-dictionary
# create virtual environment
python3.8 -m pipenv install --dev
```
## Running the tests
```sh
# first set up the project as described in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd pronunciation-dictionary
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```
Final lines of test result output:
```log
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
py312: commands succeeded
congratulations :)
```
## License
MIT License
## Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
## Citation
If you want to cite this repo, you can use the citation entry generated by GitHub (see *About => Cite this repository*).
```txt
Taubert, S. (2024). pronunciation-dictionary (Version 0.0.6) [Computer software]. https://doi.org/10.5281/zenodo.7386813
```