# WikiPron
[![PyPI
version](https://badge.fury.io/py/wikipron.svg)](https://pypi.org/project/wikipron)
[![Supported Python
versions](https://img.shields.io/pypi/pyversions/wikipron.svg)](https://pypi.org/project/wikipron)
[![CircleCI](https://circleci.com/gh/CUNY-CL/wikipron/tree/master.svg?style=shield)](https://circleci.com/gh/CUNY-CL/wikipron/tree/master)
[![Paper](http://img.shields.io/badge/paper-ACL:2020.lrec--1.521-B31B1B.svg)](https://www.aclweb.org/anthology/2020.lrec-1.521/)
[![Conference](http://img.shields.io/badge/LREC-2020-4b44ce.svg)](https://lrec2020.lrec-conf.org/en/)
WikiPron is a command-line tool and Python API for mining multilingual
pronunciation data from Wiktionary, as well as a database of pronunciation
dictionaries mined using this tool.
- [Command-line tool](#command-line-tool)
- [Python API](#python-api)
- [Data](#data)
- [Models](#models)
- [Development](#development)
If you use WikiPron in your research, please cite the following:
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean
Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). [Massively
multilingual pronunciation mining with
WikiPron](https://www.aclweb.org/anthology/2020.lrec-1.521/). In *Proceedings of
the 12th Language Resources and Evaluation Conference*, pages 4223-4228.
\[[bibtex](https://www.aclweb.org/anthology/2020.lrec-1.521.bib)\]
## Command-line tool
### Installation
```bash
pip install wikipron
```
### Usage
#### Quick start
After installation, the terminal command `wikipron` will be available. As a
basic example, the following command scrapes G2P data for French:
```bash
wikipron fra
```
#### Specifying the language
The language is indicated by a three-letter [ISO
639-3](https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes) language code,
e.g., `fra` for French. To see which languages can be scraped, consult the
[complete list of languages on Wiktionary that have pronunciation
entries](https://en.wiktionary.org/wiki/Category:Terms_with_IPA_pronunciation_by_language).
#### Specifying the dialect
One can optionally specify dialects to target using the `--dialect` flag. The
dialect name can be found together with the transcription on Wiktionary. For
example, "(UK, US) IPA: /təˈmɑːtəʊ/". To select the union of several dialects,
separate them with the pipe character '\|', e.g., `--dialect='General American | US'`.
Transcriptions which lack a dialect specification are selected regardless of the
value of this flag.
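The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not WikiPron's actual implementation: the helper names `parse_dialects` and `is_selected` are hypothetical, and the assumption that an entry with several dialect labels matches when any one of them is requested is ours.

```python
def parse_dialects(flag_value):
    """Split a pipe-separated --dialect value into a set of dialect names."""
    return {name.strip() for name in flag_value.split("|") if name.strip()}


def is_selected(entry_dialects, requested):
    """Decide whether one transcription would be kept (hypothetical helper).

    entry_dialects: dialect labels attached to the transcription (may be empty).
    requested: set of dialect names parsed from the flag value.
    """
    if not entry_dialects:
        # No dialect specification: selected regardless of the flag.
        return True
    # Assumption: a match on any one label is enough.
    return any(label in requested for label in entry_dialects)


requested = parse_dialects("General American | US")
print(is_selected(["UK", "US"], requested))  # True
print(is_selected([], requested))            # True
print(is_selected(["UK"], requested))        # False
```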
#### Specifying the transcription level
By default, WikiPron selects broad transcriptions written between slashes /like
this/. One can instead select narrow transcriptions written in square brackets
\[like this\] using the `--narrow` flag. Note that some languages only have
broad or only have narrow transcriptions (e.g., Russian only has the latter).
#### Segmentation
By default, the [`segments`](https://github.com/cldf/segments) library is used
to split the transcription into whitespace-separated segments. The segmentation tends to place
IPA diacritics and modifiers on the "parent" symbol. For instance, \[kʰæt\] is
rendered `kʰ æ t`. This can be disabled using the `--no-segment` flag.
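To give a rough feel for what "placing diacritics on the parent symbol" means, here is a deliberately simplified sketch that uses only the standard library. It is not the algorithm used by the `segments` library; the function name `attach_modifiers` is hypothetical, and the rule (glue combining marks and modifier letters onto the preceding character) is an assumption that ignores many cases a real segmenter has to handle.

```python
import unicodedata


def attach_modifiers(transcription):
    """Space-separate symbols, keeping diacritics with the preceding symbol."""
    segments = []
    for char in transcription:
        category = unicodedata.category(char)
        # "M*" covers combining marks; "Lm" covers modifier letters such as ʰ.
        is_dependent = category.startswith("M") or category == "Lm"
        if segments and is_dependent:
            segments[-1] += char
        else:
            segments.append(char)
    return " ".join(segments)


print(attach_modifiers("kʰæt"))  # kʰ æ t
```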
#### Parentheses
Some transcriptions contain parentheses to indicate alternative
pronunciations. The parentheses (but not the content) are discarded in the
scrape unless the `--no-skip-parens` flag is used.
#### Output
The scraped data is organized with each \<word, pronunciation\> pair on its own
line, where the word and pronunciation are separated by a tab. Note that the
pronunciation is in [International Phonetic Alphabet
(IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet), segmented
by spaces that correctly handle the combining and modifier diacritics for
modeling purposes, e.g., we have `kʰ æ t` with the aspirated k instead of
`k ʰ æ t`.
For illustration, here is a snippet of French data scraped by WikiPron:
```tsv
accrémentitielle a k ʁ e m ɑ̃ t i t j ɛ l
accrescent a k ʁ ɛ s ɑ̃
accrétion a k ʁ e s j ɔ̃
accrétions a k ʁ e s j ɔ̃
```
By default, the scraped data appears in the terminal. To save the data in a TSV
file, please redirect the standard output to a filename of your choice:
```bash
wikipron fra > fra.tsv
```
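Once the output is saved, the file can be read back with a few lines of Python. The following is a minimal sketch, assuming the tab-separated, space-segmented format described above; the function name `read_pairs` is hypothetical, and the in-memory sample stands in for a real file such as `fra.tsv`.

```python
import io


def read_pairs(handle):
    """Yield (word, segments) pairs from WikiPron-style TSV lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # Skip blank lines.
        word, _, pron = line.partition("\t")
        # The pronunciation is segmented by single spaces.
        yield word, pron.split(" ")


# Two lines taken from the French snippet above, with explicit tabs.
sample = io.StringIO(
    "accrescent\ta k ʁ ɛ s ɑ̃\n"
    "accrétion\ta k ʁ e s j ɔ̃\n"
)
for word, segments in read_pairs(sample):
    print(word, len(segments))
```

For a file on disk, one would pass `open("fra.tsv", encoding="utf-8")` instead of the `io.StringIO` object.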
#### Advanced options
The `wikipron` terminal command has an array of options to configure your
scraping run. For a full list of the options, please run `wikipron -h`.
## Python API
The underlying module can also be used from Python. A standard workflow looks
like:
```python
import wikipron
config = wikipron.Config(key="fra") # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
```
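As one possible way to fill in the loop body, the pairs can be written out in the same tab-separated layout that the command-line tool prints. This is a sketch rather than part of WikiPron: `write_tsv` is a hypothetical helper that accepts any iterable of `(word, pron)` string pairs, and the demonstration uses an in-memory list so that it runs without network access.

```python
import io


def write_tsv(pairs, handle):
    """Write (word, pron) pairs as tab-separated lines; return the count."""
    count = 0
    for word, pron in pairs:
        handle.write(f"{word}\t{pron}\n")
        count += 1
    return count


# Stand-in data copied from the French snippet in the Output section.
pairs = [
    ("accrétion", "a k ʁ e s j ɔ̃"),
    ("accrétions", "a k ʁ e s j ɔ̃"),
]
buffer = io.StringIO()
print(write_tsv(pairs, buffer))  # 2
```

In a real run, `wikipron.scrape(config)` from the workflow above could be passed as `pairs`, and `handle` could be a file opened with `encoding="utf-8"`.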
## Data
We also make available [a database of over 3 million word/pronunciation
pairs](https://github.com/CUNY-CL/wikipron/tree/master/data) mined using
WikiPron.
## Models
We host grapheme-to-phoneme models and modeling software [in a separate
repository](https://github.com/kylebgorman/wikipron-modeling).
## Development
### Repository
The source code of WikiPron is hosted on GitHub at
[`https://github.com/CUNY-CL/wikipron`](https://github.com/CUNY-CL/wikipron),
where development also happens.
To get the latest changes not yet released through `pip`, or to work on the
codebase yourself, you may obtain the latest source code through GitHub and `git`:
1. Create a fork of the `wikipron` repo on your GitHub account.
2. Locally, make sure you are in some sort of virtual environment (venv,
   virtualenv, conda, etc.).
3. Download and install the library in "editable" mode together with the
   core and dev dependencies within the virtual environment:
```bash
git clone https://github.com/<your-github-username>/wikipron.git
cd wikipron
pip install -U pip setuptools
pip install -r requirements.txt
pip install --no-deps -e .
```
We keep track of notable changes in
[`CHANGELOG.md`](https://github.com/CUNY-CL/wikipron/blob/master/CHANGELOG.md).
### Contributing
For questions, bug reports, and feature requests, please [file an
issue](https://github.com/CUNY-CL/wikipron/issues).
If you would like to contribute to the `wikipron` codebase, please see
[CONTRIBUTING.md](https://github.com/CUNY-CL/wikipron/blob/master/CONTRIBUTING.md).
### License
WikiPron is released under an Apache 2.0 license. Please see
[LICENSE.txt](https://github.com/CUNY-CL/wikipron/blob/master/LICENSE.txt) for
details.
Please note that Wiktionary data in the
[`data/`](https://github.com/CUNY-CL/wikipron/tree/master/data) directory has
[its own licensing terms](https://en.wiktionary.org/wiki/Wiktionary:Copyrights).