wikipron

-   Name: wikipron
-   Version: 1.3.3
-   Summary: Scraping grapheme-to-phoneme data from Wiktionary
-   Upload time: 2024-07-27 20:25:24
-   Requires Python: >=3.8
-   License: Apache 2.0
-   Keywords: computational linguistics, natural language processing, phonology, phonetics, speech, language, Wiktionary
-   Requirements: black, build, flake8, lxml_html_clean, mypy, pytest, python-iso639, requests-html, requests, segments, setuptools, twine, unicodedataplus

# WikiPron

[![PyPI
version](https://badge.fury.io/py/wikipron.svg)](https://pypi.org/project/wikipron)
[![Supported Python
versions](https://img.shields.io/pypi/pyversions/wikipron.svg)](https://pypi.org/project/wikipron)
[![CircleCI](https://circleci.com/gh/CUNY-CL/wikipron/tree/master.svg?style=shield)](https://circleci.com/gh/CUNY-CL/wikipron/tree/master)
[![Paper](http://img.shields.io/badge/paper-ACL:2020.lrec--1.521-B31B1B.svg)](https://www.aclweb.org/anthology/2020.lrec-1.521/)
[![Conference](http://img.shields.io/badge/LREC-2020-4b44ce.svg)](https://lrec2020.lrec-conf.org/en/)

WikiPron is a command-line tool and Python API for mining multilingual
pronunciation data from Wiktionary, as well as a database of pronunciation
dictionaries mined using this tool.

-   [Command-line tool](#command-line-tool)
-   [Python API](#python-api)
-   [Data](#data)
-   [Models](#models)
-   [Development](#development)

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean
Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). [Massively
multilingual pronunciation mining with
WikiPron](https://www.aclweb.org/anthology/2020.lrec-1.521/). In *Proceedings of
the 12th Language Resources and Evaluation Conference*, pages 4223-4228.
\[[bibtex](https://www.aclweb.org/anthology/2020.lrec-1.521.bib)\]

## Command-line tool

### Installation

```bash
pip install wikipron
```

### Usage

#### Quick start

After installation, the terminal command `wikipron` will be available. As a
basic example, the following command scrapes G2P data for French:

```bash
wikipron fra
```

#### Specifying the language

The language is indicated by a three-letter [ISO
639-3](https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes) language code,
e.g., `fra` for French. To see which languages can be scraped, consult
[the complete list of languages on Wiktionary that have pronunciation
entries](https://en.wiktionary.org/wiki/Category:Terms_with_IPA_pronunciation_by_language).

#### Specifying the dialect

One can optionally specify dialects to target using the `--dialect` flag. The
dialect name can be found together with the transcription on Wiktionary. For
example, "(UK, US) IPA: /təˈmɑːtəʊ/". To restrict to the union of several
dialects, use the pipe character '\|': e.g., `--dialect='General American | US'`.
Transcriptions which lack a dialect specification are selected regardless of the
value of this flag.
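
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not WikiPron's actual implementation: the function names `parse_dialect_flag` and `is_selected` are hypothetical, and exact, case-sensitive matching of dialect names is an assumption.

```python
from typing import Iterable, Optional, Set


def parse_dialect_flag(flag_value: Optional[str]) -> Set[str]:
    """Splits a pipe-separated dialect string into a set of dialect names."""
    if not flag_value:
        return set()
    return {name.strip() for name in flag_value.split("|") if name.strip()}


def is_selected(labels: Iterable[str], flag_value: Optional[str]) -> bool:
    """Decides whether a transcription with the given dialect labels is kept.

    Hypothetical helper: a transcription is kept if no dialect was requested,
    if it carries no dialect label at all, or if at least one of its labels
    is among the requested dialects (the union).
    """
    wanted = parse_dialect_flag(flag_value)
    present = {label.strip() for label in labels}
    if not wanted or not present:
        return True
    return bool(wanted & present)


flag = "General American | US"
print(is_selected(["UK", "US"], flag))  # Labeled with a requested dialect.
print(is_selected([], flag))            # Unlabeled transcription.
print(is_selected(["UK"], flag))        # Labeled only with another dialect.
```

Treating the flag as a set makes the "union" reading explicit: any overlap between the requested dialects and the labels on a transcription is enough to keep it.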

#### Specifying the transcription level

By default, WikiPron selects broad transcriptions, written between slashes /like
this/. One can instead select narrow transcriptions, written in square brackets
\[like this\], using the `--narrow` flag. Note that some languages have only
broad or only narrow transcriptions (e.g., Russian has only the latter).
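
As a small illustration of the two notations, the following sketch classifies a transcription string by its delimiters. The function name `transcription_level` is hypothetical and is not part of WikiPron; it only mirrors the slash versus square-bracket convention described above.

```python
def transcription_level(transcription: str) -> str:
    """Returns "broad", "narrow", or "unknown" based on the delimiters."""
    text = transcription.strip()
    if len(text) >= 2 and text.startswith("/") and text.endswith("/"):
        return "broad"  # Slashes mark a broad transcription.
    if len(text) >= 2 and text.startswith("[") and text.endswith("]"):
        return "narrow"  # Square brackets mark a narrow transcription.
    return "unknown"


print(transcription_level("/t\u0259\u02c8m\u0251\u02d0t\u0259\u028a/"))
print(transcription_level("[k\u02b0\u00e6t]"))
```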

#### Segmentation

By default, the [`segments`](https://github.com/cldf/segments) library is used
to split the transcription into whitespace-delimited segments. The segmentation tends to place
IPA diacritics and modifiers on the "parent" symbol. For instance, \[kʰæt\] is
rendered `kʰ æ t`. This can be disabled using the `--no-segment` flag.
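
To make the idea of attaching diacritics to a "parent" symbol concrete, here is a deliberately simplified sketch using only the standard library. It is not the algorithm used by the `segments` library: it assumes that every nonspacing mark (Unicode category `Mn`) and modifier letter (category `Lm`) belongs to the preceding symbol, which is too crude for real IPA (for example, stress marks precede the syllable they modify). The function name `naive_segment` is hypothetical.

```python
import unicodedata


def naive_segment(transcription: str) -> str:
    """Joins each combining mark or modifier letter to the preceding symbol."""
    segments = []
    for char in transcription:
        category = unicodedata.category(char)
        if segments and category in ("Mn", "Lm"):
            # Attach the diacritic or modifier to its parent symbol.
            segments[-1] += char
        else:
            # Start a new segment for a base symbol.
            segments.append(char)
    return " ".join(segments)


# The aspiration mark stays on the k instead of becoming its own segment.
print(naive_segment("k\u02b0\u00e6t"))
```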

#### Parentheses

Some transcriptions contain parentheses to indicate alternative
pronunciations. The parentheses (but not the content) are discarded in the
scrape unless the `--no-skip-parens` flag is used.

#### Output

The scraped data is organized with each \<word, pronunciation\> pair on its own
line, where the word and pronunciation are separated by a tab. The
pronunciation is in [International Phonetic Alphabet
(IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet), segmented
by spaces so that combining and modifier diacritics stay attached to their base
symbol for modeling purposes; e.g., we have `kʰ æ t` with the aspirated k instead of
`k ʰ æ t`.

For illustration, here is a snippet of French data scraped by WikiPron:

```tsv
accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃
```

By default, the scraped data appears in the terminal. To save the data in a TSV
file, please redirect the standard output to a filename of your choice:

```bash
wikipron fra > fra.tsv
```
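
Once the data is saved, it can be read back with a few lines of Python. The sketch below assumes the format described above (one tab-separated pair per line, with space-separated segments); the function name `read_pronunciations` is hypothetical, and an in-memory sample stands in for a file such as `fra.tsv`.

```python
import io
from typing import Dict, Iterable, List


def read_pronunciations(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Maps each word to a list of pronunciations, each a list of segments."""
    lexicon: Dict[str, List[List[str]]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Skips blank lines.
        word, pron = line.split("\t", 1)
        # A word may occur on several lines, once per pronunciation.
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon


# In-memory stand-in for: open("fra.tsv", encoding="utf-8")
sample = io.StringIO("accrescent\ta k ʁ ɛ s ɑ̃\naccrétion\ta k ʁ e s j ɔ̃\n")
lexicon = read_pronunciations(sample)
print(lexicon["accrescent"])
```

Keeping a list of pronunciations per word, rather than a single value, avoids silently dropping entries when a word has more than one transcription.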

#### Advanced options

The `wikipron` terminal command has an array of options to configure your
scraping run. For a full list of the options, please run `wikipron -h`.

## Python API

The underlying module can also be used from Python. A standard workflow looks
like:

```python
import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
```
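
As one possible body for that loop, the sketch below writes pairs to a tab-separated stream. It assumes only what the loop above shows, namely that the scraper yields (word, pronunciation) pairs, and further assumes that both elements are strings. The helper `write_tsv` is hypothetical and not part of the WikiPron API; hard-coded pairs stand in for `wikipron.scrape(config)` so that the example runs without network access.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_tsv(pairs: Iterable[Tuple[str, str]], sink: TextIO) -> int:
    """Writes one tab-separated pair per line and returns the number written."""
    count = 0
    for word, pron in pairs:
        sink.write(f"{word}\t{pron}\n")
        count += 1
    return count


# Stand-in data; in real use, pass wikipron.scrape(config) instead.
fake_pairs = [
    ("accrescent", "a k ʁ ɛ s ɑ̃"),
    ("accrétion", "a k ʁ e s j ɔ̃"),
]
buffer = io.StringIO()
written = write_tsv(fake_pairs, buffer)
print(written)
```

Because the helper accepts any iterable, the pairs are written as they arrive rather than being collected in memory first.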

## Data

We also make available [a database of over 3 million word/pronunciation
pairs](https://github.com/CUNY-CL/wikipron/tree/master/data) mined using
WikiPron.

## Models

We host grapheme-to-phoneme models and modeling software [in a separate
repository](https://github.com/kylebgorman/wikipron-modeling).

## Development

### Repository

The source code of WikiPron is hosted on GitHub at
[`https://github.com/CUNY-CL/wikipron`](https://github.com/CUNY-CL/wikipron),
where development also happens.

To get the latest changes not yet released through `pip`, or to work on the
codebase yourself, you may obtain the latest source code through GitHub and `git`:

1.  Create a fork of the `wikipron` repo on your GitHub account.

2.  Locally, make sure you are in a virtual environment of some sort (venv,
    virtualenv, conda, etc.).

3.  Download and install the library in the "editable" mode together with the
    core and dev dependencies within the virtual environment:

    ```bash
    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    pip install -U pip setuptools
    pip install -r requirements.txt
    pip install --no-deps -e .
    ```

We keep track of notable changes in
[`CHANGELOG.md`](https://github.com/CUNY-CL/wikipron/blob/master/CHANGELOG.md).

### Contributing

For questions, bug reports, and feature requests, please [file an
issue](https://github.com/CUNY-CL/wikipron/issues).

If you would like to contribute to the `wikipron` codebase, please see
[CONTRIBUTING.md](https://github.com/CUNY-CL/wikipron/blob/master/CONTRIBUTING.md).

### License

WikiPron is released under an Apache 2.0 license. Please see
[LICENSE.txt](https://github.com/CUNY-CL/wikipron/blob/master/LICENSE.txt) for
details.

Please note that Wiktionary data in the
[`data/`](https://github.com/CUNY-CL/wikipron/tree/master/data) directory has
[its own licensing terms](https://en.wiktionary.org/wiki/Wiktionary:Copyrights).

            

        