g2p-id-py


| Field | Value |
| --- | --- |
| Name | g2p-id-py |
| Version | 0.4.2 |
| Home page | https://github.com/bookbot-kids/g2p_id |
| Summary | Indonesian G2P. |
| Upload time | 2024-12-09 03:09:20 |
| Author | w11wo |
| Requires Python | >=3.8 |
| License | Apache License |
| Requirements | num2words, nltk, onnxruntime |

# g2p ID: Indonesian Grapheme-to-Phoneme Converter

<p align="center">
    <a href="https://github.com/bookbot-kids/g2p_id/blob/main/LICENSE.md">
        <img alt="GitHub" src="https://img.shields.io/github/license/bookbot-kids/g2p_id.svg?color=blue">
    </a>
    <a href="https://bookbot-kids.github.io/g2p_id/">
        <img alt="Documentation" src="https://img.shields.io/website/http/bookbot-kids.github.io/g2p_id.svg?down_color=red&down_message=offline&up_message=online">
    </a>
    <a href="https://github.com/bookbot-kids/g2p_id/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/bookbot-kids/g2p_id.svg">
    </a>
    <a href="https://github.com/bookbot-kids/g2p_id/blob/main/CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg">
    </a>
    <a href="https://github.com/bookbot-kids/g2p_id/actions/workflows/tests.yml">
        <img alt="Tests" src="https://github.com/bookbot-kids/g2p_id/actions/workflows/tests.yml/badge.svg">
    </a>
    <a href="https://codecov.io/gh/bookbot-kids/g2p_id">
        <img alt="Code Coverage" src="https://img.shields.io/codecov/c/github/bookbot-kids/g2p_id">
    </a>
    <a href="https://discord.gg/gqwTPyPxa6">
        <img alt="chat on Discord" src="https://img.shields.io/discord/1001447685645148169?logo=discord">
    </a>
    <a href="https://github.com/bookbot-kids/g2p_id/blob/main/CONTRIBUTING.md">
        <img alt="contributing guidelines" src="https://img.shields.io/badge/contributing-guidelines-brightgreen">
    </a>
</p>

This library converts Indonesian (Bahasa Indonesia) graphemes (words) to phonemes in IPA. It follows the methods and design of the equivalent English library, [g2p](https://github.com/Kyubyong/g2p).

## Installation

```bash
pip install g2p_id_py
```

## How to Use

```py
from g2p_id import G2p

texts = [
    "Apel itu berwarna merah.",
    "Rahel bersekolah di S M A Jakarta 17.",
    "Mereka sedang bermain bola di lapangan.",
]

# Each call returns one list of phonemes per word (punctuation is kept as its own item).
g2p = G2p()
for text in texts:
    print(g2p(text))

# Output:
# [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'], ['m', 'e', 'r', 'a', 'h'], ['.']]
# [['r', 'a', 'h', 'e', 'l'], ['b', 'ə', 'r', 's', 'ə', 'k', 'o', 'l', 'a', 'h'], ['d', 'i'], ['e', 's'], ['e', 'm'], ['a'], ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'], ['t', 'u', 'dʒ', 'u', 'h'], ['b', 'ə', 'l', 'a', 's'], ['.']]
# [['m', 'ə', 'r', 'e', 'k', 'a'], ['s', 'ə', 'd', 'a', 'ŋ'], ['b', 'ə', 'r', 'm', 'a', 'i', 'n'], ['b', 'o', 'l', 'a'], ['d', 'i'], ['l', 'a', 'p', 'a', 'ŋ', 'a', 'n'], ['.']]
```
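The nested list returned for each sentence is often easier to read once it is flattened into a string. The snippet below is a minimal sketch of one way to do that; `to_text` is a hypothetical helper and not part of the library, and the input is copied from the first output line above so the snippet runs without the package installed.

```python
from typing import List

# First output line from the usage example, copied verbatim.
words: List[List[str]] = [
    ['a', 'p', 'ə', 'l'],
    ['i', 't', 'u'],
    ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'],
    ['m', 'e', 'r', 'a', 'h'],
    ['.'],
]


def to_text(phoneme_words: List[List[str]]) -> str:
    """Join the phonemes of each word, then separate words with spaces.

    Hypothetical helper for illustration; not provided by g2p_id.
    """
    return " ".join("".join(word) for word in phoneme_words)


print(to_text(words))
```

Punctuation stays as a separate token here, mirroring how it appears in the library output.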

## Algorithm

The algorithm is heavily inspired by the English [g2p](https://github.com/Kyubyong/g2p).

1. Spells out Arabic numerals and some currency symbols, e.g. `Rp 200,000 -> dua ratus ribu rupiah`. This step is borrowed from [Cahya's code](https://github.com/cahya-wirawan/text_processor).
2. Attempts to retrieve the correct pronunciation of homographs based on their [POS (part-of-speech) tags](#pos-tagging).
3. Looks up non-homographs in a lexicon (pronunciation dictionary). The lexicon was originally taken from [ipa-dict](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt), and we later modified it.
4. Predicts the pronunciation of out-of-vocabulary (OOV) words using either a [BERT model](https://huggingface.co/bookbot/id-g2p-bert) or an [LSTM model](https://huggingface.co/bookbot/id-g2p-lstm).
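
To make the first step concrete, here is a small standard-library sketch of number and currency spelling. It is not the code the library uses (that is borrowed from the text processor linked above); the function names, the regular expression, and the limit of 999,999 are all assumptions made for illustration.

```python
import re
from typing import List

UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]


def _below_1000(n: int) -> List[str]:
    """Spell 1..999 as a list of Indonesian words; returns [] for 0."""
    words: List[str] = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        words.append("seratus")
    elif hundreds > 1:
        words += [UNITS[hundreds], "ratus"]
    if rest == 10:
        words.append("sepuluh")
    elif rest == 11:
        words.append("sebelas")
    elif 12 <= rest <= 19:
        words += [UNITS[rest - 10], "belas"]
    else:  # rest is 0..9 or 20..99
        tens, ones = divmod(rest, 10)
        if tens:
            words += [UNITS[tens], "puluh"]
        if ones:
            words.append(UNITS[ones])
    return words


def spell_number(n: int) -> str:
    """Spell an integer in the range 0..999_999 (sketch-only limit)."""
    if n == 0:
        return UNITS[0]
    thousands, rest = divmod(n, 1000)
    words: List[str] = []
    if thousands == 1:
        words.append("seribu")
    elif thousands > 1:
        words += _below_1000(thousands) + ["ribu"]
    words += _below_1000(rest)
    return " ".join(words)


# Digits with optional "," or "." thousands separators (assumed input format).
NUMBER = r"\d+(?:[.,]\d{3})*"


def _spell(digits: str) -> str:
    value = int(re.sub(r"[.,]", "", digits))
    # Values outside the supported range are left as digits.
    return spell_number(value) if value < 1_000_000 else digits


def normalize(text: str) -> str:
    """Spell out 'Rp' amounts first, then any remaining numbers."""
    text = re.sub(rf"Rp\s*({NUMBER})",
                  lambda m: _spell(m.group(1)) + " rupiah", text)
    return re.sub(NUMBER, lambda m: _spell(m.group(0)), text)


print(normalize("Rp 200,000"))
print(normalize("Rahel bersekolah di S M A Jakarta 17."))
```

A production normalizer would also need to cover larger numbers, decimals, and other currencies, which this sketch deliberately leaves out.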

## Phoneme and Grapheme Sets

```python
graphemes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']
```
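
Because the phoneme set contains multi-character symbols such as `tʃ` and `dʒ`, a plain IPA string cannot be split character by character. The following is a minimal sketch of a greedy longest-match splitter over the set above; `split_phonemes` is a hypothetical helper, not a function offered by the library.

```python
from typing import List

# Phoneme inventory copied from the list above.
PHONEMES = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
            'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ',
            'ʃ', 'dʒ', 'x', 'ʔ']

# Try longer symbols first so 'dʒ' wins over 'd' and 'tʃ' over 't'.
_BY_LENGTH = sorted(PHONEMES, key=len, reverse=True)


def split_phonemes(ipa: str) -> List[str]:
    """Split an IPA string into symbols from PHONEMES (greedy, longest first)."""
    result: List[str] = []
    i = 0
    while i < len(ipa):
        for symbol in _BY_LENGTH:
            if ipa.startswith(symbol, i):
                result.append(symbol)
                i += len(symbol)
                break
        else:
            # No symbol in the inventory matches at this position.
            raise ValueError(f"unknown symbol at position {i}: {ipa[i]!r}")
    return result


print(split_phonemes("dʒakarta"))
```

The same check can double as a validator: any string that raises `ValueError` contains a symbol outside the inventory.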

## Implementation Details

For more details on how homographs and out-of-vocabulary prediction are handled, see the [documentation](https://bookbot-kids.github.io/g2p_id/algorithm/).
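
As a rough orientation before reading the documentation, the lookup order from the Algorithm section (homographs by POS tag, then the lexicon, then OOV prediction) can be sketched as below. Everything here is a toy assumption: the tiny tables, the tag names, the second "apel" pronunciation, and `naive_predict`, which merely stands in for the BERT or LSTM model. The input is expected to be already tokenized and tagged.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Phonemes = List[str]

# Toy resources invented for this sketch; the real tables are far larger.
LEXICON: Dict[str, Phonemes] = {
    "itu": ["i", "t", "u"],
    "merah": ["m", "e", "r", "a", "h"],
}
HOMOGRAPHS: Dict[str, Dict[str, Phonemes]] = {
    # word -> {POS tag -> pronunciation}; the tags and the second
    # pronunciation are placeholders, not real linguistic data.
    "apel": {"NOUN": ["a", "p", "ə", "l"], "VERB": ["a", "p", "e", "l"]},
}


def naive_predict(word: str) -> Phonemes:
    """Stand-in for the neural OOV model: one phoneme per letter."""
    return list(word)


def convert(
    tagged_words: Iterable[Tuple[str, str]],
    predict: Callable[[str], Phonemes] = naive_predict,
) -> List[Phonemes]:
    """Resolve each (word, tag) pair: homograph, then lexicon, then predictor."""
    result: List[Phonemes] = []
    for word, tag in tagged_words:
        word = word.lower()
        if word in HOMOGRAPHS and tag in HOMOGRAPHS[word]:
            result.append(HOMOGRAPHS[word][tag])
        elif word in LEXICON:
            result.append(LEXICON[word])
        else:
            result.append(predict(word))
    return result


print(convert([("Apel", "NOUN"), ("itu", "DET"), ("bola", "NOUN")]))
```

Passing the predictor as a parameter keeps the cascade independent of which model handles OOV words.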

## References

```bib
@misc{g2pE2019,
  author = {Park, Kyubyong and Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}
```

```bib
@misc{TextProcessor2021,
  author = {Cahya Wirawan},
  title = {Text Processor},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cahya-wirawan/text_processor}}
}
```

## Contributors

<a href="https://github.com/bookbot-kids/g2p_id/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=bookbot-kids/g2p_id" />
</a>

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/bookbot-kids/g2p_id",
    "name": "g2p-id-py",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "w11wo",
    "author_email": "wilson@bookbotkids.com",
    "download_url": "https://files.pythonhosted.org/packages/1a/82/7a158c9714a77299de714454ad9b1eeda18dfef3f8e4c4f8f898cea00671/g2p_id_py-0.4.2.tar.gz",
    "platform": "linux",
    "bugtrack_url": null,
    "license": "Apache License",
    "summary": "Indonesian G2P.",
    "version": "0.4.2",
    "project_urls": {
        "Homepage": "https://github.com/bookbot-kids/g2p_id"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "faa4cd99ba82de323f2084b59f0b7b55211fcb4a7126d4ecbdcee7608b9ed0c8",
                "md5": "a9f42331de4d99698178e6fda7ca5133",
                "sha256": "0b9ce7e76be2e8d1738c604751d05e0a8aa1b2ce2199fce27a115d350468e671"
            },
            "downloads": -1,
            "filename": "g2p_id_py-0.4.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a9f42331de4d99698178e6fda7ca5133",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 5125282,
            "upload_time": "2024-12-09T03:09:18",
            "upload_time_iso_8601": "2024-12-09T03:09:18.480435Z",
            "url": "https://files.pythonhosted.org/packages/fa/a4/cd99ba82de323f2084b59f0b7b55211fcb4a7126d4ecbdcee7608b9ed0c8/g2p_id_py-0.4.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1a827a158c9714a77299de714454ad9b1eeda18dfef3f8e4c4f8f898cea00671",
                "md5": "ba302cc15bd22047e78850f3085a33ea",
                "sha256": "16537ee23035895557cd90b9156390c3a7367a747c69b41bd68bf2b94ff56245"
            },
            "downloads": -1,
            "filename": "g2p_id_py-0.4.2.tar.gz",
            "has_sig": false,
            "md5_digest": "ba302cc15bd22047e78850f3085a33ea",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5113209,
            "upload_time": "2024-12-09T03:09:20",
            "upload_time_iso_8601": "2024-12-09T03:09:20.924420Z",
            "url": "https://files.pythonhosted.org/packages/1a/82/7a158c9714a77299de714454ad9b1eeda18dfef3f8e4c4f8f898cea00671/g2p_id_py-0.4.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-09 03:09:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bookbot-kids",
    "github_project": "g2p_id",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "num2words",
            "specs": []
        },
        {
            "name": "nltk",
            "specs": [
                [
                    "==",
                    "3.9.1"
                ]
            ]
        },
        {
            "name": "onnxruntime",
            "specs": []
        }
    ],
    "tox": true,
    "lcname": "g2p-id-py"
}
        