ipa-core


Name: ipa-core
Version: 0.1.3
Home page: https://github.com/Riccorl/ipa
Summary: NLP Preprocessing Pipeline Wrappers
Upload time: 2023-05-12 15:14:56
Author: Riccardo Orlando
Requires Python: >=3.9
License: Apache
Keywords: nlp deep learning transformer pytorch stanza spacy trankit preprocessing tokenization pos tagging lemmatization

<div align="center">

# 🍺IPA: import, preprocess, accelerate

[//]: # ([![Open in Visual Studio Code]&#40;https://open.vscode.dev/badges/open-in-vscode.svg&#41;]&#40;https://github.dev/Riccorl/ipa&#41;)
[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)
[![Stanza](https://img.shields.io/badge/1.4-Stanza-5f0a09?logo=stanza)](https://stanfordnlp.github.io/stanza/)
[![SpaCy](https://img.shields.io/badge/3.4.3-SpaCy-1a6f93?logo=spacy)](https://spacy.io/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)

[![Upload to PyPi](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml)
[![PyPi Version](https://img.shields.io/github/v/release/Riccorl/ipa)](https://github.com/Riccorl/ipa/releases)
[![DeepSource](https://deepsource.io/gh/Riccorl/ipa.svg/?label=active+issues&token=QC6Jty-YdgXjKh9mKZyeqa4I)](https://deepsource.io/gh/Riccorl/ipa/?ref=repository-badge)

</div>

## How to use

### Install

Install the library from [PyPI](https://pypi.org/project/ipa-core):

```bash
pip install ipa-core
```

### Usage

IPA is a Python library of preprocessing wrappers for Stanza and spaCy. It exposes a
unified API for both libraries, which makes them interchangeable.

Let's start with a simple example. Here we are using the `SpacyTokenizer` wrapper to preprocess a text: 

```python
from ipa import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""
```

You can load any spaCy model either by its canonical name (for example, `en_core_web_sm`) or by a short
alias such as `en`, as we did here. By default, an alias loads the smallest version of the corresponding
model. For a complete list of available models, see the [spaCy documentation](https://spacy.io/usage/models).
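
The alias behaviour described above can be pictured as a lookup with a pass-through fallback. The following is a minimal sketch only: the `ALIASES` table and the `resolve_model_name` helper are hypothetical names introduced for illustration, and the library's real alias table and resolution logic may differ.

```python
# Hypothetical alias table for illustration; the library's real table
# may contain different or additional entries.
ALIASES = {"en": "en_core_web_sm"}


def resolve_model_name(name: str) -> str:
    # A known alias maps to its canonical model name; anything else is
    # assumed to already be a canonical name and is returned unchanged.
    return ALIASES.get(name, name)


print(resolve_model_name("en"))
print(resolve_model_name("en_core_web_sm"))
```

Under this assumption, both calls resolve to the same small English model.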

In the very same way, you can load any model from Stanza using the `StanzaTokenizer` wrapper:

```python
from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""
```
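
Because both wrappers return tokens with the same attributes (`index`, `text`, `pos`, `lemma`), downstream code does not need to know which backend produced them. The sketch below illustrates this with a hypothetical `format_tokens` helper; the tokens are stand-ins built with `SimpleNamespace` so that the snippet runs without downloading any model.

```python
from types import SimpleNamespace


def format_tokens(tokens):
    """Render one aligned row per token, whichever tokenizer produced it."""
    rows = []
    for tok in tokens:
        # Fall back to "-" when POS tags or lemmas were not requested and
        # the attribute is missing or None.
        pos = getattr(tok, "pos", None) or "-"
        lemma = getattr(tok, "lemma", None) or "-"
        row = "{:<5} {:<10} {:<10} {:<10}".format(tok.index, tok.text, pos, lemma)
        rows.append(row.rstrip())
    return rows


# Stand-in tokens for illustration only; real ones would come from
# spacy_tokenizer(...) or stanza_tokenizer(...).
demo_tokens = [
    SimpleNamespace(index=0, text="Mary", pos="PROPN", lemma="Mary"),
    SimpleNamespace(index=1, text="sold", pos="VERB", lemma="sell"),
]
print("\n".join(format_tokens(demo_tokens)))
```

Passing the output of either `spacy_tokenizer(...)` or `stanza_tokenizer(...)` to such a helper should work the same way, as long as the tokens expose the attributes used in the examples above.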

For simpler scenarios, you can use the `WhitespaceTokenizer` wrapper, which just splits the text 
on whitespace:

```python
from ipa import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
"""
```
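
Note that the input in this example has a space before the final period. Plain whitespace splitting cannot separate punctuation from the word it is attached to, as the following standalone sketch shows. It is not the library's implementation: `SimpleToken` and `whitespace_tokenize` are hypothetical names that mimic only the `index` and `text` fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleToken:
    # Hypothetical stand-in for the library's token type, with only the
    # two fields used in the example above.
    index: int
    text: str


def whitespace_tokenize(text: str) -> List[SimpleToken]:
    # str.split() with no argument splits on any run of whitespace and
    # ignores leading/trailing whitespace, so no empty tokens appear.
    return [SimpleToken(i, piece) for i, piece in enumerate(text.split())]


for token in whitespace_tokenize("Mary sold the car to John ."):
    print("{:<5} {:<10}".format(token.index, token.text))

# Without the extra space, the period stays attached to the last word.
print([t.text for t in whitespace_tokenize("Mary sold the car to John.")])
```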

### Features

#### Complete preprocessing pipeline

`SpacyTokenizer` and `StanzaTokenizer` provide a unified API for both libraries, exposing most of their
features, like tokenization, Part-of-Speech tagging, lemmatization and dependency parsing. You can activate 
and deactivate any of these using `return_pos_tags`, `return_lemmas` and `return_deps`. So, for example,

```python
StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
```

creates a tokenizer that returns a list of `Token` objects with the `pos` and `lemma` fields filled, while

```python
StanzaTokenizer(language="en")
```

creates a tokenizer that returns a list of `Token` objects with only the `text` field filled.
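
To make the effect of these flags concrete, here is a toy sketch of the pattern. It is not the library's code: `ToyToken`, `ToyTokenizer` and `TOY_LEXICON` are hypothetical, and the hand-written lexicon stands in for a real tagger and lemmatizer. Fields that were not requested are simply left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyToken:
    index: int
    text: str
    pos: Optional[str] = None    # stays None unless POS tags are requested
    lemma: Optional[str] = None  # stays None unless lemmas are requested


# Tiny hand-written lexicon standing in for a real tagger and lemmatizer.
TOY_LEXICON = {"sold": ("VERB", "sell"), "car": ("NOUN", "car")}


class ToyTokenizer:
    def __init__(self, return_pos_tags: bool = False, return_lemmas: bool = False):
        self.return_pos_tags = return_pos_tags
        self.return_lemmas = return_lemmas

    def __call__(self, text: str) -> List[ToyToken]:
        tokens = []
        for i, word in enumerate(text.split()):
            # Unknown words get the placeholder tag "X" and themselves as lemma.
            pos, lemma = TOY_LEXICON.get(word, ("X", word))
            tokens.append(
                ToyToken(
                    index=i,
                    text=word,
                    pos=pos if self.return_pos_tags else None,
                    lemma=lemma if self.return_lemmas else None,
                )
            )
        return tokens


print(ToyTokenizer()("sold car"))
print(ToyTokenizer(return_pos_tags=True, return_lemmas=True)("sold car"))
```

Keeping a single token type with optional fields means calling code can stay the same whichever annotations are switched on.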

#### GPU support

With `use_gpu=True`, the library will use the GPU if it is available. To set up the environment for the GPU, 
refer to the [Stanza documentation](https://stanfordnlp.github.io/stanza/) and the 
[spaCy documentation](https://spacy.io/usage/gpu).
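
Before passing `use_gpu=True`, it can be handy to check whether a GPU is visible at all. Stanza runs on PyTorch, so one possible check, sketched below, asks PyTorch whether CUDA is available. The `cuda_available` helper is a hypothetical name, and spaCy's GPU support has its own requirements, which are covered in the documentation linked above.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed, so there is nothing to query
    import torch

    return bool(torch.cuda.is_available())


print("CUDA available:", cuda_available())
```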

## API

### Tokenizers

`SpacyTokenizer`

```python
class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```

`StanzaTokenizer`

```python
class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```

`WhitespaceTokenizer`

```python
class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):
```

### Sentence Splitter

`SpacySentenceSplitter`

```python
class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
```

            
