<div align="center">
# 🍺IPA: import, preprocess, accelerate
[//]: # ([![Open in Visual Studio Code](https://open.vscode.dev/badges/open-in-vscode.svg)](https://github.dev/Riccorl/ipa))
[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)
[![Stanza](https://img.shields.io/badge/1.4-Stanza-5f0a09?logo=stanza)](https://stanfordnlp.github.io/stanza/)
[![SpaCy](https://img.shields.io/badge/3.4.3-SpaCy-1a6f93?logo=spacy)](https://spacy.io/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)
[![Upload to PyPi](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/ipa/actions/workflows/python-publish-pypi.yml)
[![PyPi Version](https://img.shields.io/github/v/release/Riccorl/ipa)](https://github.com/Riccorl/ipa/releases)
[![DeepSource](https://deepsource.io/gh/Riccorl/ipa.svg/?label=active+issues&token=QC6Jty-YdgXjKh9mKZyeqa4I)](https://deepsource.io/gh/Riccorl/ipa/?ref=repository-badge)
</div>
## How to use
### Install
Install the library from [PyPI](https://pypi.org/project/ipa-core):
```bash
pip install ipa-core
```
### Usage
IPA is a Python library that wraps the preprocessing pipelines of Stanza and spaCy behind a single, unified API, making the two libraries interchangeable.

Let's start with a simple example. Here we use the `SpacyTokenizer` wrapper to preprocess a text:
```python
from ipa import SpacyTokenizer
spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))
"""
0     Mary       PROPN      Mary
1     sold       VERB       sell
2     the        DET        the
3     car        NOUN       car
4     to         ADP        to
5     John       PROPN      John
6     .          PUNCT      .
"""
```
You can load any spaCy model either by its canonical name, such as `en_core_web_sm`, or by a simple alias like `en`, as we did here. By default, the alias loads the smallest version of the model. For a complete list of available models, see the [spaCy documentation](https://spacy.io/usage/models).
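If you prefer the canonical name, it can be passed in the same way. A minimal sketch, assuming the `language` argument accepts canonical model names and that the model is already installed (e.g. via `python -m spacy download en_core_web_sm`):

```python
from ipa import SpacyTokenizer

# Assumption: `language` also accepts a canonical spaCy model name;
# the model must already be downloaded.
spacy_tokenizer = SpacyTokenizer(language="en_core_web_sm", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
```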
In the same way, you can load any Stanza model using the `StanzaTokenizer` wrapper:
```python
from ipa import StanzaTokenizer
stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))
"""
0     Mary       PROPN      Mary
1     sold       VERB       sell
2     the        DET        the
3     car        NOUN       car
4     to         ADP        to
5     John       PROPN      John
6     .          PUNCT      .
"""
```
For simpler scenarios, you can use the `WhitespaceTokenizer` wrapper, which just splits the text on whitespace:
```python
from ipa import WhitespaceTokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))
"""
0     Mary
1     sold
2     the
3     car
4     to
5     John
6     .
"""
```
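Since all three wrappers share the same call interface, code written against one works unchanged with the others. A minimal sketch of this interchangeability, using only the `index` and `text` fields that all the examples above rely on:

```python
from ipa import SpacyTokenizer, StanzaTokenizer, WhitespaceTokenizer

def tokens_of(tokenizer, text):
    # Any of the wrappers can be passed in: they are all callable on a string
    # and return objects exposing at least `index` and `text`.
    return [word.text for word in tokenizer(text)]

# Pre-spaced punctuation so the whitespace tokenizer matches the others.
text = "Mary sold the car to John ."
for tokenizer in (SpacyTokenizer(language="en"), StanzaTokenizer(language="en"), WhitespaceTokenizer()):
    print(type(tokenizer).__name__, tokens_of(tokenizer, text))
```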
### Features
#### Complete preprocessing pipeline
`SpacyTokenizer` and `StanzaTokenizer` provide a unified API for both libraries, exposing most of their features, such as tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can enable or disable each of these with `return_pos_tags`, `return_lemmas`, and `return_deps`. So, for example,
```python
StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
```
returns a list of `Token` objects with the `pos` and `lemma` fields filled, while
```python
StanzaTokenizer(language="en")
```
returns a list of `Token` objects with only the `text` field filled.
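As a quick check of this behavior, a sketch assuming that unset fields are simply left empty (e.g. `None`):

```python
from ipa import StanzaTokenizer

bare_tokenizer = StanzaTokenizer(language="en")
for word in bare_tokenizer("Mary sold the car to John."):
    # Only `index` and `text` are populated here; `pos` and `lemma` are
    # assumed to be left empty (e.g. None) with the default flags.
    print(word.index, word.text, word.pos, word.lemma)
```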
#### GPU support
With `use_gpu=True`, the library will use the GPU if it is available. To set up the environment for the GPU,
refer to the [Stanza documentation](https://stanfordnlp.github.io/stanza/) and the
[spaCy documentation](https://spacy.io/usage/gpu).
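For example, you can tie the flag to whatever hardware PyTorch actually sees (a minimal sketch):

```python
import torch
from ipa import SpacyTokenizer

# Request the GPU only when PyTorch can actually see one.
spacy_tokenizer = SpacyTokenizer(language="en", use_gpu=torch.cuda.is_available())
```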
## API
### Tokenizers
`SpacyTokenizer`
```python
class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```
`StanzaTokenizer`
```python
class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```
`WhitespaceTokenizer`
```python
class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):
```
### Sentence Splitter
`SpacySentenceSplitter`
```python
class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
```
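A hypothetical usage sketch, assuming the splitter is callable like the tokenizer wrappers and returns the input text split into sentences:

```python
from ipa import SpacySentenceSplitter

splitter = SpacySentenceSplitter(language="en", model_type="statistical")
# Assumed call style, mirroring the tokenizer wrappers above.
sentences = splitter("Mary sold the car to John. John drove it home.")
for sentence in sentences:
    print(sentence)
```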