<img src="https://raw.githubusercontent.com/ljvmiranda921/calamanCy/master/logo.png" width="125" height="125" align="right" />
# calamanCy: NLP pipelines for Tagalog
[![example workflow](https://github.com/ljvmiranda921/calamancy/actions/workflows/test.yml/badge.svg)](https://github.com/ljvmiranda921/calamanCy/actions/workflows/test.yml)
[![PyPI](https://img.shields.io/pypi/v/calamancy?labelColor=%23272c32&color=%2333cc56&logo=pypi&logoColor=white)](https://pypi.org/project/calamanCy/)
**calamanCy** is a Tagalog natural language processing framework built with
[spaCy](https://spacy.io). Its goal is to provide pipelines and datasets for
downstream NLP tasks. This repository contains material for using calamanCy,
reproducing its results, and guides on usage.
> calamanCy takes inspiration from other language-specific [spaCy Universe frameworks](https://spacy.io/universe) such as
> [DaCy](https://github.com/centre-for-humanities-computing/DaCy), [huSpaCy](https://github.com/huspacy/huspacy),
> and [graCy](https://github.com/jmyerston/graCy). The name is based on [*calamansi*](https://en.wikipedia.org/wiki/Calamansi),
> a citrus fruit native to the Philippines and used in traditional Filipino cuisine.
## 🔧 Installation
To get started with calamanCy, install it with `pip` by running the
following command in your terminal:
```sh
pip install calamanCy
```
### Development
If you are developing calamanCy, first clone the repository:
```sh
git clone git@github.com:ljvmiranda921/calamanCy.git
```
Then, create a virtual environment and install the dependencies:
```sh
python -m venv venv
venv/bin/pip install -e . # requires pip>=23.0
venv/bin/pip install .[dev]
# Activate the virtual environment
source venv/bin/activate
```
Alternatively, run `make dev`.
### Running the tests
We use [pytest](https://docs.pytest.org/en/7.4.x/) as our test runner:
```sh
python -m pytest --pyargs calamancy
```
## 👩‍💻 Usage
To use calamanCy you first have to download either the medium, large, or
transformer model. To see a list of all available models, run:
```python
import calamancy

# Print the name of every available model
for model in calamancy.models():
    print(model)

# ..
# tl_calamancy_md-0.1.0
# tl_calamancy_lg-0.1.0
# tl_calamancy_trf-0.1.0
```
To download and load a model, run:
```python
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz")
```
The `nlp` object is an instance of spaCy's [`Language`
class](https://spacy.io/api/language), so you can use it like any other spaCy
pipeline. You can also [access these models on Hugging Face](https://huggingface.co/ljvmiranda921) 🤗.
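
Because the pipeline returns a regular spaCy `Doc`, the usual token attributes (`text`, `pos_`, `ent_type_`) are available. The sketch below shows one way to collect them into rows. The helper name `token_rows` is hypothetical, and the stand-in tokens carry invented values so the example runs without downloading a model; they are not actual model output.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def token_rows(doc: Iterable) -> List[Tuple[str, str, str]]:
    """Collect (text, coarse POS tag, entity type) for each token in a Doc-like object."""
    return [(token.text, token.pos_, token.ent_type_) for token in doc]


# Stand-in tokens with invented annotations, used only to keep the sketch
# self-contained. With a real pipeline, pass `nlp("Ako si Juan de la Cruz")`.
fake_doc = [
    SimpleNamespace(text="Ako", pos_="PRON", ent_type_=""),
    SimpleNamespace(text="si", pos_="ADP", ent_type_=""),
    SimpleNamespace(text="Juan", pos_="PROPN", ent_type_="PER"),
]

for row in token_rows(fake_doc):
    print(row)
```

For anything beyond simple inspection, working with the `Doc` object directly is likely the more flexible route.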
## 📦 Models and Datasets
calamanCy provides Tagalog models and datasets that you can use in your spaCy
pipelines. You can download them directly or use the `calamancy` Python library
to access them. The training procedure for each pipeline can be found in the
`models/` directory, which is further subdivided by version. Each folder is
an instance of a [spaCy project](https://spacy.io/usage/projects).
Here are the models for the latest release:
| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys) |
| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k) |
| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
## 📓 API
The calamanCy library contains utility functions that help you load its models
and run inference on your text. You can think of these functions as "syntactic
sugar" for the spaCy API. We highly recommend checking out the [spaCy Doc
object](https://spacy.io/api/doc), as it provides the most flexibility.
### Loaders
The loader functions provide an easier interface for downloading calamanCy models.
These models are hosted on [Hugging Face](https://huggingface.co/ljvmiranda921),
so you can try them out before downloading.
#### <kbd>function</kbd> `get_latest_version`
Return the latest version of a calamanCy model.
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| `model` | `str` | The string indicating the model. |
| **RETURNS** | `str` | The latest version of the model. |
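
To illustrate what "latest version" means for identifiers of the form `name-version` (as printed by `calamancy.models()`), here is a minimal sketch of one way to pick the highest version from a list. It is not the library's implementation: `split_model_id` and `latest_version` are hypothetical helpers, and the `0.2.0` entry is made up for the example.

```python
from typing import Iterable, Tuple


def split_model_id(model_id: str) -> Tuple[str, Tuple[int, ...]]:
    """Split 'tl_calamancy_md-0.1.0' into ('tl_calamancy_md', (0, 1, 0))."""
    name, _, version = model_id.rpartition("-")
    return name, tuple(int(part) for part in version.split("."))


def latest_version(model: str, model_ids: Iterable[str]) -> str:
    """Return the highest version string among identifiers that match `model`."""
    versions = [v for name, v in map(split_model_id, model_ids) if name == model]
    if not versions:
        raise ValueError(f"No versions found for model: {model}")
    # Tuples of integers compare element by element, so (0, 10, 0) > (0, 9, 0).
    return ".".join(str(part) for part in max(versions))


# Hypothetical identifiers; the 0.2.0 entry is invented for illustration.
available = ["tl_calamancy_md-0.1.0", "tl_calamancy_md-0.2.0", "tl_calamancy_lg-0.1.0"]
print(latest_version("tl_calamancy_md", available))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as `"0.10.0" < "0.9.0"`.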
#### <kbd>function</kbd> `models`
Get a list of valid calamanCy models.
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| **RETURNS** | `List[str]` | List of valid calamanCy models |
#### <kbd>function</kbd> `load`
Load a calamanCy model as a [spaCy language pipeline](https://spacy.io/usage/processing-pipelines).
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| `model` | `str` | The model to download. See the available models at [`calamancy.models()`](#function-models). |
| `force` | `bool` | Force download the model. Defaults to `False`. |
| `**kwargs` | `dict` | Additional arguments to `spacy.load()`. |
| **RETURNS** | [`Language`](https://spacy.io/api/language) | A spaCy language pipeline. |
### Inference
Below are lightweight utility classes for users who are not familiar with spaCy's
primitives. They are only useful for inference and not for training. If you wish
to train on top of these calamanCy models (e.g., text categorization,
task-specific NER, etc.), we advise you to follow the standard [spaCy training
workflow](https://spacy.io/usage/training).
General usage: first, you need to instantiate a class with the name of a model.
Then, you can use the `__call__` method to perform the prediction. The output
is of the type `Iterable[Tuple[str, Any]]` where the first part of the tuple
is the token and the second part is its label.
#### <kbd>method</kbd> `EntityRecognizer.__call__`
Perform named entity recognition (NER). By default, it uses v0.1.0 of
[TLUnified-NER](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)
with the following entity labels: *PER (Person), ORG (Organization), LOC
(Location).*
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| `text` | `str` | The text to get the entities from. |
| **YIELDS** | `Iterable[Tuple[str, str]]` | the token and its entity in IOB format. |
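
Token-level IOB output is often easier to consume once it is merged into entity spans. The sketch below shows one way to do that. It assumes the labels look like `B-PER`, `I-PER`, and `O`, which is an assumption about the exact string format rather than something this README specifies; `group_entities` is a hypothetical helper and the sample pairs are hand-written, not model output.

```python
from typing import Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity text, label) spans."""
    spans: List[Tuple[List[str], str]] = []
    inside = False  # True while the previous token belonged to an entity
    for token, tag in pairs:
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A B- tag always opens a new span.
            spans.append(([token], label))
            inside = True
        elif prefix == "I" and inside and spans[-1][1] == label:
            # An I- tag extends the current span if the label matches.
            spans[-1][0].append(token)
        else:
            # "O" tags and stray I- tags close any open span.
            inside = False
    return [(" ".join(tokens), label) for tokens, label in spans]


# Hand-written pairs in the assumed format.
pairs = [
    ("Ako", "O"), ("si", "O"),
    ("Juan", "B-PER"), ("de", "I-PER"), ("la", "I-PER"), ("Cruz", "I-PER"),
]
print(group_entities(pairs))
```

If you work with the spaCy `Doc` directly, `doc.ents` already provides the merged spans, so a helper like this is only needed for the tuple output.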
#### <kbd>method</kbd> `Tagger.__call__`
Perform part-of-speech tagging. It uses the annotations from the
[TRG](https://universaldependencies.org/treebanks/tl_trg/index.html) and
[Ugnayan](https://universaldependencies.org/treebanks/tl_ugnayan/index.html)
treebanks with the following tags: *ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, PART,
PRON, PROPN, PUNCT, SCONJ, VERB.*
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| `text` | `str` | The text to get the POS tags from. |
| **YIELDS** | `Iterable[Tuple[str, Tuple[str, str]]]` | the token and its coarse- and fine-grained POS tag. |
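
As a small illustration of consuming the nested tuples described above, the sketch below counts coarse-grained tags. `coarse_tag_counts` is a hypothetical helper, and the sample data is hand-written with a placeholder in the fine-grained slot, since the README does not list the fine-grained tag values.

```python
from collections import Counter
from typing import Iterable, Tuple


def coarse_tag_counts(tagged: Iterable[Tuple[str, Tuple[str, str]]]) -> Counter:
    """Count how often each coarse-grained POS tag appears."""
    return Counter(coarse for _token, (coarse, _fine) in tagged)


# Hand-written sample; "<fine>" is a placeholder, not a real tag value.
tagged = [
    ("Ako", ("PRON", "<fine>")),
    ("si", ("ADP", "<fine>")),
    ("Juan", ("PROPN", "<fine>")),
    ("Cruz", ("PROPN", "<fine>")),
]
print(coarse_tag_counts(tagged).most_common())
```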
#### <kbd>method</kbd> `Parser.__call__`
Perform syntactic dependency parsing. It uses the annotations from the
[TRG](https://universaldependencies.org/treebanks/tl_trg/index.html) and
[Ugnayan](https://universaldependencies.org/treebanks/tl_ugnayan/index.html) treebanks.
| Argument | Type | Description |
| ----------- | ----- | ---------------------- |
| `text` | `str` | The text to get the dependency relations from. |
| **YIELDS** | `Iterable[Tuple[str, str]]` | the token and its dependency relation. |
## 📝 Reporting Issues
If you have questions regarding the usage of `calamanCy`, bug reports, or just
want to give us feedback after giving it a spin, please use the [Issue
tracker](https://github.com/ljvmiranda921/calamancy/issues). Thank you!
## 📜 Citation
If you are citing the open-source software, please use:
```
@misc{miranda2023calamancy,
title={{calamanCy: A Tagalog Natural Language Processing Toolkit}},
author={Lester James V. Miranda},
year={2023},
eprint={2311.07171},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
If you are citing the [NER dataset](https://huggingface.co/ljvmiranda921), please use:
```
@misc{miranda2023developing,
title={{Developing a Named Entity Recognition Dataset for Tagalog}},
author={Lester James V. Miranda},
year={2023},
eprint={2311.07161},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Raw data
{
"_id": null,
"home_page": "",
"name": "calamanCy",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "nlp,natural language processing,language technology,tagalog",
"author": "",
"author_email": "\"Lj V. Miranda\" <ljvmiranda@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/19/89/b0d65b923be34afa5f60f1027a4d635d71a6f3b5fa8f516e012baeff2560/calamanCy-0.1.2.tar.gz",
"platform": null,
"description": "<img src=\"https://raw.githubusercontent.com/ljvmiranda921/calamanCy/master/logo.png\" width=\"125\" height=\"125\" align=\"right\" />\n\n# calamanCy: NLP pipelines for Tagalog\n\n[![example workflow](https://github.com/ljvmiranda921/calamancy/actions/workflows/test.yml/badge.svg)](https://github.com/ljvmiranda921/calamanCy/actions/workflows/test.yml)\n[![PyPI](https://img.shields.io/pypi/v/calamancy?labelColor=%23272c32&color=%2333cc56&logo=pypi&logoColor=white)](https://pypi.org/project/calamanCy/)\n\n\n\n**calamanCy** is a Tagalog natural language preprocessing framework made with\n[spaCy](https://spacy.io). Its goal is to provide pipelines and datasets for\ndownstream NLP tasks. This repository contains material for using calamanCy,\nreproduction of results, and guides on usage. \n\n> calamanCy takes inspiration from other language-specific [spaCy Universe frameworks](https://spacy.io/universe) such as \n> [DaCy](https://github.com/centre-for-humanities-computing/DaCy), [huSpaCy](https://github.com/huspacy/huspacy),\n> and [graCy](https://github.com/jmyerston/graCy). The name is based from [*calamansi*](https://en.wikipedia.org/wiki/Calamansi),\n> a citrus fruit native to the Philippines and used in traditional Filipino cuisine.\n\n## \ud83d\udd27 Installation\n\nTo get started with calamanCy, simply install it using `pip` by running the\nfollowing line in your terminal:\n\n```sh\npip install calamanCy\n``` \n\n### Development\n\nIf you are developing calamanCy, first clone the repository:\n\n```sh\ngit clone git@github.com:ljvmiranda921/calamanCy.git\n```\n\nThen, create a virtual environment and install the dependencies:\n\n```sh\npython -m venv venv\nvenv/bin/pip install -e . 
# requires pip>=23.0\nvenv/bin/pip install .[dev]\n\n# Activate the virtual environment\nsource venv/bin/activate\n```\n\nor alternatively, use `make dev`.\n\n### Running the tests\n\nWe use [pytest](https://docs.pytest.org/en/7.4.x/) as our test runner:\n\n```sh\npython -m pytest --pyargs calamancy\n```\n\n\n## \ud83d\udc69\u200d\ud83d\udcbb Usage\n\nTo use calamanCy you first have to download either the medium, large, or\ntransformer model. To see a list of all available models, run:\n\n```python\nimport calamancy\nfrom model in calamancy.models():\n print(model)\n\n# ..\n# tl_calamancy_md-0.1.0\n# tl_calamancy_lg-0.1.0\n# tl_calamancy_trf-0.1.0\n```\n\nTo download and load a model, run:\n\n```python\nnlp = calamancy.load(\"tl_calamancy_md-0.1.0\")\ndoc = nlp(\"Ako si Juan de la Cruz\")\n```\n\nThe `nlp` object is an instance of spaCy's [`Language`\nclass](https://spacy.io/api/language) and you can use it as any other spaCy\npipeline. You can also [access these models on Hugging Face](https://huggingface.co/ljvmiranda921) \ud83e\udd17.\n\n## \ud83d\udce6 Models and Datasets\n\ncalamanCy provides Tagalog models and datasets that you can use in your spaCy\npipelines. You can download them directly or use the `calamancy` Python library\nto access them. The training procedure for each pipeline can be found in the\n`models/` directory. They are further subdivided into versions. Each folder is\nan instance of a [spaCy project](https://spacy.io/usage/projects).\n\nHere are the models for the latest release:\n\n| Model | Pipelines | Description |\n|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|\n| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. 
Using floret vectors (50k keys) |\n| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k) |\n| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |\n\n## \ud83d\udcd3 API\n\nThe calamanCy library contains utility functions that help you load its models\nand infer on your text. You can think of these functions as \"syntactic sugar\"\nto the spaCy API. We highly recommend checking out the [spaCy Doc\nobject](https://spacy.io/api/doc), as it provides the most flexibility.\n\n### Loaders\n\nThe loader functions provide an easier interface to download calamanCy models.\nThese models are hosted on [HuggingFace](https://huggingface.co/ljvmiranda921)\nso you can try them out first before downloading.\n\n#### <kbd>function</kbd> `get_latest_version`\n\nReturn the latest version of a calamanCy model.\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| `model` | `str` | The string indicating the model. |\n| **RETURNS** | `str` | The latest version of the model. |\n\n\n#### <kbd>function</kbd> `models`\n\nGet a list of valid calamanCy models.\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| **RETURNS** | `List[str]` | List of valid calamanCy models |\n\n\n#### <kbd>function</kbd> `load`\n\nLoad a calamanCy model as a [spaCy language pipeline](https://spacy.io/usage/processing-pipelines).\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| `model` | `str` | The model to download. See the available models at [`calamancy.models()`](#function-models). |\n| `force` | `bool` | Force download the model. Defaults to `False`. |\n| `**kwargs` | `dict` | Additional arguments to `spacy.load()`. 
|\n| **RETURNS** | [`Language`](https://spacy.io/api/language) | A spaCy language pipeline. |\n\n\n### Inference\n\nBelow are lightweight utility classes for users who are not familiar with spaCy's\nprimitives. They are only useful for inference and not for training. If you wish\nto train on top of these calamanCy models (e.g., text categorization,\ntask-specific NER, etc.), we advise you to follow the standard [spaCy training\nworkflow](https://spacy.io/usage/training).\n\nGeneral usage: first, you need to instantiate a class with the name of a model.\nThen, you can use the `__call__` method to perform the prediction. The output\nis of the type `Iterable[Tuple[str, Any]]` where the first part of the tuple\nis the token and the second part is its label.\n\n#### <kbd>method</kbd> `EntityRecognizer.__call__`\n\nPerform named entity recognition (NER). By default, it uses the v0.1.0 of\n[TLUnified-NER](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)\nwith the following entity labels: *PER (Person), ORG (Organization), LOC\n(Location).*\n\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| `text` | `str` | The text to get the entities from. |\n| **YIELDS** | `Iterable[Tuple[str, str]]` | the token and its entity in IOB format. |\n\n#### <kbd>method</kbd> `Tagger.__call__`\n\nPerform parts-of-speech tagging. It uses the annotations from the\n[TRG](https://universaldependencies.org/treebanks/tl_trg/index.html) and\n[Ugnayan](https://universaldependencies.org/treebanks/tl_ugnayan/index.html)\ntreebanks with the following tags: *ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, PART,\nPRON, PROPN, PUNCT, SCONJ, VERB.*\n\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| `text` | `str` | The text to get the POS tags from. |\n| **YIELDS** | `Iterable[Tuple[str, Tuple[str, str]]]` | the token and its coarse- and fine-grained POS tag. 
|\n\n#### <kbd>method</kbd> `Parser.__call__`\n\nPerform syntactic dependency parsing. It uses the annotations from the\n[TRG](https://universaldependencies.org/treebanks/tl_trg/index.html) and\n[Ugnayan](https://universaldependencies.org/treebanks/tl_ugnayan/index.html) treebanks.\n\n\n| Argument | Type | Description |\n| ----------- | ----- | ---------------------- |\n| `text` | `str` | The text to get the dependency relations from. |\n| **YIELDS** | `Iterable[Tuple[str, str]]` | the token and its dependency relation. |\n\n\n## \ud83d\udcdd Reporting Issues\n\nIf you have questions regarding the usage of `calamanCy`, bug reports, or just\nwant to give us feedback after giving it a spin, please use the [Issue\ntracker](https://github.com/ljvmiranda921/calamancy/issues). Thank you!\n\n\n## \ud83d\udcdc Citation\n\nIf you are citing the open-source software, please use:\n\n```\n@misc{miranda2023calamancy,\n title={{calamanCy: A Tagalog Natural Language Processing Toolkit}}, \n author={Lester James V. Miranda},\n year={2023},\n eprint={2311.07171},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n```\n\nIf you are citing the [NER dataset](https://huggingface.co/ljvmiranda921), please use:\n\n```\n@misc{miranda2023developing,\n title={{Developing a Named Entity Recognition Dataset for Tagalog}}, \n author={Lester James V. Miranda},\n year={2023},\n eprint={2311.07161},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n```\n",
"bugtrack_url": null,
"license": "MIT License",
"summary": "NLP Pipelines for Tagalog",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/ljvmiranda921/calamanCy/issues",
"Release Notes": "https://github.com/ljvmiranda921/calamanCy/releases",
"Repository": "https://github.com/ljvmiranda921/calamanCy"
},
"split_keywords": [
"nlp",
"natural language processing",
"language technology",
"tagalog"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b7b49b39ecdbcf9c8d4691f06f0be36fe430d197eedd707887d0ffb0e73fecb7",
"md5": "6ef376611a218a3711124be7cf3a7157",
"sha256": "ecfc4adcff05e0c9cda07a0a4285a7b7bf1cac703686f4be3708e1cd0786519b"
},
"downloads": -1,
"filename": "calamanCy-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6ef376611a218a3711124be7cf3a7157",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 8105,
"upload_time": "2023-12-04T08:28:26",
"upload_time_iso_8601": "2023-12-04T08:28:26.324338Z",
"url": "https://files.pythonhosted.org/packages/b7/b4/9b39ecdbcf9c8d4691f06f0be36fe430d197eedd707887d0ffb0e73fecb7/calamanCy-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1989b0d65b923be34afa5f60f1027a4d635d71a6f3b5fa8f516e012baeff2560",
"md5": "fb4512c50aec42ff817b7aa1843b9abe",
"sha256": "5c5c25121a61a6fad8df56e3a335223bd0a202dd065bafe79877825c068a574f"
},
"downloads": -1,
"filename": "calamanCy-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "fb4512c50aec42ff817b7aa1843b9abe",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 7870,
"upload_time": "2023-12-04T08:28:28",
"upload_time_iso_8601": "2023-12-04T08:28:28.089234Z",
"url": "https://files.pythonhosted.org/packages/19/89/b0d65b923be34afa5f60f1027a4d635d71a6f3b5fa8f516e012baeff2560/calamanCy-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-04 08:28:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ljvmiranda921",
"github_project": "calamanCy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "calamancy"
}