# AraNizer
## Description
`AraNizer` is a toolkit of custom tokenizers tailored for Arabic language processing. Built on methods such as SentencePiece and Byte Pair Encoding (BPE), the tokenizers are designed to plug directly into the `transformers` and `sentence_transformers` libraries. The `AraNizer` suite offers a range of tokenizers, each optimized for distinct NLP tasks and offering different vocabulary sizes to cover a wide range of linguistic applications.
## Key Features
- **Versatile Tokenization:** Supports multiple tokenization strategies (BPE, SentencePiece) for varied NLP tasks.
- **Broad Vocabulary Range:** Customizable tokenizers with vocabulary sizes ranging from 32k to 86k.
- **Seamless Integration:** Compatible with popular libraries like transformers and sentence_transformers.
- **Optimized for Arabic:** Specifically engineered for the intricacies of the Arabic language.
## Installation
Install AraNizer effortlessly with pip:
```bash
pip install aranizer
```
## Usage
### Importing Tokenizers
Import your desired tokenizer from AraNizer. Available tokenizers include:
- BPE variants: `get_bpe` with keys `bpe32`, `bpe50`, `bpe64`, `bpe86`, `bpe32T`, `bpe50T`, `bpe64T`, `bpe86T`
- SentencePiece variants: `get_sp` with keys `sp32`, `sp50`, `sp64`, `sp86`, `sp32T`, `sp50T`, `sp64T`, `sp86T`
```python
from aranizer import get_bpe, get_sp # Import functions to retrieve tokenizers
# Example for importing a BPE tokenizer
bpe_tokenizer = get_bpe("bpe32") # Replace with your chosen tokenizer key
# Example for importing a SentencePiece tokenizer
sp_tokenizer = get_sp("sp32") # Replace with your chosen tokenizer key
```
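The objects returned by `get_bpe` and `get_sp` are intended for use with the `transformers` library, so they should expose the usual Hugging Face tokenizer interface. A minimal sketch for inspecting a loaded tokenizer, assuming the standard `vocab_size` attribute is available (not confirmed by the AraNizer docs):

```python
from aranizer import get_bpe, get_sp

bpe_tokenizer = get_bpe("bpe32")
sp_tokenizer = get_sp("sp32")

# Assumption: the returned objects follow the Hugging Face tokenizer API,
# so vocab_size reports the number of entries in the trained vocabulary.
print("BPE vocab size:", bpe_tokenizer.vocab_size)
print("SentencePiece vocab size:", sp_tokenizer.vocab_size)
```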
### Tokenizing Text
Tokenize Arabic text using the selected tokenizer:
```python
text = "مثال على النص العربي" # Example Arabic text
# Using BPE tokenizer
bpe_tokens = bpe_tokenizer.tokenize(text)
print(bpe_tokens)
# Using SentencePiece tokenizer
sp_tokens = sp_tokenizer.tokenize(text)
print(sp_tokens)
```
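If the tokenizers follow the standard `transformers` interface, the token strings above can also be mapped to integer IDs and back with `convert_tokens_to_ids` / `convert_ids_to_tokens`. A hedged sketch under that assumption, continuing from the snippet above:

```python
# Assumption: the standard Hugging Face token <-> id conversion methods
# are available on the AraNizer tokenizers.
bpe_ids = bpe_tokenizer.convert_tokens_to_ids(bpe_tokens)
print(bpe_ids)

# Round-trip the ids back to token strings as a quick sanity check.
print(bpe_tokenizer.convert_ids_to_tokens(bpe_ids))
```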
### Encoding and Decoding
Encode text into token IDs and decode them back to text.
**Encoding:** To encode text, use the `encode` method:
```python
text = "مثال على النص العربي" # Example Arabic text
# Using BPE tokenizer
encoded_bpe_output = bpe_tokenizer.encode(text, add_special_tokens=True)
print(encoded_bpe_output)
# Using SentencePiece tokenizer
encoded_sp_output = sp_tokenizer.encode(text, add_special_tokens=True)
print(encoded_sp_output)
```
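For model input you often need a batch of sentences encoded together with attention masks, not just a single list of IDs. If the AraNizer tokenizers are callable like standard `transformers` tokenizers, a batch call should return that structure; this is a sketch under that assumption (not confirmed by the AraNizer docs):

```python
sentences = [
    "مثال على النص العربي",
    "جملة عربية ثانية أطول قليلا",
]

# Assumption: the tokenizer is callable like a standard Hugging Face tokenizer
# and returns input_ids plus an attention_mask for each sentence.
batch = bpe_tokenizer(sentences)
print(batch["input_ids"])
print(batch["attention_mask"])
```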
**Decoding:** To convert token IDs back to text, use the `decode` method:
```python
# Using BPE tokenizer
decoded_bpe_text = bpe_tokenizer.decode(encoded_bpe_output)
print(decoded_bpe_text)
# Using SentencePiece tokenizer
decoded_sp_text = sp_tokenizer.decode(encoded_sp_output)
print(decoded_sp_text)
```
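Because the sequences were encoded with `add_special_tokens=True`, the decoded strings may still contain special-token markers. If the standard `transformers` `decode` signature applies, these can be dropped with `skip_special_tokens`; a sketch assuming that parameter is supported:

```python
# Assumption: decode() accepts skip_special_tokens, as in standard
# Hugging Face tokenizers, and strips tokens added during encoding.
clean_bpe_text = bpe_tokenizer.decode(encoded_bpe_output, skip_special_tokens=True)
print(clean_bpe_text)

clean_sp_text = sp_tokenizer.decode(encoded_sp_output, skip_special_tokens=True)
print(clean_sp_text)
```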
## Available Tokenizers
- `get_bpe("bpe32")`: BPE tokenizer, 32k vocabulary
- `get_bpe("bpe50")`: BPE tokenizer, 50k vocabulary
- `get_bpe("bpe64")`: BPE tokenizer, 64k vocabulary
- `get_bpe("bpe86")`: BPE tokenizer, 86k vocabulary
- `get_bpe("bpe32T")`: BPE tokenizer, 32k vocabulary, with Tashkeel (diacritics)
- `get_bpe("bpe50T")`: BPE tokenizer, 50k vocabulary, with Tashkeel (diacritics)
- `get_bpe("bpe64T")`: BPE tokenizer, 64k vocabulary, with Tashkeel (diacritics)
- `get_bpe("bpe86T")`: BPE tokenizer, 86k vocabulary, with Tashkeel (diacritics)
- `get_sp("sp32")`: SentencePiece tokenizer, 32k vocabulary
- `get_sp("sp50")`: SentencePiece tokenizer, 50k vocabulary
- `get_sp("sp64")`: SentencePiece tokenizer, 64k vocabulary
- `get_sp("sp86")`: SentencePiece tokenizer, 86k vocabulary
- `get_sp("sp32T")`: SentencePiece tokenizer, 32k vocabulary, with Tashkeel (diacritics)
- `get_sp("sp50T")`: SentencePiece tokenizer, 50k vocabulary, with Tashkeel (diacritics)
- `get_sp("sp64T")`: SentencePiece tokenizer, 64k vocabulary, with Tashkeel (diacritics)
- `get_sp("sp86T")`: SentencePiece tokenizer, 86k vocabulary, with Tashkeel (diacritics)
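To compare how the variants segment the same text, for example when choosing a vocabulary size or deciding whether you need a Tashkeel-aware model, you can load several keys in a loop. A small sketch using only the documented loader functions and `tokenize` method; it assumes each key listed above loads without extra arguments:

```python
from aranizer import get_bpe, get_sp

text = "مثال على النص العربي"

# Compare token counts across the base BPE vocabularies.
for key in ["bpe32", "bpe50", "bpe64", "bpe86"]:
    tok = get_bpe(key)
    print(key, len(tok.tokenize(text)), "tokens")

# Same comparison for the SentencePiece vocabularies.
for key in ["sp32", "sp50", "sp64", "sp86"]:
    tok = get_sp(key)
    print(key, len(tok.tokenize(text)), "tokens")
```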
## System Requirements
- Python 3.6 or later
- `transformers` library
## Contact:
For queries or assistance, please contact riotu@psu.edu.sa.
## Acknowledgments:
This work is maintained by the Robotics and Internet-of-Things Lab at Prince Sultan University.
## Team:
- Prof. Anis Koubaa (Lab Leader)
- Dr. Lahouari Ghouti (NLP Team Leader)
- Eng. Omar Najjar (AI Research Assistant)
- Eng. Serry Sebai (NLP Research Engineer)
## Version:
0.2.3
## Citations:
Coming soon