# Nepali Tokenizers
[![LICENSE](https://img.shields.io/badge/license-Apache--2.0-blue)](./LICENSE) [![Build and Release](https://github.com/basnetsoyuj/nepali-tokenizers/actions/workflows/build.yml/badge.svg)](https://github.com/basnetsoyuj/nepali-tokenizers/actions/workflows/build.yml)
This package provides access to pre-trained __WordPiece__ and __SentencePiece__ (Unigram) tokenizers for the Nepali language, trained using HuggingFace's `tokenizers` library. It is a small, focused Python package tailored specifically to Nepali, with a default configuration for the normalizer, pre-tokenizer, post-processor, and decoder.
It also enables further customization by exposing HuggingFace's `Tokenizer` pipeline, allowing users to adapt the tokenizers to their requirements.
## Installation
You can install `nepalitokenizers` using pip:
```sh
pip install nepalitokenizers
```
## Usage
After installing the package, you can use the tokenizers in your Python code:
### WordPiece Tokenizer
```python
from nepalitokenizers import WordPiece
text = "हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।"
tokenizer_wp = WordPiece()
tokens = tokenizer_wp.encode(text)
print(tokens.ids)
print(tokens.tokens)
print(tokenizer_wp.decode(tokens.ids))
```
**Output**
```
[1, 11366, 8625, 14157, 8423, 13344, 9143, 8425, 1496, 9505, 22406, 11693, 12679, 8340, 27445, 1430, 1496, 13890, 9008, 9605, 13591, 14547, 9957, 12507, 8700, 1496, 2]
['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', 'मेरो', 'पानीजहाज', 'वाम', 'माछा', '##ले', 'भरिपूर्ण', 'छ', '।', 'इन्जिनियर', '##हरुले', 'गएको', 'हप्ता', 'राजधानीमा', 'त्यस्तै', 'बहस', 'गरे', '।', '[SEP]']
हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।
```
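Since the returned tokenizer delegates to HuggingFace's `Tokenizer`, batch encoding should also work out of the box. A minimal sketch, assuming the standard `encode_batch` method of the underlying `Tokenizer` (the sentences are only illustrative):
```python
from nepalitokenizers import WordPiece

tokenizer_wp = WordPiece()

# encode several sentences at once; returns one Encoding per input
encodings = tokenizer_wp.encode_batch([
    "नेपाल दक्षिण एसियामा अवस्थित छ ।",
    "काठमाडौँ नेपालको राजधानी हो ।",
])
for enc in encodings:
    print(enc.tokens)
```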
### SentencePiece (Unigram) Tokenizer
```python
from nepalitokenizers import SentencePiece
text = "कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।"
tokenizer_sp = SentencePiece()
tokens = tokenizer_sp.encode(text)
print(tokens.ids)
print(tokens.tokens)
print(tokenizer_sp.decode(tokens.ids))
```
**Output**
```
[7, 9, 3241, 483, 12081, 9, 11079, 23, 2567, 11254, 1002, 789, 20, 3334, 2161, 9, 23517, 2711, 1115, 9, 1718, 12, 5941, 781, 19, 8, 1, 0]
['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '▁मुक्त', '▁नहुँदै', '▁मानव', '▁समाज', 'लाई', '▁यतिबेला', '▁युद्ध', 'को', '▁विध्वंस', 'कारी', '▁क्षति', 'को', '▁चिन्ता', 'ले', '▁चिन्तित', '▁बनाएको', '▁छ', '▁।', '<sep>', '<cls>']
कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।
```
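The value returned by `encode` is a standard HuggingFace `Encoding`, so token-to-character offsets should also be available. A small sketch, assuming the usual `Encoding.offsets` attribute (the sentence is illustrative):
```python
from nepalitokenizers import SentencePiece

tokenizer_sp = SentencePiece()

text = "नेपाल हिमालयको काखमा अवस्थित छ ।"
tokens = tokenizer_sp.encode(text)

# map each token back to its (start, end) character span in the input;
# special tokens such as <cls> and <sep> map to empty spans
for token, (start, end) in zip(tokens.tokens, tokens.offsets):
    print(token, "->", text[start:end])
```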
## Configuration & Customization
Each tokenizer class ships with a standard default configuration for the normalizer, pre-tokenizer, post-processor, and decoder. For details, see the training scripts in the [`train/`](train/) directory.
The package enables further customization by providing direct access to HuggingFace's tokenizer pipeline, so you can treat `nepalitokenizers` tokenizer instances as HuggingFace `Tokenizer` objects. For example:
```python
from nepalitokenizers import WordPiece
# importing from the HuggingFace tokenizers package
from tokenizers.processors import TemplateProcessing
text = "हाम्रो मातृभूमि नेपाल हो"
tokenizer_wp = WordPiece()
# use the default post-processor
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)
# swap the post-processor so that no special tokens are added;
# tokenizer_wp can be treated as a HuggingFace Tokenizer object
tokenizer_wp.post_processor = TemplateProcessing()
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)
```
**Output**
```
['[CLS]', 'हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो', '[SEP]']
['हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो']
```
To learn more about further customizations that can be performed, visit [HuggingFace's Tokenizer Documentation](https://huggingface.co/docs/tokenizers/api/tokenizer).
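For instance, truncation and padding can be enabled through the same delegated interface. A minimal sketch, assuming the standard `enable_truncation` / `enable_padding` methods and a `[PAD]` token in the WordPiece vocabulary (the length of 32 is arbitrary):
```python
from nepalitokenizers import WordPiece

tokenizer_wp = WordPiece()

# assumes "[PAD]" is present in the vocabulary; look up its id first
pad_id = tokenizer_wp.token_to_id("[PAD]")

# these calls are forwarded to the underlying HuggingFace Tokenizer
tokenizer_wp.enable_truncation(max_length=32)
tokenizer_wp.enable_padding(pad_id=pad_id, pad_token="[PAD]", length=32)

tokens = tokenizer_wp.encode("हाम्रो मातृभूमि नेपाल हो")
print(len(tokens.ids))  # 32 after padding
```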
> **Note**: The delegation to HuggingFace's `Tokenizer` pipeline is implemented with the following generic wrapper class, because `tokenizers.Tokenizer` is not an acceptable base type for inheritance.
> It is a useful trick I use for solving similar issues:
> ```python
> class Delegate:
>     """
>     A generic wrapper class that delegates attributes and method calls
>     to the specified self.delegate instance.
>     """
>
>     @property
>     def _items(self):
>         return dir(self.delegate)
>
>     def __getattr__(self, name):
>         if name in self._items:
>             return getattr(self.delegate, name)
>         raise AttributeError(
>             f"'{self.__class__.__name__}' object has no attribute '{name}'")
>
>     def __setattr__(self, name, value):
>         if name == "delegate" or name not in self._items:
>             super().__setattr__(name, value)
>         else:
>             setattr(self.delegate, name, value)
>
>     def __dir__(self):
>         return dir(type(self)) + list(self.__dict__.keys()) + self._items
> ```
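> For example, a wrapper only needs to assign `self.delegate`; every other attribute access and method call is forwarded. A hypothetical sketch using the `Delegate` class above (`MyTokenizer` and the file path are illustrative, not part of this package):
>
> ```python
> from tokenizers import Tokenizer
>
> class MyTokenizer(Delegate):
>     def __init__(self, path):
>         # everything not defined here is forwarded to the wrapped Tokenizer
>         self.delegate = Tokenizer.from_file(path)
>
> tok = MyTokenizer("tokenizer.json")  # hypothetical tokenizer file
> print(tok.encode("नेपाल").tokens)    # delegated to Tokenizer.encode
> ```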
## Training
The Python scripts used to train the tokenizers are available in the [`train/`](train/) directory. You can also use them to train your own tokenizers on a custom text corpus.
These tokenizers were trained on two datasets:
#### 1. The Nepali Subset of the [OSCAR](https://oscar-project.github.io/documentation/versions/oscar-2301/) dataset
You can download it and export it to plain-text files using the following code:
```python
import datasets
from tqdm.auto import tqdm
import os
dataset = datasets.load_dataset(
    'oscar', 'unshuffled_deduplicated_ne',
    split='train'
)
os.mkdir('data')
# collect samples and write them out as text files of 10,000 lines each
batch = []
counter = 0
for sample in tqdm(dataset):
    sample = sample['text'].replace('\n', ' ')
    batch.append(sample)
    if len(batch) == 10_000:
        with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(batch))
        batch = []
        counter += 1
# write the final partial batch, if any
if batch:
    with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(batch))
```
#### 2. A Large Scale Nepali Text Corpus by Rabindra Lamsal (2020)
To download the dataset, follow the instructions provided in this link: [A Large Scale Nepali Text Corpus](https://dx.doi.org/10.21227/jxrd-d245).
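The exact settings used for this package live in the scripts under [`train/`](train/). As a rough sketch, a WordPiece tokenizer could be trained on the exported text files with HuggingFace's `tokenizers` library roughly as follows (the vocabulary size, special tokens, and output filename are illustrative, not necessarily the package's settings):
```python
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# plain-text files produced by the OSCAR export above
files = [str(p) for p in Path("data").glob("ne_*.txt")]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # illustrative value
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files, trainer)
tokenizer.save("nepali-wordpiece.json")  # hypothetical output path
```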
## License
This package is licensed under the Apache 2.0 License, which is consistent with the license used by HuggingFace's `tokenizers` library. Please see the [`LICENSE`](LICENSE) file for more details.