![Pashto Word Cloud](wc.png)
# NLPashto – NLP Toolkit for Pashto
![GitHub](https://img.shields.io/github/license/ijazul-haq/nlpashto) ![GitHub contributors](https://img.shields.io/github/contributors/ijazul-haq/nlpashto) ![code size](https://img.shields.io/github/languages/code-size/ijazul-haq/nlpashto)
NLPashto is a Python suite for Pashto Natural Language Processing, which includes tools for fundamental text processing tasks, such as text cleaning, tokenization, and chunking (word segmentation). It also includes state-of-the-art models for POS tagging and sentiment analysis (offensive language detection, to be specific).
**Citation**
```
@article{haq2023nlpashto,
title={NLPashto: NLP Toolkit for Low-resource Pashto Language},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={International Journal of Advanced Computer Science and Applications},
volume={14},
number={6},
pages={1345-1352},
year={2023},
issn={2156-5570},
doi={https://dx.doi.org/10.14569/IJACSA.2023.01406142},
publisher={Science and Information (SAI) Organization Limited}
}
```
## Prerequisites
To use NLPashto you will need:
* Python 3.8+
## Installing NLPashto
NLPashto can be installed from the GitHub source or from PyPI by running:
```bash
pip install nlpashto
```
## Basic Usage
### Text Cleaning
This module contains basic text cleaning utilities, which can be used as follows:
```python
from nlpashto import Cleaner
cleaner=Cleaner()
noisy_txt='په ژوند کی علم 📚📖 , 🖊 او پيسي 💵. 💸💲 دواړه حاصل کړه پوهان به دی علم ته درناوی ولري اوناپوهان به دي پیسو ته… https://t.co/xIiEXFg'
cleaned_text=cleaner.clean(noisy_txt)
print(cleaned_text)
Output:: په ژوند کی علم , او پيسي دواړه حاصل کړه پوهان به دی علم ته درناوی ولري او ناپوهان به دي پیسو ته
```
The `clean` method has several parameters you can adjust:
* `text`=None | str or list | the noisy input text to be cleaned.
* `split_into_sentences`=True | bool | whether to split the text into sentences.
* `remove_emojis`=True | bool | whether to remove emojis.
* `normalize_nums`=True | bool | if set to True, Arabic numerals (1, 2, 3, …) will be normalized to Pashto numerals (۱، ۲، ۳، …).
* `remove_puncs`=False | bool | if set to True, punctuation marks (., ”, ’, ؟, etc.) will be removed.
* `remove_special_chars`=True | bool | if set to True, special characters (@, #, $, %, etc.) will be removed.
* `special_chars`=[ ] | list | a list of special characters to keep in the text.
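As a rough, library-independent illustration of what `normalize_nums` is described to do, the mapping from Arabic numerals to Pashto (Extended Arabic-Indic) numerals can be sketched as below; the helper `to_pashto_digits` is our own, not part of NLPashto:

```python
# Illustration only: map ASCII digits 0-9 to Pashto (Extended
# Arabic-Indic) digits U+06F0-U+06F9. Not part of NLPashto.
PASHTO_DIGITS = {str(i): chr(0x06F0 + i) for i in range(10)}

def to_pashto_digits(text):
    # Replace each ASCII digit; leave every other character unchanged.
    return ''.join(PASHTO_DIGITS.get(ch, ch) for ch in text)

print(to_pashto_digits('123'))  # ۱۲۳
```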
### Tokenization (Space Correction)
This module corrects space omission and space insertion errors: it removes extra spaces from the text and inserts spaces where necessary. It is essentially a supervised whitespace tokenizer, but unlike Python's `split()` function, the NLPashto Tokenizer is aware of the correct positions of whitespace in the text. Under the hood it is a CRF model trained on a corpus of ≈200K sentences. For further details about this model, please refer to our paper, [Correction of whitespace and word segmentation in noisy Pashto text using CRF](https://linkinghub.elsevier.com/retrieve/pii/S0167639323001048).
```python
from nlpashto import Tokenizer
tokenizer=Tokenizer()
noisy_txt='جلال اباد ښار کې هره ورځ لس ګونه کسانپهډلهییزهتوګهدنشهيي توکو کارولو ته ا د ا م ه و رک وي'
tokenized_text=tokenizer.tokenize(noisy_txt)
print(tokenized_text)
Output:: [['جلال', 'اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله', 'ییزه', 'توګه', 'د', 'نشه', 'يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]
```
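To see why a plain `str.split()` is not enough, note that it only separates on whitespace that already exists; it cannot detect the boundaries missing inside a fused token. A minimal, self-contained demonstration (using a fused fragment from the sentence above):

```python
# Python's built-in split() only uses existing whitespace, so it
# cannot recover boundaries lost to space-omission errors - which is
# why a trained (CRF-based) tokenizer is needed.
fused = 'کسانپهډله'            # several words fused into one token
print(fused.split())           # still a single token
print('لس ګونه'.split())       # splits only where spaces already exist
```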
### Word Segmentation (Chunking)
If we look at the above example, we can see that the Tokenizer has split the compound words `جلال اباد`, `ډله ییزه`, and `نشه يي` into meaningless sub-parts. In such cases, where retrieval of the full word is necessary (instead of space-delimited tokens), we can use the NLPashto `Segmenter` class. The word segmentation model is transformer-based and available on HuggingFace: [ijazulhaq/pashto-word-segmentation](https://huggingface.co/ijazulhaq/pashto-word-segmentation).
```python
from nlpashto import Segmenter
segmenter=Segmenter()
#we are passing the above tokenized text to word segmenter
segmented_text=segmenter.segment(tokenized_text)
print(segmented_text)
Output:: [['جلال اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله ییزه', 'توګه', 'د', 'نشه يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي', '']]
```
To segment multiple sentences, it’s better to specify the batch size by passing it to the class constructor, as below:
```python
segmenter=Segmenter(batch_size=32) # by default it’s 16
```
### Part-of-speech (POS) Tagging
For a detailed explanation about the POS tagger, tagset, and the dataset used for training the model, please have a look at our paper [POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model](https://www.researchsquare.com/article/rs-2712906/v1). This is also a transformer-based model, available on HuggingFace [ijazulhaq/pashto-pos](https://huggingface.co/ijazulhaq/pashto-pos).
```python
from nlpashto import POSTagger
pos_tagger=POSTagger()
pos_tagged=pos_tagger.tag(segmented_text)
print(pos_tagged)
Output:: [[('جلال اباد', 'NNP'), ('ښار', 'NNM'), ('کې', 'PT'), ('هره', 'JJ'), ('ورځ', 'NNF'), ('لسګونه', 'JJ'), ('کسان', 'NNS'), ('په', 'IN'), ('ډله ییزه', 'JJ'), ('توګه', 'NNF'), ('د', 'IN'), ('نشه يي', 'JJ'), ('توکو', 'NNS'), ('کارولو', 'VBG'), ('ته', 'PT'), ('ادامه', 'NNF'), ('ورکوي', 'VBP')]]
```
The `tag` method takes a list of segmented sentences as input and returns a list of lists of tuples, where the first element of each tuple is the word and the second is its POS tag.
### Sentiment Analysis (Offensive Language Detection)
The NLPashto toolkit includes a state-of-the-art model for offensive language detection. It is a fine-tuned PsBERT model that predicts text toxicity directly, without translating the text. It takes a piece of text as input and returns 0 (normal) or 1 (offensive/toxic). For further details, please read our paper, [Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT](http://dx.doi.org/10.7717/peerj-cs.1617).
```python
from nlpashto import POLD
sentiment_analysis=POLD()
offensive_text='مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
sentiment=sentiment_analysis.predict(offensive_text)
print(sentiment)
Output:: 1
normal_text='تاسو رښتیا وایئ خور 🙏'
sentiment=sentiment_analysis.predict(normal_text)
print(sentiment)
Output:: 0
```
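Since `predict` returns a bare 0 or 1, you may want to map the result to a readable label. The label strings below are our own convention, not part of the library:

```python
# Map POLD's integer prediction to a human-readable label.
# The label names here are our own convention, not NLPashto's.
LABELS = {0: 'normal', 1: 'offensive'}

def label_of(prediction):
    return LABELS[prediction]

print(label_of(1))  # offensive
print(label_of(0))  # normal
```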
## Other Resources
#### BERT (WordPiece Level)
Pretrained Pashto BERT model (PsBERT), available on HuggingFace, [ijazulhaq/bert-base-pashto](https://huggingface.co/ijazulhaq/bert-base-pashto)
#### BERT (Character Level)
Pretrained Pashto BERT model (character-level), available on HuggingFace, [ijazulhaq/bert-base-pashto-c](https://huggingface.co/ijazulhaq/bert-base-pashto-c)
#### Static Word Embeddings
For Pashto, we have pretrained three types of static word embeddings, available in the [pashto-word-embeddings](https://github.com/ijazul-haq/) repository:
* Word2Vec
* fastText
* GloVe
#### Examples and Notebooks
For related examples and Jupyter notebooks, please visit our [Kaggle profile](https://www.kaggle.com/drijaz/).
#### Datasets and Text Corpora
Sample datasets are available on our [Kaggle profile](https://www.kaggle.com/drijaz/), and the full version of the datasets and annotated corpora can be provided on request.
## Contact
- LinkedIn: [https://www.linkedin.com/in/drijaz/](https://www.linkedin.com/in/drijaz/)
- Email: [ijazse@hotmail.com](mailto:ijazse@hotmail.com)
## Citations
Please cite our work if you use this code or toolkit for learning or any other purpose.
**For the NLPashto Toolkit**
_H. Ijazul, Q. Weidong, G. Jie, and T. Peng, "NLPashto: NLP Toolkit for Low-resource Pashto Language," International Journal of Advanced Computer Science and Applications, vol. 14, pp. 1345-1352, 2023._
```
@article{haq2023nlpashto,
title={NLPashto: NLP Toolkit for Low-resource Pashto Language},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={International Journal of Advanced Computer Science and Applications},
volume={14},
number={6},
pages={1345-1352},
year={2023},
issn={2156-5570},
doi={https://dx.doi.org/10.14569/IJACSA.2023.01406142},
publisher={Science and Information (SAI) Organization Limited}
}
```
**For Tokenization, Space Correction, and Word Segmentation**
_H. Ijazul, Q. Weidong, G. Jie, and T. Peng, "Correction of whitespace and word segmentation in noisy Pashto text using CRF," Speech Communication, vol. 153, p. 102970, 2023._
```
@article{HAQ2023102970,
title={Correction of whitespace and word segmentation in noisy Pashto text using CRF},
journal={Speech Communication},
volume={153},
pages={102970},
year={2023},
issn={0167-6393},
doi={https://doi.org/10.1016/j.specom.2023.102970},
url={https://www.sciencedirect.com/science/article/pii/S0167639323001048},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
}
```
**For POS Tagger and Tagset**
_H. Ijazul, Q. Weidong, G. Jie, and T. Peng, "POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model," preprint https://doi.org/10.21203/rs.3.rs-2712906/v1, 2023._
```
@article{haq2023pashto,
title={POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={preprint https://doi.org/10.21203/rs.3.rs-2712906/v1},
year={2023}
}
```
**For Sentiment Classification, Offensive Language Detection, and pretrained Pashto BERT model (PsBERT)**
_H. Ijazul, Q. Weidong, G. Jie, and T. Peng, "Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT," PeerJ Comput. Sci., 2023._
```
@article{haq2023pold,
title={Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={PeerJ Comput. Sci.},
year={2023},
issn={2376-5992},
doi={http://doi.org/10.7717/peerj-cs.1617}
}
```