
<p align="center">
<a href="https://pypi.python.org/pypi/shekar" target="_blank"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/shekar?color=00A693"></a>
<a href="https://pypi.python.org/pypi/shekar" target="_blank"><img alt="GitHub Actions Workflow Status" src="https://img.shields.io/github/actions/workflow/status/amirivojdan/shekar/test.yml?color=00A693"></a>
<a href="https://pypi.python.org/pypi/shekar" target="_blank"><img alt="Codecov" src="https://img.shields.io/codecov/c/github/amirivojdan/shekar?color=00A693"></a>
<a href="https://pypi.python.org/pypi/shekar" target="_blank"><img alt="PyPI - License" src="https://img.shields.io/pypi/l/shekar?color=00A693"></a>
<a href="https://pypi.python.org/pypi/shekar" target="_blank"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/shekar?color=00A693"></a>
<a href="https://doi.org/10.21105/joss.09128" target="_blank">
<img alt="Static Badge" src="https://img.shields.io/badge/JOSS-10.21105%2Fjoss.09128-00A693"></a>
</p>
<p align="center">
    <em>Simplifying Persian NLP for Modern Applications</em>
</p>
**Shekar** (meaning 'sugar' in Persian) is an open-source Python library for Persian natural language processing, named after the influential satirical story *"فارسی شکر است"* (Persian is Sugar) published in 1921 by Mohammad Ali Jamalzadeh. The story became a cornerstone of Iran's literary renaissance, advocating for accessible yet eloquent expression. Shekar embodies this philosophy in its design and development.
It provides tools for text preprocessing, tokenization, part-of-speech (POS) tagging, named entity recognition (NER), embeddings, spell checking, and more. With its modular pipeline design, Shekar makes it easy to build reproducible workflows for both research and production applications.
📖 Documentation: https://lib.shekar.io/
### Table of Contents
- [Installation](#installation)
  - [CPU Installation (All Platforms)](#cpu-installation-all-platforms)
  - [GPU Acceleration (NVIDIA CUDA)](#gpu-acceleration-nvidia-cuda)
- [Preprocessing](#preprocessing)
  - [Normalizer](#normalizer)
  - [Customization](#customization)
- [Tokenization](#tokenization)
  - [WordTokenizer](#wordtokenizer)
  - [SentenceTokenizer](#sentencetokenizer)
- [Embeddings](#embeddings)
  - [Word Embeddings](#word-embeddings)
  - [Contextual Embeddings](#contextual-embeddings)
- [Stemming](#stemming)
- [Lemmatization](#lemmatization)
- [Part-of-Speech Tagging](#part-of-speech-tagging)
- [Named Entity Recognition (NER)](#named-entity-recognition-ner)
- [Sentiment Analysis](#sentiment-analysis)
- [Toxicity Detection](#toxicity-detection)
- [Keyword Extraction](#keyword-extraction)
- [Spell Checking](#spell-checking)
- [WordCloud](#wordcloud)
- [Command-Line Interface (CLI)](#command-line-interface-cli)
- [Download Models](#download-models)
- [Citation](#citation)
## Installation
You can install Shekar with pip. By default, the CPU build of ONNX Runtime is included, which works on all platforms.
### CPU Installation (All Platforms)
<!-- termynal -->
```bash
$ pip install shekar
```
This works on **Windows**, **Linux**, and **macOS** (including Apple Silicon M1/M2/M3).
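
To verify the installation, you can check the installed version from Python (a quick sanity check using only the standard library; the version shown depends on your environment):

```python
# Confirm the package is importable and report its installed version.
from importlib.metadata import version

import shekar  # raises ImportError if the install failed

print(version("shekar"))
```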
### GPU Acceleration (NVIDIA CUDA)
If you have an NVIDIA GPU and want hardware acceleration, you need to replace the CPU runtime with the GPU version.
**Prerequisites**
- NVIDIA GPU with CUDA support
- Appropriate CUDA Toolkit installed
- Compatible NVIDIA drivers
<!-- termynal -->
```bash
$ pip install shekar && pip uninstall -y onnxruntime && pip install onnxruntime-gpu
```
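
After swapping in `onnxruntime-gpu`, you can confirm that the CUDA execution provider is visible (a minimal check using ONNX Runtime's own API; if `CUDAExecutionProvider` is missing, revisit your driver and CUDA Toolkit versions):

```python
import onnxruntime as ort

# Lists the execution providers available to this build of ONNX Runtime;
# a working GPU setup includes "CUDAExecutionProvider".
print(ort.get_available_providers())
```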
## Preprocessing
[Open the notebook](examples/preprocessing.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/preprocessing.ipynb)
### Normalizer
The built-in `Normalizer` class provides a ready-to-use pipeline that combines the most common filters and normalization steps, offering a default configuration that covers the majority of use cases.
```python
from shekar import Normalizer
normalizer = Normalizer()
text = "«فارسی شِکَر است» نام داستان ڪوتاه طنز    آمێزی از محمد علی جمالــــــــزاده ی گرامی می   باشد که در سال 1921 منتشر  شده است و آغاز   ڱر تحول بزرگی در ادَبێات معاصر ایران 🇮🇷 بۃ شمار میرود."
print(normalizer(text))
```
```shell
«فارسی شکر است» نام داستان کوتاه طنزآمیزی از محمد‌علی جمالزاده‌ی گرامی می‌باشد که در سال ۱۹۲۱ منتشر شده‌است و آغازگر تحول بزرگی در ادبیات معاصر ایران به شمار می‌رود.
```
### Customization
For advanced customization, Shekar offers a modular, composable framework for text preprocessing. Components such as `filters`, `normalizers`, and `maskers` can be applied individually or chained into a `Pipeline` using the `|` operator:
```python
from shekar.preprocessing import EmojiRemover, PunctuationRemover
text = "ز ایران دلش یاد کرد و بسوخت! 🌍🇮🇷"
pipeline = EmojiRemover() | PunctuationRemover()
output = pipeline(text)
print(output)
```
```shell
ز ایران دلش یاد کرد و بسوخت
```
## Tokenization
### WordTokenizer
The `WordTokenizer` class is a simple, rule-based tokenizer for Persian that splits text on punctuation and whitespace using Unicode-aware regular expressions.
```python
from shekar import WordTokenizer
tokenizer = WordTokenizer()
text = "چه سیبهای قشنگی! حیات نشئهٔ تنهایی است."
tokens = list(tokenizer(text))
print(tokens)
```
```shell
["چه", "سیبهای", "قشنگی", "!", "حیات", "نشئهٔ", "تنهایی", "است", "."]
```
### SentenceTokenizer
The `SentenceTokenizer` class splits a given text into individual sentences, handling various punctuation marks and language-specific rules to accurately identify sentence boundaries. This is particularly useful in natural language processing tasks where sentence structure matters.
Below is an example of how to use the `SentenceTokenizer`:
```python
from shekar.tokenization import SentenceTokenizer
text = "هدف ما کمک به یکدیگر است! ما میتوانیم با هم کار کنیم."
tokenizer = SentenceTokenizer()
sentences = tokenizer(text)
for sentence in sentences:
    print(sentence)
```
```output
هدف ما کمک به یکدیگر است!
ما می‌توانیم با هم کار کنیم.
```
## Embeddings
[Open the notebook](examples/embeddings.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/embeddings.ipynb)
**Shekar** offers two main embedding classes:
- **`WordEmbedder`**: Provides static word embeddings using pre-trained FastText models.
- **`ContextualEmbedder`**: Provides contextual embeddings using a fine-tuned ALBERT model.
Both classes share a consistent interface:
- `embed(text)` returns a NumPy vector.
- `transform(text)` is an alias for `embed(text)` to integrate with pipelines.
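
For example, `embed` and its alias `transform` should be interchangeable (a minimal sketch based on the interface described above):

```python
import numpy as np
from shekar.embeddings import WordEmbedder

embedder = WordEmbedder(model="fasttext-d100")

# embed() and its pipeline-friendly alias transform() return the same vector.
v1 = embedder.embed("کتاب")
v2 = embedder.transform("کتاب")
assert np.allclose(v1, v2)
```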
### Word Embeddings
`WordEmbedder` supports two static FastText models:
- **`fasttext-d100`**: A 100-dimensional CBOW model trained on [Persian Wikipedia](https://huggingface.co/datasets/codersan/Persian-Wikipedia-Corpus).
- **`fasttext-d300`**: A 300-dimensional CBOW model trained on the large-scale [Naab dataset](https://huggingface.co/datasets/SLPL/naab).
```python
from shekar.embeddings import WordEmbedder
embedder = WordEmbedder(model="fasttext-d100")
embedding = embedder("کتاب")
print(embedding.shape)
similar_words = embedder.most_similar("کتاب", top_n=5)
print(similar_words)
```
### Contextual Embeddings
`ContextualEmbedder` uses an ALBERT model trained with Masked Language Modeling (MLM) on the Naab dataset to generate high-quality contextual embeddings.
The resulting embeddings are 768-dimensional vectors representing the semantic meaning of entire phrases or sentences.
```python
from shekar.embeddings import ContextualEmbedder
embedder = ContextualEmbedder(model="albert")
sentence = "کتابها دریچهای به جهان دانش هستند."
embedding = embedder(sentence)
print(embedding.shape)  # (768,)
```
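
Since each sentence maps to a fixed-size vector, semantic relatedness can be estimated with cosine similarity (a small sketch using NumPy; the second sentence is illustrative):

```python
import numpy as np
from shekar.embeddings import ContextualEmbedder

embedder = ContextualEmbedder(model="albert")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embedder("کتاب‌ها دریچه‌ای به جهان دانش هستند.")
v2 = embedder("مطالعه دریچه‌ای به سوی دانش است.")
print(cosine(v1, v2))  # values closer to 1.0 indicate higher similarity
```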
## Stemming
The `Stemmer` is a lightweight, rule-based reducer for Persian word forms. It trims common suffixes while respecting Persian orthography and Zero Width Non-Joiner usage. The goal is to produce stable stems for search, indexing, and simple text analysis without requiring a full morphological analyzer.
```python
from shekar import Stemmer
stemmer = Stemmer()
print(stemmer("نوهام"))
print(stemmer("کتابها"))
print(stemmer("خانههایی"))
```
```output
نوه
کتاب
خانه
```
## Lemmatization
The `Lemmatizer` maps Persian words to their base dictionary form. Unlike stemming, which only trims affixes, lemmatization uses explicit verb conjugation rules, vocabulary lookups, and a stemmer fallback to ensure valid lemmas. This makes it more accurate for tasks like part-of-speech tagging, text normalization, and linguistic analysis where the canonical form of a word is required. For verbs, the lemma is returned as a past-stem/present-stem pair (e.g., رفت/رو for رفتند).
```python
from shekar import Lemmatizer
lemmatizer = Lemmatizer()
print(lemmatizer("رفتند"))
print(lemmatizer("کتابها"))
print(lemmatizer("خانههایی"))
print(lemmatizer("گفته بودهایم"))
```
```output
رفت/رو
کتاب
خانه
گفت/گو
```
## Part-of-Speech Tagging
[Open the notebook](examples/pos_tagging.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/pos_tagging.ipynb)
The `POSTagger` class provides part-of-speech tagging for Persian text using a transformer-based model (default: ALBERT). It returns one tag per word, using the Universal POS tag set from the Universal Dependencies standard.
Example usage:
```python
from shekar import POSTagger
pos_tagger = POSTagger()
text = "نوروز، جشن سال نو ایرانی، بیش از سه هزار سال قدمت دارد و در کشورهای مختلف جشن گرفته میشود."
result = pos_tagger(text)
for word, tag in result:
    print(f"{word}: {tag}")
```
```output
نوروز: PROPN
،: PUNCT
جشن: NOUN
سال: NOUN
نو: ADJ
ایرانی: ADJ
،: PUNCT
بیش: ADJ
از: ADP
سه: NUM
هزار: NUM
سال: NOUN
قدمت: NOUN
دارد: VERB
و: CCONJ
در: ADP
کشورهای: NOUN
مختلف: ADJ
جشن: NOUN
گرفته: VERB
می‌شود: VERB
.: PUNCT
```
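
Because the tagger yields `(word, tag)` pairs, filtering by tag is a one-liner; for instance, collecting the nouns from the example above:

```python
# Reuses pos_tagger and text from the example above.
nouns = [word for word, tag in pos_tagger(text) if tag == "NOUN"]
print(nouns)  # ['جشن', 'سال', 'سال', 'قدمت', 'کشورهای', 'جشن']
```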
## Named Entity Recognition (NER)
[Open the notebook](examples/ner.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/ner.ipynb)
The `NER` module offers a fast, quantized Named Entity Recognition pipeline using a fine-tuned ALBERT model. It detects common Persian entities such as persons, locations, organizations, and dates. This model is designed for efficient inference and can be easily combined with other preprocessing steps.
Example usage:
```python
from shekar import NER
from shekar import Normalizer
input_text = (
    "شاهرخ مسکوب به سالِ ۱۳۰۴ در بابل زاده شد و دوره ابتدایی را در تهران و در مدرسه علمیه پشت "
    "مسجد سپهسالار گذراند. از کلاس پنجم ابتدایی مطالعه رمان و آثار ادبی را شروع کرد. از همان زمان "
    "در دبیرستان ادب اصفهان ادامه تحصیل داد. پس از پایان تحصیلات دبیرستان در سال ۱۳۲۴ از اصفهان به تهران رفت و "
    "در رشته حقوق دانشگاه تهران مشغول به تحصیل شد."
)
normalizer = Normalizer()
normalized_text = normalizer(input_text)
albert_ner = NER()
entities = albert_ner(normalized_text)
for text, label in entities:
    print(f"{text} → {label}")
```
```output
شاهرخ مسکوب → PER
سال ۱۳۰۴ → DAT
بابل → LOC
دوره ابتدایی → DAT
تهران → LOC
مدرسه علمیه → LOC
مسجد سپهسالار → LOC
دبیرستان ادب اصفهان → LOC
در سال ۱۳۲۴ → DAT
اصفهان → LOC
تهران → LOC
دانشگاه تهران → ORG
```
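
The `(text, label)` pairs are easy to regroup by entity type, e.g., to collect all locations or dates (continuing from the example above):

```python
from collections import defaultdict

# Group recognized entities by their label (PER, LOC, ORG, DAT, ...).
by_label = defaultdict(list)
for entity_text, label in entities:
    by_label[label].append(entity_text)

print(by_label["LOC"])  # all location mentions in the input
```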
## Sentiment Analysis
The `SentimentClassifier` module enables automatic sentiment analysis of Persian text using transformer-based models. It currently supports the `AlbertBinarySentimentClassifier`, a lightweight ALBERT model fine-tuned on the Snapfood dataset to classify text as **positive** or **negative**, returning both the predicted label and its confidence score.
**Example usage:**
```python
from shekar import SentimentClassifier
sentiment_classifier = SentimentClassifier()
print(sentiment_classifier("سریال قصه‌های مجید عالی بود!"))
print(sentiment_classifier("فیلم ۳۰۰ افتضاح بود!"))
```
```output
('positive', 0.9923112988471985)
('negative', 0.9330866932868958)
```
## Toxicity Detection
The `toxicity` module currently includes a Logistic Regression classifier trained on TF-IDF features extracted from the [Naseza (ناسزا) dataset](https://github.com/amirivojdan/naseza), a large-scale collection of Persian text labeled for offensive and neutral language. The `OffensiveLanguageClassifier` processes input text to determine whether it is neutral or offensive, returning both the predicted label and its confidence score.
```python
from shekar.toxicity import OffensiveLanguageClassifier
offensive_classifier = OffensiveLanguageClassifier()
print(offensive_classifier("زبان فارسی میهن من است!"))
print(offensive_classifier("تو خیلی احمق و بیشرفی!"))
```
```output
('neutral', 0.7651197910308838)
('offensive', 0.7607775330543518)
```
## Keyword Extraction
[Open the notebook](examples/keyword_extraction.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/keyword_extraction.ipynb)
The `shekar.keyword_extraction` module provides tools for automatically identifying and extracting key terms and phrases from Persian text, surfacing the most important concepts and topics within documents.
```python
from shekar import KeywordExtractor
extractor = KeywordExtractor(max_length=2, top_n=10)
input_text = (
    "زبان فارسی یکی از زبانهای مهم منطقه و جهان است که تاریخچهای کهن دارد. "
    "زبان فارسی با داشتن ادبیاتی غنی و شاعرانی برجسته، نقشی بیبدیل در گسترش فرهنگ ایرانی ایفا کرده است. "
    "از دوران فردوسی و شاهنامه تا دوران معاصر، زبان فارسی همواره ابزار بیان اندیشه، احساس و هنر بوده است. "
)
keywords = extractor(input_text)
for kw in keywords:
    print(kw)
```
```output
فرهنگ ایرانی
گسترش فرهنگ
ایرانی ایفا
زبان فارسی
تاریخچه‌ای کهن
```
## Spell Checking
The `SpellChecker` class provides simple and effective spelling correction for Persian text. It can automatically detect and fix common errors such as extra characters, spacing mistakes, or misspelled words. You can use it directly as a callable on a sentence to clean up the text, or call `suggest()` to get a ranked list of correction candidates for a single word.
```python
from shekar import SpellChecker
spell_checker = SpellChecker()
print(spell_checker("سسلام بر ششما ددوست من"))
print(spell_checker.suggest("درود"))
```
```output
سلام بر شما دوست من
['درود', 'درصد', 'ورود', 'درد', 'درون']
```
## WordCloud
[Open the notebook](examples/word_cloud.ipynb) · [Run in Colab](https://colab.research.google.com/github/amirivojdan/shekar/blob/main/examples/word_cloud.ipynb)
The `WordCloud` class offers an easy way to create visually rich Persian word clouds. It supports reshaping and right-to-left rendering, Persian fonts, color maps, and custom shape masks for accurate and elegant visualization of word frequencies.
```python
import requests
from collections import Counter
from shekar import WordCloud
from shekar import WordTokenizer
from shekar.preprocessing import (
  HTMLTagRemover,
  PunctuationRemover,
  StopWordRemover,
  NonPersianRemover,
)
preprocessing_pipeline = HTMLTagRemover() | PunctuationRemover() | StopWordRemover() | NonPersianRemover()
url = f"https://shahnameh.me/p.php?id=F82F6CED"
response = requests.get(url)
html_content = response.text
clean_text = preprocessing_pipeline(html_content)
word_tokenizer = WordTokenizer()
tokens = word_tokenizer(clean_text)
word_freqs = Counter(tokens)
word_cloud = WordCloud(
        mask="Iran",
        width=640,
        height=480,
        max_font_size=220,
        min_font_size=6,
        bg_color="white",
        contour_color="black",
        contour_width=5,
        color_map="greens",
    )
# if words appear disconnected, try again with bidi_reshape=True
image = word_cloud.generate(word_freqs, bidi_reshape=False)
image.show()
```

## Command-Line Interface (CLI)
Shekar includes a command-line interface (CLI) for quick text processing and visualization.
You can normalize Persian text or generate word clouds directly from files or inline strings.
**Usage**
```console
shekar [COMMAND] [OPTIONS]
```
**Examples**
```console
# Normalize a text file and save output
shekar normalize -i ./corpus.txt -o ./normalized_corpus.txt
# Normalize inline text
shekar normalize -t "درود پروردگار بر ایران و ایرانی"
```
## Download Models
If Shekar Hub is unavailable, you can manually download the models below and place them in the cache directory at `~/.shekar/` (see the sketch after the table).
| Model                          | Size   | Download Link |
|--------------------------------|--------|---------------|
| FastText Embedding d100        | 50 MB  | [Download](https://drive.google.com/file/d/1qgd0slGA3Ar7A2ShViA3v8UTM4qXIEN6/view?usp=drive_link) |
| FastText Embedding d300        | 500 MB | [Download](https://drive.google.com/file/d/1yeAg5otGpgoeD-3-E_W9ZwLyTvNKTlCa/view?usp=drive_link) |
| SentenceEmbedding              | 60 MB  | [Download](https://drive.google.com/file/d/1PftSG2QD2M9qzhAltWk_S38eQLljPUiG/view?usp=drive_link) |
| POS Tagger                     | 38 MB  | [Download](https://drive.google.com/file/d/1d80TJn7moO31nMXT4WEatAaTEUirx2Ju/view?usp=drive_link) |
| NER                            | 38 MB  | [Download](https://drive.google.com/file/d/1DLoMJt8TWlNnGGbHDWjwNGsD7qzlLHfu/view?usp=drive_link) |
| Sentiment Classifier           | 38 MB  | [Download](https://drive.google.com/file/d/17gTip7RwipEkA7Rf3-Cv1W8XNHTdaS4c/view?usp=drive_link) |
| Offensive Language Classifier  | 8 MB   | [Download](https://drive.google.com/file/d/1ZLiFI6nzpQ2rYjJTKxOYKTfD9IqHZ5tc/view?usp=drive_link) |
| AlbertTokenizer                | 2 MB   | [Download](https://drive.google.com/file/d/1w-oe53F0nPePMcoor5FgXRwRMwkYqDqM/view?usp=drive_link) |
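
For reference, the expected cache location can be computed with `pathlib` (a minimal sketch, assuming the default layout described above; the file names inside the directory depend on the model):

```python
from pathlib import Path

# Default Shekar cache directory (assumed: ~/.shekar/).
cache_dir = Path.home() / ".shekar"
cache_dir.mkdir(parents=True, exist_ok=True)
print(cache_dir)
```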
-----
## Citation
If you find **Shekar** useful in your research, please consider citing the following paper:
```bibtex
@article{Amirivojdan_Shekar,
author = {Amirivojdan, Ahmad},
doi = {10.21105/joss.09128},
journal = {Journal of Open Source Software},
month = oct,
number = {114},
pages = {9128},
title = {{Shekar: A Python Toolkit for Persian Natural Language Processing}},
url = {https://joss.theoj.org/papers/10.21105/joss.09128},
volume = {10},
year = {2025}
}
```
<p align="center"><em>With ❤️ for <strong>IRAN</strong></em></p>
            
         