# MorphoPreText

- **Version**: 0.1.1
- **Summary**: A bilingual text preprocessing toolkit for English and Persian.
- **Author**: Ghazal Askari
- **Home page**: https://github.com/ghaskari/MorphoPreText
- **Upload time**: 2025-01-19 07:10:43
- **Requires Python**: >=3.8
- **Keywords**: text, preprocessing, nlp, english, persian, bilingual
- **Requirements**: emoji, nltk, pandas, scikit-learn, pyspellchecker, parsivar, spacy, openpyxl, jdatetime

MorphoPreText is a Python package for preprocessing English and Persian text. It provides normalization, tokenization, cleaning, and related steps commonly needed in Natural Language Processing (NLP) pipelines, with a dedicated module for each language.

---

## Features

- **Bilingual Support**: Handles preprocessing for both English and Persian text.
- **Text Normalization**: Includes dictionaries for standardizing characters, punctuation, and text structure, handling special characters, and expanding contractions.
- **Text Cleaning**: Removes URLs, HTML tags, emails, hashtags, mentions, and extra spaces.
- **Stopword Removal**: Uses customizable stopword lists for both languages.
- **Spelling Correction**: Automatically corrects misspelled words (English only).
- **Emoji Handling**: Options to remove emojis, replace them with placeholders, or analyze emoji sentiment.
- **Date Handling**: Converts Persian calendar dates to the Gregorian calendar (Persian only).
- **Task-Specific Preprocessing**: Predefined configurations for tasks such as `translation`, `sentiment`, `ner`, `spam_detection`, `topic_modeling`, and `summarization`.
- **Language-Specific Preprocessing**:
  - **Persian**: Diacritic removal, numeral normalization, punctuation handling, Persian stopword removal, half-space handling.
  - **English**: Spelling correction, contraction expansion, lemmatization, stemming, and punctuation cleaning.
- **Column-Wide Processing**: Processes entire pandas DataFrame columns for large text datasets.
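
The task names in the list suggest that each task switches individual preprocessing steps on or off. The package's own configuration format is not shown in this README, so the following is only a minimal sketch of the idea using the standard library. `STEP_FUNCTIONS`, `TASK_STEPS`, `preprocess`, and the steps chosen for each task are hypothetical and do not come from MorphoPreText.

```python
import re
from typing import Callable, Dict, List


def strip_urls(text: str) -> str:
    """Drop http(s) URLs."""
    return re.sub(r"https?://\S+", "", text)


def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def collapse_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical registry: step name -> function.
STEP_FUNCTIONS: Dict[str, Callable[[str], str]] = {
    "strip_urls": strip_urls,
    "lowercase": lowercase,
    "strip_punctuation": strip_punctuation,
    "collapse_spaces": collapse_spaces,
}

# Assumed, illustrative mapping from task name to an ordered list of steps.
TASK_STEPS: Dict[str, List[str]] = {
    "default": ["strip_urls", "collapse_spaces"],
    # Keep case and punctuation, which can carry entity cues.
    "ner": ["strip_urls", "collapse_spaces"],
    "sentiment": ["strip_urls", "lowercase", "collapse_spaces"],
    "topic_modeling": ["strip_urls", "lowercase", "strip_punctuation", "collapse_spaces"],
}


def preprocess(text: str, task: str = "default") -> str:
    """Run the steps configured for `task`, in order."""
    if task not in TASK_STEPS:
        raise ValueError(f"Unknown task: {task!r}")
    for step_name in TASK_STEPS[task]:
        text = STEP_FUNCTIONS[step_name](text)
    return text


sample = "Visit https://example.com NOW, please!"
print(preprocess(sample, task="ner"))             # Visit NOW, please!
print(preprocess(sample, task="topic_modeling"))  # visit now please
```

Keeping the configuration as plain data makes it easy to add a task or reorder steps without touching the step functions.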

---

## Installation

MorphoPreText is available on PyPI and can be installed using pip:

```bash
pip install morphopretext
```

Alternatively, install from source:

```bash
# Clone the repository
git clone https://github.com/ghaskari/MorphoPreText.git

# Navigate to the project directory
cd MorphoPreText

# Install the package
pip install .

# Install dependencies
pip install -r requirements.txt
```

---

## Usage

### What Can You Do With MorphoPreText?

MorphoPreText covers the following preprocessing needs:

- **Clean and Normalize Text**: Standardize characters, remove extra spaces, and handle punctuation.
- **Handle Emojis**: Remove, replace, or analyze sentiment based on emojis.
- **Convert Dates**: Process Persian calendar dates into standard Gregorian format.
- **Remove Unwanted Elements**: Strip out URLs, HTML tags, mentions, hashtags, and email addresses.
- **Custom Task Configurations**: Use predefined configurations for tasks like sentiment analysis, translation, and topic modeling.
- **Tokenization and Stopword Removal**: Tokenize text and remove language-specific stopwords.
- **Language-Specific Enhancements**: Handle unique linguistic features such as Persian half-spaces or English contractions.
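
To make the "Remove Unwanted Elements" item concrete, here is a minimal regex-based sketch of that kind of cleaning. It is not the package's implementation: `strip_unwanted` and the patterns are assumptions chosen for illustration, and real-world URLs and email addresses have many edge cases these patterns ignore.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not the library's own).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MENTION_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
SPACES_RE = re.compile(r"\s+")


def strip_unwanted(text: str) -> str:
    """Remove URLs, HTML tags, emails, mentions, and hashtags, then tidy spaces."""
    # Emails are removed before mentions so the "@domain" part is never
    # treated as a mention on its own.
    for pattern in (URL_RE, HTML_TAG_RE, EMAIL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()


print(strip_unwanted("This is a <b>bold</b> statement."))
# This is a bold statement.
print(strip_unwanted("@sam read <b>this</b> at https://example.com #nlp or mail me@example.org today"))
# read this at or mail today
```

Each match is replaced with a space rather than an empty string so that words separated only by a tag or URL do not get glued together. The final pass collapses the extra spaces.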

### English Text Preprocessing

```python
from morphopretext import EnglishTextPreprocessor

# Initialize the preprocessor
english_preprocessor = EnglishTextPreprocessor(task="default")

# Example 1: strip punctuation (emojis are kept)
text = "This is a sample text with emojis 😊 and a URL: https://example.com"
cleaned_text = english_preprocessor.clean_punctuation(text)
print(cleaned_text)  # Output: This is a sample text with emojis 😊 and a URL https example com

# Example 2: remove URLs and HTML tags
text_with_html = "This is a <b>bold</b> statement."
cleaned_html_text = english_preprocessor.remove_url_and_html(text_with_html)
print(cleaned_html_text)  # Output: This is a bold statement.

# Example 3: replace emojis with a placeholder
text_with_emojis = "I love programming! 😊"
emoji_handled_text = english_preprocessor.handle_emojis(text_with_emojis, strategy="replace")
print(emoji_handled_text)  # Output: I love programming! EMOJI

# Example 4: correct misspelled words
spelling_text = "Ths is a smple txt with erors."
corrected_text = english_preprocessor.correct_spelling(spelling_text)
print(corrected_text)  # Output: This is a sample text with errors.
```
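
The `remove` and `replace` emoji strategies can be approximated with a Unicode-range regex. This is a rough sketch only: `handle_emojis_sketch`, `EMOJI_RE`, and the `placeholder` parameter are hypothetical, and the ranges cover common emoji blocks rather than every emoji sequence. The package lists the `emoji` library among its requirements, which presumably gives fuller coverage.

```python
import re

# Approximate emoji coverage (an assumption, not an exhaustive definition):
# regional-indicator flags, pictographs and emoticons, miscellaneous symbols
# and dingbats, plus the emoji variation selector.
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001FAFF"
    "\u2600-\u27BF"
    "\uFE0F"
    "]+"
)


def handle_emojis_sketch(text: str, strategy: str = "remove", placeholder: str = "EMOJI") -> str:
    """Remove emojis or replace each run of emojis with a placeholder."""
    if strategy == "remove":
        cleaned = EMOJI_RE.sub("", text)
    elif strategy == "replace":
        cleaned = EMOJI_RE.sub(placeholder, text)
    else:
        raise ValueError(f"Unsupported strategy: {strategy!r}")
    # Collapse the whitespace left behind by removed emojis.
    return re.sub(r"\s+", " ", cleaned).strip()


print(handle_emojis_sketch("I love programming! \U0001F60A", strategy="replace"))
# I love programming! EMOJI
print(handle_emojis_sketch("I love programming! \U0001F60A", strategy="remove"))
# I love programming!
```

The whitespace cleanup only touches characters matched by `\s`, so a Persian zero-width non-joiner (U+200C) in the text is left alone.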

### Persian Text Preprocessing

```python
from morphopretext import PersianTextPreprocessor

# Initialize the preprocessor with a custom stopword file
persian_preprocessor = PersianTextPreprocessor(stopword_file="stopwords.txt", task="default")

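# Example 1: remove stopwords. The sample reads: "This is a sample text that includes the date 1402/03/15 and punctuation marks."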
persian_text = "این یک متن نمونه است که شامل تاریخ ۱۴۰۲/۰۳/۱۵ و علائم نگارشی است."
cleaned_text = persian_preprocessor.remove_stopwords(persian_text)
print(cleaned_text)  # Output: این متن نمونه شامل تاریخ ۱۴۰۲/۰۳/۱۵ علائم نگارشی است.

# Example 2: remove emojis. The sample reads: "This is a 😊 test text"
persian_text_with_emojis = "این یک متن 😊 تست است"
emoji_removed_text = persian_preprocessor.handle_emojis(persian_text_with_emojis, "remove")
print(emoji_removed_text)  # Output: این یک متن تست است

# Example 3: replace half-spaces (zero-width non-joiners). The sample reads: "This text is a test"
persian_text_with_half_space = "این‌ متن‌ تست‌ است"
cleaned_half_space_text = persian_preprocessor.remove_half_space(persian_text_with_half_space)
print(cleaned_half_space_text)  # Output: این متن تست است

# Example 4: convert a Persian (Jalali) date to Gregorian. The sample reads: "Today's date is 1402/05/20"
persian_date_text = "تاریخ امروز ۱۴۰۲/۰۵/۲۰ است"
converted_date_text = persian_preprocessor.date_converter().handle_persian_dates(persian_date_text, convert_to_standard=True)
print(converted_date_text)  # Output: تاریخ امروز 2023-08-11 است
```
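
For readers curious about what the date conversion in example 4 involves, the sketch below converts Jalali dates written in Persian digits to ISO-style Gregorian dates in pure Python. It is not the package's code: the package lists `jdatetime` among its requirements and presumably relies on it. `jalali_to_gregorian` and `convert_jalali_dates` are hypothetical names, and the arithmetic is a widely circulated 33-year-cycle approximation that has only been spot-checked here on a handful of dates between 1979 and 2023.

```python
import re

# Map Persian digits (U+06F0..U+06F9) to ASCII digits.
PERSIAN_DIGITS = str.maketrans(
    "".join(chr(0x06F0 + i) for i in range(10)), "0123456789"
)

# Assumption: only dates written with Persian digits as YYYY/MM/DD are converted.
JALALI_DATE_RE = re.compile(
    "([\u06F0-\u06F9]{4})/([\u06F0-\u06F9]{1,2})/([\u06F0-\u06F9]{1,2})"
)


def jalali_to_gregorian(jy: int, jm: int, jd: int) -> tuple:
    """Convert a Jalali date to (year, month, day) in the Gregorian calendar.

    No validation is performed on the input date.
    """
    jy += 1595
    days = -355668 + 365 * jy + (jy // 33) * 8 + ((jy % 33) + 3) // 4 + jd
    # The first six Jalali months have 31 days, the following ones 30.
    days += (jm - 1) * 31 if jm < 7 else (jm - 7) * 30 + 186

    gy = 400 * (days // 146097)
    days %= 146097
    if days > 36524:
        days -= 1
        gy += 100 * (days // 36524)
        days %= 36524
        if days >= 365:
            days += 1
    gy += 4 * (days // 1461)
    days %= 1461
    if days > 365:
        gy += (days - 1) // 365
        days = (days - 1) % 365

    # `days` is now a zero-based day of the year; walk the months to split it.
    gd = days + 1
    leap = (gy % 4 == 0 and gy % 100 != 0) or gy % 400 == 0
    month_lengths = [31, 29 if leap else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    gm = 1
    for length in month_lengths:
        if gd <= length:
            break
        gd -= length
        gm += 1
    return gy, gm, gd


def convert_jalali_dates(text: str) -> str:
    """Replace each Persian-digit Jalali date in `text` with YYYY-MM-DD."""
    def _replace(match):
        jy, jm, jd = (int(part.translate(PERSIAN_DIGITS)) for part in match.groups())
        gy, gm, gd = jalali_to_gregorian(jy, jm, jd)
        return f"{gy:04d}-{gm:02d}-{gd:02d}"

    return JALALI_DATE_RE.sub(_replace, text)


# The date 1402/05/20 from example 4, written with Persian digits via escapes.
sample = "\u06F1\u06F4\u06F0\u06F2/\u06F0\u06F5/\u06F2\u06F0"
print(convert_jalali_dates(sample))  # 2023-08-11
```

Restricting the pattern to Persian digits is a design choice for this sketch: it avoids reinterpreting dates that are already Gregorian and written with ASCII digits.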

---

## Project Structure

```
MorphoPreText/
├── morphotext/                    # Package directory
│   ├── __init__.py                # Initialize the package
│   ├── english_text_preprocessor.py
│   ├── persian_text_preprocessor.py
│   ├── Dictionaries_En.py
│   ├── Dictionaries_Fa.py
│   └── stopwords.txt
├── README.md                      # Project description
├── setup.py                       # Packaging configuration
├── requirements.txt               # Dependencies
└── LICENSE                        # License information
```

---

## Contributions

Contributions are welcome! Please fork the repository and submit a pull request with your improvements or bug fixes.

---

## License

This project is licensed under the terms of the MIT License. See the `LICENSE` file for details.

---

## Repository

For more details, visit: https://github.com/ghaskari/MorphoPreText


            
