TextPreprocessorTool

- **Name:** TextPreprocessorTool
- **Version:** 0.0.3
- **Home page:** https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
- **Summary:** A package that automates text preprocessing
- **Upload time:** 2024-12-03 18:33:21
- **Maintainer:** None
- **Docs URL:** None
- **Author:** Gaurav Jaiswal
- **Requires Python:** >=3.8
- **License:** MIT
- **Keywords:** text preprocessing, toolkit, automated text preprocessing
- **Requirements:** pyspellchecker, matplotlib, pandas, spacy, nltk

# Text Preprocessing Toolkit (TPT)

**Version:** 0.0.3  
**Author:** Gaurav Jaiswal  
A comprehensive Python toolkit for preprocessing text, designed to simplify NLP workflows. This package provides various utilities like stopword removal, punctuation handling, spell-checking, lemmatization, and more to clean and preprocess text effectively.

---

## Features

- **Remove Punctuation:** Strips punctuation marks from text.
- **Remove Stopwords:** Removes common stopwords to reduce noise in textual data.
- **Remove Special Characters:** Cleans text by removing unnecessary symbols.
- **Lowercase Conversion:** Standardizes text to lowercase.
- **Spell Correction:** Identifies and corrects misspelled words.
- **Lemmatization:** Converts words to their base forms.
- **Stemming:** Reduces words to their root forms using a stemming algorithm.
- **HTML Tag Removal:** Cleans HTML tags from the text.
- **URL Removal:** Detects and removes URLs (a conceptual sketch of these two steps follows this list).
- **Customizable Pipeline:** Allows users to apply preprocessing steps in a specified order.
- **Quick Dataset Preview:** Provides a summary of text datasets, including word and character counts.
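
As a rough illustration of what the HTML-tag and URL removal steps do conceptually, here is a minimal standard-library sketch. It is not the package's implementation; the regexes and function names are illustrative only.

```python
import re

def strip_html_tags(text: str) -> str:
    # Drop anything that looks like an HTML tag, e.g. "<b>" or "</b>".
    return re.sub(r"<[^>]+>", "", text)

def strip_urls(text: str) -> str:
    # Drop http:// and https:// URLs.
    return re.sub(r"https?://\S+", "", text)

print(strip_urls(strip_html_tags("A <b>bold</b> link: https://example.com")))
# -> "A bold link: " (the URL is gone; stray whitespace may remain)
```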

---

## Installation

Install the package from PyPI with `pip` (or clone the repository and install from source):

```bash
pip install TextPreprocessorTool
```

---

## Usage

### Import the Package

```python
from TPT import TPT
```

### Initialize the Toolkit

You can add custom stopwords during initialization:

```python
tpt = TPT(custom_stopwords=["example", "custom"])
```
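
Assuming the custom stopwords are merged with the default stopword list, they should be dropped by the `remove_stopwords` step:

```python
# "example" and "custom" were registered above, so they should be
# removed along with the default English stopwords.
print(tpt.preprocess("This custom sentence is just an example."))
```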

### Preprocess Text with Default Pipeline

```python
text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)
```

### Customize Preprocessing Steps

```python
custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
```
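
Because `pandas` is already a dependency, a common pattern is to run the pipeline over a whole column of texts. A minimal sketch, assuming `preprocess` accepts a single string and returns the cleaned string:

```python
import pandas as pd
from TPT import TPT

tpt = TPT()
df = pd.DataFrame({"text": [
    "This is a sample text.",
    "Another <b>example</b> with a URL: https://example.com.",
]})

# Apply the default pipeline to every row of the "text" column.
df["clean_text"] = df["text"].apply(tpt.preprocess)
print(df[["text", "clean_text"]])
```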

### Quick Dataset Summary

```python
texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
```

---

## Available Methods

| Method                   | Description                                                     |
|--------------------------|-----------------------------------------------------------------|
| `remove_punctuation`     | Removes punctuation from text.                                 |
| `remove_stopwords`       | Removes stopwords from text.                                   |
| `remove_special_characters` | Cleans text by removing special characters.                  |
| `remove_url`             | Removes URLs from the text.                                    |
| `remove_html_tags`       | Strips HTML tags from text.                                    |
| `correct_spellings`      | Corrects spelling mistakes in the text.                       |
| `lowercase`              | Converts text to lowercase.                                   |
| `lemmatize_text`         | Lemmatizes text using WordNet.                                |
| `stem_text`              | Applies stemming to reduce words to their root forms.         |
| `preprocess`             | Applies a series of preprocessing steps to the input text.    |
| `head`                   | Displays a quick summary of a text dataset.                   |
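
The individual steps can also be called directly. A short sketch, assuming each method takes a string and returns the transformed string (check the docstrings for the exact signatures):

```python
text = "Visit <b>our site</b> at https://example.com!"

no_html = tpt.remove_html_tags(text)       # strip the <b> tags
no_urls = tpt.remove_url(no_html)          # drop the URL
lowered = tpt.lowercase(no_urls)           # normalize the case
cleaned = tpt.remove_punctuation(lowered)  # remove remaining punctuation
print(cleaned)
```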

---

## Example Output

### Input

```text
This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!
```

### Output (Default Pipeline)

```text
sample text check spelling errors
```
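
For reference, the output above corresponds to a call like the following, using the default pipeline from the Usage section:

```python
text = "This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!"
print(tpt.preprocess(text))
# Expected with the default pipeline: "sample text check spelling errors"
```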

---

## Requirements

- Python >= 3.8
- Libraries: `nltk`, `pandas`, `pyspellchecker`, `spacy`, `matplotlib`, `IPython` (see `requirements.txt` for exact version constraints)

To install the dependencies:

```bash
pip install -r requirements.txt
```
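
NLTK's stopword and WordNet data are downloaded separately from the library itself. If you see a `LookupError` about missing corpora, a one-time download like the following usually resolves it (the exact resources needed may vary by version):

```python
import nltk

# Data used for stopword removal and WordNet lemmatization.
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")  # tokenizer models, if tokenization is needed
```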

---

## Contributing

Contributions are welcome! To contribute:

1. Fork this repository.
2. Clone your forked repository.
3. Create a new branch for your feature.
4. Make your changes, write tests, and ensure the test suite passes.
5. Submit a pull request for review.

---

## Testing

To test the package locally:

1. Install development dependencies:
   ```bash
   pip install pytest
   ```
2. Run the tests (a minimal example test is sketched below):
   ```bash
   pytest
   ```
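
An illustrative test along these lines could live in `tests/test_preprocess.py`. It only checks behavior documented above (HTML tags and URLs are removed by the default pipeline) rather than an exact output string:

```python
from TPT import TPT

def test_default_pipeline_strips_html_and_urls():
    tpt = TPT()
    raw = "This is a <b>sample</b> text with a URL: https://example.com."
    cleaned = tpt.preprocess(raw)
    # The default pipeline should remove HTML tags and URLs.
    assert "<b>" not in cleaned
    assert "https://example.com" not in cleaned
```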

---

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

---

## Author

- **Gaurav Jaiswal**  
  [GitHub](https://github.com/Gaurav-Jaiswal-1)  


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git",
    "name": "TextPreprocessorTool",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "text preprocessing, toolkit, automated text preprocessing",
    "author": "Gaurav Jaiswal",
    "author_email": "Gaurav Jaiswal <jaiswalgaurav863@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/5c/f8/fdda09093fc73793051504c0394f222ff59e9290346e0beddca6d25a1380/textpreprocessortool-0.0.3.tar.gz",
    "platform": null,
    "description": "\r\n\r\n# Text Preprocessing Toolkit (TPT)\r\n\r\n**Version:** 0.0.1  \r\n**Author:** Gaurav Jaiswal  \r\nA comprehensive Python toolkit for preprocessing text, designed to simplify NLP workflows. This package provides various utilities like stopword removal, punctuation handling, spell-checking, lemmatization, and more to clean and preprocess text effectively.\r\n\r\n---\r\n\r\n## Features\r\n\r\n- **Remove Punctuation:** Strips punctuation marks from text.\r\n- **Remove Stopwords:** Removes common stopwords to reduce noise in textual data.\r\n- **Remove Special Characters:** Cleans text by removing unnecessary symbols.\r\n- **Lowercase Conversion:** Standardizes text to lowercase.\r\n- **Spell Correction:** Identifies and corrects misspelled words.\r\n- **Lemmatization:** Converts words to their base forms.\r\n- **Stemming:** Reduces words to their root forms using a stemming algorithm.\r\n- **HTML Tag Removal:** Cleans HTML tags from the text.\r\n- **URL Removal:** Detects and removes URLs.\r\n- **Customizable Pipeline:** Allows users to apply preprocessing steps in a specified order.\r\n- **Quick Dataset Preview:** Provides a summary of text datasets, including word and character counts.\r\n\r\n---\r\n\r\n## Installation\r\n\r\nClone the repository or install the package using `pip`:\r\n\r\n```bash\r\npip install Text_Preprocessing_Toolkit\r\n```\r\n\r\n---\r\n\r\n## Usage\r\n\r\n### Import the Package\r\n\r\n```python\r\nfrom TPT import TPT\r\n```\r\n\r\n### Initialize the Toolkit\r\n\r\nYou can add custom stopwords during initialization:\r\n\r\n```python\r\ntpt = TPT(custom_stopwords=[\"example\", \"custom\"])\r\n```\r\n\r\n### Preprocess Text with Default Pipeline\r\n\r\n```python\r\ntext = \"This is an <b>example</b> sentence with a URL: https://example.com.\"\r\nprocessed_text = tpt.preprocess(text)\r\nprint(processed_text)\r\n```\r\n\r\n### Customize Preprocessing Steps\r\n\r\n```python\r\ncustom_steps = [\"lowercase\", \"remove_punctuation\", \"remove_stopwords\"]\r\nprocessed_text = tpt.preprocess(text, steps=custom_steps)\r\nprint(processed_text)\r\n```\r\n\r\n### Quick Dataset Summary\r\n\r\n```python\r\ntexts = [\r\n    \"This is a sample text.\",\r\n    \"Another <b>example</b> with HTML tags and a URL: https://example.com.\",\r\n    \"Spellngg errors corrected!\",\r\n]\r\ntpt.head(texts, n=3)\r\n```\r\n\r\n---\r\n\r\n## Available Methods\r\n\r\n| Method                   | Description                                                     |\r\n|--------------------------|-----------------------------------------------------------------|\r\n| `remove_punctuation`     | Removes punctuation from text.                                 |\r\n| `remove_stopwords`       | Removes stopwords from text.                                   |\r\n| `remove_special_characters` | Cleans text by removing special characters.                  |\r\n| `remove_url`             | Removes URLs from the text.                                    |\r\n| `remove_html_tags`       | Strips HTML tags from text.                                    |\r\n| `correct_spellings`      | Corrects spelling mistakes in the text.                       |\r\n| `lowercase`              | Converts text to lowercase.                                   |\r\n| `lemmatize_text`         | Lemmatizes text using WordNet.                                |\r\n| `stem_text`              | Applies stemming to reduce words to their root forms.         
|\r\n| `preprocess`             | Applies a series of preprocessing steps to the input text.    |\r\n| `head`                   | Displays a quick summary of a text dataset.                   |\r\n\r\n---\r\n\r\n## Example Output\r\n\r\n### Input\r\n\r\n```text\r\nThis is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!\r\n```\r\n\r\n### Output (Default Pipeline)\r\n\r\n```text\r\nsample text check spelling errors\r\n```\r\n\r\n---\r\n\r\n## Requirements\r\n\r\n- Python >= 3.8\r\n- Libraries: `nltk`, `pandas`, `spellchecker`, `IPython`\r\n\r\nTo install the dependencies:\r\n\r\n```bash\r\npip install -r requirements.txt\r\n```\r\n\r\n---\r\n\r\n## Contributing\r\n\r\nContributions are welcome! To contribute:\r\n\r\n1. Fork this repository.\r\n2. Clone your forked repository.\r\n3. Create a new branch for your feature.\r\n4. Make your changes, write tests, and ensure the code passes.\r\n5. Submit a pull request for review.\r\n\r\n---\r\n\r\n## Testing\r\n\r\nTo test the package locally:\r\n\r\n1. Install development dependencies:\r\n   ```bash\r\n   pip install pytest\r\n   ```\r\n2. Run tests:\r\n   ```bash\r\n   pytest\r\n   ```\r\n\r\n---\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License. See the `LICENSE` file for details.\r\n\r\n---\r\n\r\n## Author\r\n\r\n- **Gaurav Jaiswal**  \r\n  [GitHub](https://github.com/Gaurav-Jaiswal-1)  \r\n\r\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package that automates text preprocessing",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit",
        "Issues": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit/issues"
    },
    "split_keywords": [
        "text preprocessing",
        " toolkit",
        " automated text preprocessing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9c73864bc67972d0ac4c7219bbb96e812fce11552b6ca6ef3c9ef0cf67897d56",
                "md5": "644a4d2e6de969191787be459fc4b385",
                "sha256": "764457753ba4120639eab2724468e2c4fc3f3ac7b8c450752b0786a08253624f"
            },
            "downloads": -1,
            "filename": "TextPreprocessorTool-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "644a4d2e6de969191787be459fc4b385",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 5334,
            "upload_time": "2024-12-03T18:33:19",
            "upload_time_iso_8601": "2024-12-03T18:33:19.052088Z",
            "url": "https://files.pythonhosted.org/packages/9c/73/864bc67972d0ac4c7219bbb96e812fce11552b6ca6ef3c9ef0cf67897d56/TextPreprocessorTool-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5cf8fdda09093fc73793051504c0394f222ff59e9290346e0beddca6d25a1380",
                "md5": "d361535bff5a7e2cb6d54e9e8f5a5317",
                "sha256": "ab04420a4b42d62bc56ae8c9c5bdce44a1bccb36dae558114758dd1ce8f8c709"
            },
            "downloads": -1,
            "filename": "textpreprocessortool-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "d361535bff5a7e2cb6d54e9e8f5a5317",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6454,
            "upload_time": "2024-12-03T18:33:21",
            "upload_time_iso_8601": "2024-12-03T18:33:21.066491Z",
            "url": "https://files.pythonhosted.org/packages/5c/f8/fdda09093fc73793051504c0394f222ff59e9290346e0beddca6d25a1380/textpreprocessortool-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-03 18:33:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Gaurav-Jaiswal-1",
    "github_project": "Text-Preprocessing-Toolkit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pyspellchecker",
            "specs": [
                [
                    "==",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.1"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    "<",
                    "3.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.5"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "textpreprocessortool"
}
        