# Text Preprocessing Toolkit (TPT)
**Version:** 0.0.1
**Author:** Gaurav Jaiswal
A comprehensive Python toolkit for preprocessing text, designed to simplify NLP workflows. It provides utilities for stopword removal, punctuation handling, spell checking, lemmatization, stemming, and more to clean and preprocess text effectively.
---
## Features
- **Remove Punctuation:** Strips punctuation marks from text.
- **Remove Stopwords:** Removes common stopwords to reduce noise in textual data.
- **Remove Special Characters:** Cleans text by removing unnecessary symbols.
- **Lowercase Conversion:** Standardizes text to lowercase.
- **Spell Correction:** Identifies and corrects misspelled words.
- **Lemmatization:** Converts words to their base forms.
- **Stemming:** Reduces words to their root forms using a stemming algorithm.
- **HTML Tag Removal:** Cleans HTML tags from the text.
- **URL Removal:** Detects and removes URLs.
- **Customizable Pipeline:** Allows users to apply preprocessing steps in a specified order.
- **Quick Dataset Preview:** Provides a summary of text datasets, including word and character counts.
---
## Installation
Install the package with `pip` (or clone the repository to work from source):
```bash
pip install Text_Preprocessing_Toolkit
```
---
## Usage
### Import the Package
```python
from TPT import TPT
```
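Note that the import name (`TPT`) differs from the distribution name used for installation.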
### Initialize the Toolkit
You can add custom stopwords during initialization:
```python
tpt = TPT(custom_stopwords=["example", "custom"])
```
### Preprocess Text with Default Pipeline
```python
text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)
```
### Customize Preprocessing Steps
```python
custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
```
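Steps run in the order given, so here the text is lowercased before punctuation and stopword removal.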
### Quick Dataset Summary
```python
texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
```
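For each text, the summary reports basic statistics such as word and character counts.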
---
## Available Methods
| Method | Description |
|--------------------------|-----------------------------------------------------------------|
| `remove_punctuation` | Removes punctuation from text. |
| `remove_stopwords` | Removes stopwords from text. |
| `remove_special_characters` | Cleans text by removing special characters. |
| `remove_url` | Removes URLs from the text. |
| `remove_html_tags` | Strips HTML tags from text. |
| `correct_spellings` | Corrects spelling mistakes in the text. |
| `lowercase` | Converts text to lowercase. |
| `lemmatize_text` | Lemmatizes text using WordNet. |
| `stem_text` | Applies stemming to reduce words to their root forms. |
| `preprocess` | Applies a series of preprocessing steps to the input text. |
| `head` | Displays a quick summary of a text dataset. |
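The individual cleaning steps can also be applied on their own. Below is a minimal sketch, assuming the methods listed above are instance methods that accept a string and return the cleaned string (the exact signatures may differ):

```python
from TPT import TPT

tpt = TPT()

# Assumed usage: each method takes a string and returns the cleaned string.
no_html = tpt.remove_html_tags("<p>Hello <b>world</b></p>")
no_urls = tpt.remove_url("Visit https://example.com for details")
lowered = tpt.lowercase("MiXeD CaSe TeXt")

print(no_html)
print(no_urls)
print(lowered)
```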
---
## Example Output
### Input
```text
This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!
```
### Output (Default Pipeline)
```text
sample text check spelling errors
```
---
## Requirements
- Python >= 3.8
- Libraries: `nltk`, `pandas`, `pyspellchecker`, `IPython`
To install the dependencies:
```bash
pip install -r requirements.txt
```
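Lemmatization and stopword removal rely on NLTK corpora (WordNet and the stopwords list). If those resources are not already present on your machine, a one-time download may be required; a minimal sketch, assuming the toolkit does not fetch them automatically:

```python
import nltk

# One-time downloads of the NLTK resources used for stopword removal and lemmatization.
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")  # extra WordNet data required by some newer NLTK versions
```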
---
## Contributing
Contributions are welcome! To contribute:
1. Fork this repository.
2. Clone your forked repository.
3. Create a new branch for your feature.
4. Make your changes, write tests, and ensure all tests pass.
5. Submit a pull request for review.
---
## Testing
To test the package locally:
1. Install development dependencies:
   ```bash
   pip install pytest
   ```
2. Run tests:
   ```bash
   pytest
   ```
---
## License
This project is licensed under the MIT License. See the `LICENSE` file for details.
---
## Author
- **Gaurav Jaiswal**  
  [GitHub](https://github.com/Gaurav-Jaiswal-1)