# TPTK

- **Name**: TPTK
- **Version**: 1.0.1
- **Summary**: A Python package for automating text preprocessing tasks.
- **Homepage**: https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit
- **Author**: Gaurav Jaiswal
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: text preprocessing, NLP, text cleaning
- **Requirements**: pyspellchecker, matplotlib, pandas, spacy, nltk
- **Uploaded**: 2024-12-08 13:48:15
# TextPreprocessor

A comprehensive and modular text preprocessing library for Natural Language Processing (NLP) tasks. It provides utilities for common preprocessing steps like tokenization, lemmatization, spell correction, and stopword removal, making it easier to prepare text data for machine learning and data analysis.

---

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Preprocessing Steps](#preprocessing-steps)
- [Class Methods](#class-methods)
- [Examples](#examples)
- [Contributing](#contributing)
- [License](#license)

---

## Features

- **Customizable Pipeline**: Flexible pipeline to execute selected preprocessing steps.
- **Spell Correction**: Automatically fixes misspelled words.
- **Stopword Removal**: Supports customizable stopword lists.
- **Lemmatization**: Converts words to their base forms.
- **Punctuation and Special Character Removal**: Cleans the text by removing unnecessary characters.
- **URL and HTML Tag Removal**: Handles noisy inputs containing links and tags.
- **Dataset Preprocessing**: Processes datasets and provides statistics like word and character counts.
- **Logging**: Logs every preprocessing step for better debugging and transparency.
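The "customizable pipeline" idea can be pictured as a dispatch over step names. The sketch below is a standalone illustration of that pattern, not TPTK's actual code; the step names and helpers here are simplified stand-ins:

```python
# Minimal illustration of a name-driven preprocessing pipeline.
# Step names and behavior are simplified stand-ins, not TPTK's real code.

STEPS = {
    "lowercase": str.lower,
    "strip_whitespace": str.strip,
    "collapse_spaces": lambda s: " ".join(s.split()),
}

def run_pipeline(text, steps):
    """Apply the named steps in order, skipping unknown names."""
    for name in steps:
        func = STEPS.get(name)
        if func is not None:
            text = func(text)
    return text

cleaned = run_pipeline("  Hello   WORLD  ", ["lowercase", "collapse_spaces"])
print(cleaned)  # -> "hello world"
```

Because each step is just a string-to-string function, adding a new step means registering one more entry in the mapping.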

---

## Installation

The package is published on PyPI as `TPTK`, so the simplest route is:

```bash
pip install TPTK
```

Alternatively, to work from source, ensure you have Python 3.8+ installed (the package declares `requires_python>=3.8`), then clone the repository and install the dependencies:

```bash
git clone https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
cd Text-Preprocessing-Toolkit
pip install -r requirements.txt
```

---

## Usage

The `TextPreprocessor` class is designed to preprocess both individual text strings and datasets. The examples below cover both cases:

### Initializing the Preprocessor
```python
from text_preprocessor import TextPreprocessor

# Initialize with optional custom stopwords
preprocessor = TextPreprocessor(custom_stopwords=["example", "test"])
```

### Preprocessing Text
```python
sample_text = "This is an <b>example</b> text with a URL: https://example.com."
clean_text = preprocessor.preprocess(sample_text)
print(clean_text)
```

### Preprocessing a Dataset
```python
import pandas as pd

dataset = pd.Series([
    "This is an example text.",
    "HTML tags like <p> are removed.",
    "Misspelled wrods are corrected.",
    "12345 is just a number.",
    None
])

processed_data = preprocessor.preprocess_dataset(dataset)
print(processed_data)
```

---

## Preprocessing Steps

The `preprocess` function supports the following steps, which can be customized:
- `lowercase`: Converts text to lowercase.
- `remove_url`: Removes URLs from the text.
- `remove_html_tags`: Strips out HTML tags.
- `remove_punctuation`: Removes punctuation marks.
- `remove_special_characters`: Removes non-alphanumeric characters.
- `correct_spellings`: Corrects misspelled words.
- `lemmatize_text`: Lemmatizes words to their base forms.

To override the default sequence, pass a list of step names as the `steps` argument of `preprocess`.
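For intuition, steps such as `remove_url` and `remove_html_tags` are typically regex-based. The patterns below are illustrative approximations, not the package's exact implementation:

```python
import re

def remove_url(text):
    # Drop http(s)/www-style links; a simplified pattern for illustration.
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_html_tags(text):
    # Strip anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)

text = "See <b>this</b> link: https://example.com now"
print(remove_html_tags(remove_url(text)))
```

Note that naive patterns like these can leave double spaces behind, which is why pipelines often finish with a whitespace-normalization step.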

---

## Class Methods

- `tokenize(text: str) -> List[str]`: Tokenizes text into words.
- `remove_punctuation(text: str) -> str`: Removes punctuation.
- `remove_stopwords(tokens: List[str]) -> List[str]`: Removes stopwords from tokenized text.
- `remove_special_characters(text: str) -> str`: Removes special characters.
- `lemmatize_text(text: str) -> str`: Lemmatizes words using WordNet.
- `correct_spellings(text: str) -> str`: Corrects misspellings using `pyspellchecker`.
- `remove_url(text: str) -> str`: Removes URLs from text.
- `remove_html_tags(text: str) -> str`: Strips HTML tags from text.
- `preprocess(text: str, steps: Optional[List[str]] = None) -> str`: Preprocesses text based on selected steps.
- `preprocess_dataset(texts: Union[List[str], pd.Series], n: int = 5) -> pd.DataFrame`: Preprocesses a dataset of text entries.
- `head(text: str, n: int = 10) -> str`: Displays the first `n` characters of the text.

---

## Examples

### Custom Preprocessing
```python
text = "An EXAMPLE text with HTML <b>tags</b> and a URL: https://example.com."
custom_steps = ["lowercase", "remove_html_tags", "lemmatize_text"]
processed_text = preprocessor.preprocess(text, steps=custom_steps)
print(processed_text)
```

### Displaying Dataset Statistics
```python
dataset = pd.Series(["Sample text 1.", "Another <p>example</p>."])
preprocessor.preprocess_dataset(dataset)
```
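The word and character counts that `preprocess_dataset` reports can be computed with pandas string accessors; the column names below are assumptions for illustration, not necessarily the ones TPTK uses:

```python
import pandas as pd

texts = pd.Series(["Sample text 1.", "Another example."])
stats = pd.DataFrame({
    "text": texts,
    "word_count": texts.str.split().str.len(),  # whitespace-delimited words
    "char_count": texts.str.len(),              # raw character length
})
print(stats)
```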

### Using the `head` Function
```python
sample_text = "This is a sample text for the head function."
print(preprocessor.head(sample_text, 15))
```

---

## Contributing

We welcome contributions! Please follow these steps:
1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a clear description of your changes.

---

## License

This project is licensed under the [MIT License](LICENSE).

---

## Acknowledgments

- [NLTK](https://www.nltk.org) for NLP tools.
- [PySpellChecker](https://github.com/barrust/pyspellchecker) for spell checking.
- The Python community for inspiring open-source projects.