# Text Preprocessing Toolkit
This repository contains a Python package for text preprocessing. The toolkit provides functions for steps such as tokenization, lemmatization, stopword removal, and text normalization, and lets you choose which steps to apply. It is intended for preparing text for downstream natural language processing (NLP) and machine learning work.
## Features
- **Lowercasing**: Convert all text to lowercase.
- **Punctuation Removal**: Remove punctuation marks from text.
- **Stopword Removal**: Remove common words (e.g., "and", "the") that do not contribute much meaning.
- **Lemmatization**: Reduce words to their base or root form (e.g., "running" -> "run").
- **Spell Correction**: Correct misspelled words in the text.
- **URL and HTML Tag Removal**: Clean URLs and HTML tags from text.
- **Special Character Removal**: Remove non-alphanumeric characters.
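To make a few of the features above concrete, here is a minimal standard-library sketch. It is not the toolkit's implementation: the function names (`strip_html_tags`, `strip_urls`, `strip_punctuation`, `basic_clean`) and the regular expressions are assumptions chosen for illustration, and the patterns are deliberately simple.

```python
import re
import string

# Illustrative patterns only; real-world URLs and HTML need more careful handling.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_html_tags(text: str) -> str:
    """Drop anything that looks like an HTML tag."""
    return HTML_TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    """Drop http(s):// and www. URLs."""
    return URL_RE.sub("", text)


def strip_punctuation(text: str) -> str:
    """Drop ASCII punctuation characters."""
    return text.translate(PUNCT_TABLE)


def basic_clean(text: str) -> str:
    """Hypothetical pipeline: tags, URLs, lowercase, punctuation, whitespace."""
    text = strip_html_tags(text)
    text = strip_urls(text)
    text = text.lower()
    text = strip_punctuation(text)
    return " ".join(text.split())  # collapse runs of whitespace


print(basic_clean("<p>Visit https://example.com NOW, please!</p>"))
# prints: visit now please
```

In this sketch, tags and URLs are stripped before punctuation. Stripping punctuation first would turn `https://example.com` into `httpsexamplecom`, which the URL pattern would no longer match.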
## Requirements
- Python 3.8 or higher
- `flake8` for linting
- `pytest` for testing
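- The runtime dependencies listed in `requirements.txt` (per the package metadata: `pyspellchecker==0.7.1`, `matplotlib>=3.1`, `pandas>=1.1`, `spacy<3.0`, `nltk>=3.5`)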
## Installation
To install the package, clone the repository and install the necessary dependencies.
### Clone the repository:
```bash
git clone https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
cd Text-Preprocessing-Toolkit
```
### Install dependencies:
```bash
pip install -r requirements.txt
```
Alternatively, to install the package itself (along with its dependencies) into your active Python environment:
```bash
pip install .
```
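After installing, you may want to confirm that the distribution is visible to your interpreter. The following is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8). The helper name `installed_version` is hypothetical; the distribution name `Text-Preprocessing-Toolkit` is taken from the package metadata.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("Text-Preprocessing-Toolkit")
if version is None:
    print("Text-Preprocessing-Toolkit is not installed in this environment")
else:
    print(f"Text-Preprocessing-Toolkit {version} is installed")
```

If the helper reports that the package is missing, check that the `pip` you ran belongs to the same environment as the interpreter running this script.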
## Usage
You can use this toolkit in your Python project by importing the preprocessing functions:
```python
from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings"
])

print(cleaned_text)
```
### Available Preprocessing Steps:
- **lowercase**: Convert text to lowercase.
- **remove_punctuation**: Remove punctuation characters.
- **remove_stopwords**: Remove stopwords (common words like 'the', 'and', etc.).
- **lemmatize_text**: Lemmatize words (reduce to base form).
- **remove_special_characters**: Remove special characters from text.
- **remove_url**: Remove URLs from text.
- **remove_html_tags**: Remove HTML tags.
- **correct_spellings**: Correct common spelling mistakes.
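One common way to drive a pipeline from a list of step names like the one above is a name-to-function registry. The sketch below shows that pattern under stated assumptions: `STEP_REGISTRY` and `apply_steps` are hypothetical names, only two steps are registered, and nothing here is confirmed to match how `processor.preprocess` works internally.

```python
import string
from typing import Callable, Dict, List

# Hypothetical registry for illustration; not the toolkit's actual internals.
STEP_REGISTRY: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
}


def apply_steps(text: str, steps: List[str]) -> str:
    """Apply the named steps in the order given, failing fast on unknown names."""
    for name in steps:
        if name not in STEP_REGISTRY:
            raise ValueError(f"Unknown preprocessing step: {name!r}")
        text = STEP_REGISTRY[name](text)
    return text


print(apply_steps("Hello, World!", ["lowercase", "remove_punctuation"]))
# prints: hello world
```

Raising on an unknown name, rather than silently skipping it, makes typos in the `steps` list visible immediately.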
## Running Tests
This repository includes unit and integration tests using `pytest`. To run the tests:
1. Install `pytest` if you haven't already:
```bash
pip install pytest
```
2. Run the tests:
```bash
pytest
```
Tests are located in the `tests/` directory.
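As a rough idea of what a test in this style looks like, here is a self-contained sketch. The file name and the stand-in function `to_lowercase` are hypothetical; a real test would import the function under test from the package instead of defining it inline.

```python
# tests/test_lowercase_sketch.py  (hypothetical file name)


def to_lowercase(text: str) -> str:
    """Stand-in for a toolkit step, defined here so the sketch runs on its own."""
    return text.lower()


def test_lowercase_converts_mixed_case():
    assert to_lowercase("Hello World") == "hello world"


def test_lowercase_is_idempotent():
    once = to_lowercase("MiXeD")
    assert to_lowercase(once) == once


def test_lowercase_keeps_empty_string():
    assert to_lowercase("") == ""
```

`pytest` collects functions whose names start with `test_` from files named `test_*.py`, so no extra registration is needed.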
## Code Linting
This project uses `flake8` for linting. To check the code for style issues:
```bash
flake8 text_preprocessing_toolkit
```
## CI/CD
This repository uses GitHub Actions for continuous integration and continuous deployment (CI/CD). On every push to the `main` branch and every pull request targeting `main`, the workflow runs the following steps automatically:
- **Linting**: Code will be checked for style issues using `flake8`.
- **Testing**: Unit tests will be run using `pytest`.
- **Build**: The package will be built using `python -m build`.
- **Publish**: The package will be uploaded to PyPI (if a release is created).
## Contributing
We welcome contributions! If you'd like to contribute to the project, please follow these steps:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature-name`).
3. Make your changes and commit them (`git commit -m 'Add feature'`).
4. Push to your forked repository (`git push origin feature-name`).
5. Create a pull request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git",
"name": "Text-Preprocessing-Toolkit",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "text preprocessing, toolkit, automation",
"author": "Gaurav Jaiswal",
"author_email": "Gaurav Jaiswal <jaiswalgaurav863@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/52/f8/0959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4/text_preprocessing_toolkit-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A package that automates text preprocessing",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit",
"Issues": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit/issues"
},
"split_keywords": [
"text preprocessing",
" toolkit",
" automation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7ba637c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9",
"md5": "cdd035c66f2fef0061e8d96b0a81d453",
"sha256": "316fd8a70bc092b33a839c7099715ef0aaf53d54aadbfb3aa9cc8fea351a7cef"
},
"downloads": -1,
"filename": "Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cdd035c66f2fef0061e8d96b0a81d453",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 4909,
"upload_time": "2024-12-01T19:43:20",
"upload_time_iso_8601": "2024-12-01T19:43:20.521519Z",
"url": "https://files.pythonhosted.org/packages/7b/a6/37c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9/Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "52f80959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4",
"md5": "03f1e453fd8290330ebd44cb63444c61",
"sha256": "73692ce178b69c2324b29bef627df63b06c6db398fac0b02ed9fd8ea1a0af606"
},
"downloads": -1,
"filename": "text_preprocessing_toolkit-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "03f1e453fd8290330ebd44cb63444c61",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 6150,
"upload_time": "2024-12-01T19:43:22",
"upload_time_iso_8601": "2024-12-01T19:43:22.612582Z",
"url": "https://files.pythonhosted.org/packages/52/f8/0959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4/text_preprocessing_toolkit-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-01 19:43:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Gaurav-Jaiswal-1",
"github_project": "Text-Preprocessing-Toolkit",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pyspellchecker",
"specs": [
[
"==",
"0.7.1"
]
]
},
{
"name": "matplotlib",
"specs": [
[
">=",
"3.1"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.1"
]
]
},
{
"name": "spacy",
"specs": [
[
"<",
"3.0"
]
]
},
{
"name": "nltk",
"specs": [
[
">=",
"3.5"
]
]
}
],
"tox": true,
"lcname": "text-preprocessing-toolkit"
}