Text-Preprocessing-Toolkit

Name	Text-Preprocessing-Toolkit
Version	0.0.1
home_page	https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
Summary	A package that automates text preprocessing
upload_time	2024-12-01 19:43:22
maintainer	None
docs_url	None
author	Gaurav Jaiswal
requires_python	>=3.8
license	MIT
keywords	text preprocessing toolkit automation
requirements	pyspellchecker matplotlib pandas spacy nltk

# Text Preprocessing Toolkit

This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.

## Features

- **Lowercasing**: Convert all text to lowercase.
- **Punctuation Removal**: Remove punctuation marks from text.
- **Stopword Removal**: Remove common words (e.g., "and", "the") that do not contribute much meaning.
- **Lemmatization**: Reduce words to their base or root form (e.g., "running" -> "run").
- **Spell Correction**: Correct misspelled words in the text.
- **URL and HTML Tag Removal**: Clean URLs and HTML tags from text.
- **Special Character Removal**: Remove non-alphanumeric characters.
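
To make the rule-based features above concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the toolkit's own implementation: the function names, the regular expressions, and the order of operations inside `clean` are all assumptions made for this example.

```python
import html
import re
import string

# Illustrative patterns (assumptions, not taken from the toolkit's source).
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")


def remove_urls(text: str) -> str:
    """Replace anything that looks like a URL with a space."""
    return URL_RE.sub(" ", text)


def remove_html_tags(text: str) -> str:
    """Strip tags, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_special_characters(text: str) -> str:
    """Replace anything that is not a letter, digit, or whitespace."""
    return NON_ALNUM_RE.sub(" ", text)


def clean(text: str) -> str:
    """Apply the steps in an order that removes markup before punctuation."""
    text = remove_html_tags(text)
    text = remove_urls(text)
    text = text.lower()
    text = remove_punctuation(text)
    text = remove_special_characters(text)
    return " ".join(text.split())  # collapse repeated whitespace


print(clean("<p>Visit https://example.com NOW &amp; save 20%!</p>"))
# prints: visit now save 20
```

Stopword removal, lemmatization, and spell correction are left out here because they depend on external language resources rather than simple pattern matching.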

## Requirements

- Python 3.8 or higher
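- `flake8` for linting (development only)
- `pytest` for testing (development only)
- Runtime dependencies defined in `requirements.txt`; the package metadata lists `pyspellchecker==0.7.1`, `matplotlib>=3.1`, `pandas>=1.1`, `spacy<3.0`, and `nltk>=3.5`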
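
As a quick sanity check of an environment against the requirements above, the sketch below reports the Python version and which runtime dependencies are installed. The dependency names come from the package metadata; the helper name `installed_versions` is hypothetical, and the script only reports versions without comparing them to the pinned ranges.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names as listed in the package metadata.
RUNTIME_DEPENDENCIES = ["pyspellchecker", "matplotlib", "pandas", "spacy", "nltk"]


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if sys.version_info < (3, 8):
    raise SystemExit("Python 3.8 or higher is required")

for name, version in installed_versions(RUNTIME_DEPENDENCIES).items():
    print(f"{name}: {version or 'not installed'}")
```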

## Installation

To install the package, clone the repository and install the necessary dependencies.

### Clone the repository:

```bash
git clone https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
cd Text-Preprocessing-Toolkit
```

### Install dependencies:

```bash
pip install -r requirements.txt
```

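Alternatively, install the package itself from the cloned source, preferably inside a virtual environment. The published release can also be installed directly from PyPI with `pip install Text-Preprocessing-Toolkit`.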

```bash
pip install .
```

## Usage

You can use this toolkit in your Python project by importing the preprocessing functions:

```python
from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

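# Preprocess text. Markup and URLs are stripped first so that punctuation
# removal does not mangle them (this assumes steps run in the order listed).
cleaned_text = processor.preprocess(text, steps=[
    "remove_html_tags",
    "remove_url",
    "lowercase",
    "correct_spellings",
    "remove_punctuation",
    "remove_special_characters",
    "remove_stopwords",
    "lemmatize_text"
])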

print(cleaned_text)
```

### Available Preprocessing Steps:

- **lowercase**: Convert text to lowercase.
- **remove_punctuation**: Remove punctuation characters.
- **remove_stopwords**: Remove stopwords (common words like 'the', 'and', etc.).
- **lemmatize_text**: Lemmatize words (reduce to base form).
- **remove_special_characters**: Remove special characters from text.
- **remove_url**: Remove URLs from text.
- **remove_html_tags**: Remove HTML tags.
- **correct_spellings**: Correct common spelling mistakes.
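
The order in which steps are applied can change the result. The following sketch shows one way a name-to-function pipeline might be wired, and why URL removal should generally come before punctuation removal. It is a hypothetical stand-in, not the toolkit's code: `STEP_FUNCTIONS`, `run_steps`, and the tiny stopword set are assumptions for illustration, and only four of the step names are implemented.

```python
import re
import string
from typing import Callable, Dict, List

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "is", "at", "of", "to"}

# Hypothetical mapping from step name to a text -> text function.
STEP_FUNCTIONS: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_url": lambda t: re.sub(r"https?://\S+", " ", t),
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "remove_stopwords": lambda t: " ".join(
        w for w in t.split() if w.lower() not in STOPWORDS
    ),
}


def run_steps(text: str, steps: List[str]) -> str:
    """Apply the named steps in the order given and normalize whitespace."""
    for name in steps:
        if name not in STEP_FUNCTIONS:
            raise ValueError(f"Unknown preprocessing step: {name!r}")
        text = STEP_FUNCTIONS[name](text)
    return " ".join(text.split())


sample = "Read the docs at https://example.com/guide today"

# URL removed first: the link disappears cleanly.
print(run_steps(sample, ["remove_url", "lowercase", "remove_punctuation", "remove_stopwords"]))
# prints: read docs today

# Punctuation removed first: the URL loses "://" and "." and no longer matches.
print(run_steps(sample, ["remove_punctuation", "remove_url", "lowercase", "remove_stopwords"]))
# prints: read docs httpsexamplecomguide today
```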

## Running Tests

This repository includes unit and integration tests using `pytest`. To run the tests:

1. Install `pytest` if you haven't already:

```bash
pip install pytest
```

2. Run the tests:

```bash
pytest
```

Tests are located in the `tests/` directory.
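
For orientation, a `pytest` test module can be as small as the sketch below. The file name and the `remove_html_tags` stand-in are hypothetical; in the real test suite the function under test would be imported from the package instead of being defined in the test file.

```python
# tests/test_cleaning.py (hypothetical file name)
import re


def remove_html_tags(text: str) -> str:
    """Stand-in for the function under test; replace with the real import."""
    return re.sub(r"<[^>]+>", "", text)


def test_removes_simple_tags():
    assert remove_html_tags("<b>bold</b> text") == "bold text"


def test_leaves_plain_text_untouched():
    assert remove_html_tags("no markup here") == "no markup here"


def test_handles_empty_string():
    assert remove_html_tags("") == ""
```

`pytest` collects any function whose name starts with `test_` in files named `test_*.py`, so no registration step is needed.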

## Code Linting

This project uses `flake8` for linting. To check the code for style issues:

```bash
flake8 text_preprocessing_toolkit
```

## CI/CD

This repository uses GitHub Actions for continuous integration and continuous deployment (CI/CD). On every push or pull request targeting the `main` branch, the workflow runs the following steps automatically:

- **Linting**: Code will be checked for style issues using `flake8`.
- **Testing**: Unit tests will be run using `pytest`.
- **Build**: The package will be built using `python -m build`.
- **Publish**: The package will be uploaded to PyPI (if a release is created).

## Contributing

We welcome contributions! If you'd like to contribute to the project, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature-name`).
3. Make your changes and commit them (`git commit -m 'Add feature'`).
4. Push to your forked repository (`git push origin feature-name`).
5. Create a pull request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

### Notes:
- If you are working from a fork, use your fork's URL in the `git clone` command.



Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git",
    "name": "Text-Preprocessing-Toolkit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "text preprocessing, toolkit, automation",
    "author": "Gaurav Jaiswal",
    "author_email": "Gaurav Jaiswal <jaiswalgaurav863@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/52/f8/0959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4/text_preprocessing_toolkit-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package that automates text preprocessing",
    "version": "0.0.1",
    "project_urls": {
        "Homepage": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit",
        "Issues": "https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit/issues"
    },
    "split_keywords": [
        "text preprocessing",
        " toolkit",
        " automation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7ba637c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9",
                "md5": "cdd035c66f2fef0061e8d96b0a81d453",
                "sha256": "316fd8a70bc092b33a839c7099715ef0aaf53d54aadbfb3aa9cc8fea351a7cef"
            },
            "downloads": -1,
            "filename": "Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cdd035c66f2fef0061e8d96b0a81d453",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 4909,
            "upload_time": "2024-12-01T19:43:20",
            "upload_time_iso_8601": "2024-12-01T19:43:20.521519Z",
            "url": "https://files.pythonhosted.org/packages/7b/a6/37c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9/Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "52f80959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4",
                "md5": "03f1e453fd8290330ebd44cb63444c61",
                "sha256": "73692ce178b69c2324b29bef627df63b06c6db398fac0b02ed9fd8ea1a0af606"
            },
            "downloads": -1,
            "filename": "text_preprocessing_toolkit-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "03f1e453fd8290330ebd44cb63444c61",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6150,
            "upload_time": "2024-12-01T19:43:22",
            "upload_time_iso_8601": "2024-12-01T19:43:22.612582Z",
            "url": "https://files.pythonhosted.org/packages/52/f8/0959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4/text_preprocessing_toolkit-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-01 19:43:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Gaurav-Jaiswal-1",
    "github_project": "Text-Preprocessing-Toolkit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pyspellchecker",
            "specs": [
                [
                    "==",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.1"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    "<",
                    "3.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.5"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "text-preprocessing-toolkit"
}
