nlp-text-clean


Name: nlp-text-clean
Version: 0.1.4
home_page: None
Summary: A simple and configurable text preprocessing library for NLP tasks.
upload_time: 2025-09-13 16:10:11
maintainer: None
docs_url: None
author: None
requires_python: >=3.8
license: None
keywords: nlp, text-cleaning, preprocessing, stopwords, lemmatization
requirements: nltk, spacy, pytest
Travis-CI: No Travis.
coveralls test coverage: No coveralls.

# NLP Text Cleaner

[![PyPI Version](https://img.shields.io/pypi/v/nlp-text-clean)](https://pypi.org/project/nlp-text-clean/)
[![License](https://img.shields.io/pypi/l/nlp-text-clean)](https://opensource.org/licenses/MIT)
[![Python Versions](https://img.shields.io/pypi/pyversions/nlp-text-clean)](https://pypi.org/project/nlp-text-clean/)

A lightweight, configurable Python library for **text preprocessing** in NLP tasks.  
It helps you clean raw text by lowercasing, removing noise, lemmatizing, and more — ready for **machine learning** or **deep learning** models.

---

## ✨ Features
- Convert text to lowercase  
- Remove special characters  
- Remove punctuation (optional)  
- Remove numbers or replace them with `<NUM>` token  
- Remove extra spaces  
- Lemmatization (default) or stemming (optional)  
- Stopword removal (customizable)  
- Multi-language support (via **NLTK** / **spaCy**)  
- Add your own **custom stopwords**  
- Remove HTML tags
- Remove URLs
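
As a rough illustration of what the rule-based steps in the list above involve, here is a minimal standard-library sketch. It is not the library's implementation: the function name `basic_clean` and the regular expressions are assumptions made for this example, and lemmatization, stemming, and stopword removal are left out because they need NLTK or spaCy.

```python
import re

# Illustrative patterns only; the library's real patterns may differ.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCTUATION = re.compile(r"[^\w\s]")
NUMBER = re.compile(r"\b\d+\b")
WHITESPACE = re.compile(r"\s+")


def basic_clean(text, remove_numbers=False, preserve_num_token=False,
                remove_punctuation=True, remove_html_tags=False,
                remove_urls=False):
    """Hypothetical helper: lowercase, strip noise, collapse spaces."""
    text = text.lower()
    if remove_html_tags:
        text = HTML_TAG.sub(" ", text)
    if remove_urls:
        text = URL.sub(" ", text)
    if remove_punctuation:
        text = PUNCTUATION.sub(" ", text)
    # Numbers are handled after punctuation so "<NUM>" keeps its brackets.
    if preserve_num_token:
        text = NUMBER.sub("<NUM>", text)
    elif remove_numbers:
        text = NUMBER.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


raw = "<p>Visit https://example.com NOW!!! It costs 100 dollars.</p>"
print(basic_clean(raw, remove_html_tags=True, remove_urls=True,
                  preserve_num_token=True))
# visit now it costs <NUM> dollars
```

The order of the steps matters in this sketch: HTML tags and URLs are removed before punctuation so that their delimiters are still intact when the patterns run.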


## Installation

From **PyPI**:
```bash
pip install nlp-text-clean
```

## Usage

```python
from nlp_text_clean.cleaner import TextCleaner

# Initialize with custom options
cleaner = TextCleaner(
    remove_numbers=True,
    remove_punctuation=True,
    remove_special_chars=True,
    use_stemming=False,
    use_lemmatization=True,
    language="english",
    custom_stopwords=None,
    preserve_num_token=False,
    remove_html_tags=False,
    remove_urls=False
)

text = "He scored 100 marks!!! Running better than others."
cleaned = cleaner.clean_text(text)

print(cleaned)
# Output: "score <NUM> mark run well"
```

## Parameters
| Parameter Name        | Type         | Default   | Description                                      |
|-----------------------|--------------|-----------|--------------------------------------------------|
| remove_numbers        | bool         | False     | Remove numbers from text                         |
| preserve_num_token    | bool         | False     | Replace numbers with `<NUM>` token               |
| remove_punctuation    | bool         | True      | Remove punctuation marks                         |
| remove_special_chars  | bool         | True      | Remove non-alphanumeric characters               |
| use_lemmatization     | bool         | True      | Apply lemmatization                              |
| use_stemming          | bool         | False     | Apply stemming instead of lemmatization          |
| stopwords_language    | str          | "english" | Language for stopwords                           |
| custom_stopwords      | list[str]    | None      | Add your own stopwords                           |
| remove_html_tags      | bool         | False     | Remove HTML tags                                 |
| remove_urls           | bool         | False     | Remove URLs                                      |
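
To make the `custom_stopwords` option more concrete, the sketch below shows one plausible way a custom list could be merged with a per-language base list before filtering tokens. Everything here is an assumption for illustration: `BASE_STOPWORDS`, `remove_stopwords`, and the tiny word list are hypothetical, and the library may combine the lists differently. Note also that the table above lists `stopwords_language` while the usage example passes `language`; the sketch uses its own plain `language` argument and does not settle which name the package accepts.

```python
# A deliberately tiny, hypothetical stopword table. A real pipeline would
# load a full list, for example from NLTK's stopwords corpus.
BASE_STOPWORDS = {
    "english": {"a", "an", "the", "is", "it", "he", "she", "than", "and", "of", "to"},
}


def remove_stopwords(tokens, language="english", custom_stopwords=None):
    """Drop tokens found in the base list or in the caller's custom list."""
    stopwords = set(BASE_STOPWORDS.get(language, ()))
    if custom_stopwords:
        # Custom entries extend the base list rather than replace it.
        stopwords.update(word.lower() for word in custom_stopwords)
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "he scored marks better than others".split()
print(remove_stopwords(tokens))
# ['scored', 'marks', 'better', 'others']
print(remove_stopwords(tokens, custom_stopwords=["others"]))
# ['scored', 'marks', 'better']
```

Extending the base list, instead of replacing it, keeps the default behaviour intact when no custom words are given.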


## Running Tests
```bash
pytest
```

## Why use this package?
Other preprocessing tools exist (such as NLTK or clean-text), but nlp-text-clean is designed to be:

1) Lightweight & Easy to Use → Just one class, minimal setup

2) Highly Configurable → Toggle lemmatization, stemming, number handling, punctuation, etc.

3) Production Ready → Clean and consistent outputs for ML pipelines

4) Customizable → Extend with your own stopwords or language models

5) Multi-language → Works beyond English when supported by spaCy/NLTK

If you want a plug-and-play solution to clean messy text quickly, this is the right tool.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "nlp-text-clean",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "nlp, text-cleaning, preprocessing, stopwords, lemmatization",
    "author": null,
    "author_email": "Santhosh Bora <borasanthosh921@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ac/51/ca1da961d77f4e0e471216139774a8d7ca58047060021224dcb8918d059c/nlp_text_clean-0.1.4.tar.gz",
    "platform": null,
    "description": "# NLP Text Cleaner (README text, identical to the section above)",
    "bugtrack_url": null,
    "license": null,
    "summary": "A simple and configurable text preprocessing library for NLP tasks.",
    "version": "0.1.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/santhoshreddybora/nlp-text-clean/issues",
        "Homepage": "https://github.com/santhoshreddybora/nlp-text-clean"
    },
    "split_keywords": [
        "nlp",
        " text-cleaning",
        " preprocessing",
        " stopwords",
        " lemmatization"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2e394d3ab3bae4efc3790cc38eca6c0336e75698fc0bdeb41ca3f518cbf3e464",
                "md5": "f59f76f28b8c10df76a015a45a7eb663",
                "sha256": "2ad1c0c194aca19631d444515935e35335cd0aeaa652a896b48df58cde722f26"
            },
            "downloads": -1,
            "filename": "nlp_text_clean-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f59f76f28b8c10df76a015a45a7eb663",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 5217,
            "upload_time": "2025-09-13T16:10:10",
            "upload_time_iso_8601": "2025-09-13T16:10:10.174867Z",
            "url": "https://files.pythonhosted.org/packages/2e/39/4d3ab3bae4efc3790cc38eca6c0336e75698fc0bdeb41ca3f518cbf3e464/nlp_text_clean-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ac51ca1da961d77f4e0e471216139774a8d7ca58047060021224dcb8918d059c",
                "md5": "c0af1a3ead455d9e00fa59554e2b21f3",
                "sha256": "ef13b775dded921e1fd5ba4aa6a14bf8d44fdf4fbdfb145fe577b502f3b22583"
            },
            "downloads": -1,
            "filename": "nlp_text_clean-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "c0af1a3ead455d9e00fa59554e2b21f3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5707,
            "upload_time": "2025-09-13T16:10:11",
            "upload_time_iso_8601": "2025-09-13T16:10:11.507267Z",
            "url": "https://files.pythonhosted.org/packages/ac/51/ca1da961d77f4e0e471216139774a8d7ca58047060021224dcb8918d059c/nlp_text_clean-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-13 16:10:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "santhoshreddybora",
    "github_project": "nlp-text-clean",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.8"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": []
        }
    ],
    "lcname": "nlp-text-clean"
}
        