textcleaner-partha

- Name: textcleaner-partha
- Version: 1.1.2
- Home page: https://github.com/partha6369/textcleaner
- Summary: A lightweight and reusable text preprocessing package for NLP tasks
- Upload time: 2025-08-02 16:21:04
- Author: Dr. Partha Majumdar
- Requires Python: >=3.7
- Keywords: NLP, text preprocessing, lemmatization, contractions, emoji removal, spacy, autocorrect, tokens
- Requirements: spacy, autocorrect, contractions, beautifulsoup4
# 🧹 textcleaner-partha

[![PyPI version](https://img.shields.io/pypi/v/textcleaner-partha?color=blue)](https://pypi.org/project/textcleaner-partha/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A lightweight and reusable text preprocessing package for NLP tasks.
It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.
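To make the first two cleaning steps concrete, here is a minimal stdlib-only sketch of HTML-tag and emoji stripping. This is illustrative only: the package itself relies on beautifulsoup4 for HTML parsing, and the emoji ranges below cover only the most common blocks.

```python
import re

# Illustrative only: the real package uses beautifulsoup4 for HTML parsing;
# this regex sketch just shows the idea of the first two cleaning steps.
TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\u2600-\u27BF"          # misc symbols and dingbats
    "]+"
)

def strip_html_and_emoji(text: str) -> str:
    text = TAG_RE.sub(" ", text)       # replace tags with a space
    text = EMOJI_RE.sub("", text)      # drop emoji characters
    return " ".join(text.split())      # collapse leftover whitespace

print(strip_html_and_emoji("It's raining! 😞 <p>Click here</p>"))
# → It's raining! Click here
```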

## ✨ Features
- ✅ HTML tag and emoji removal
- ✅ Stopword removal
- ✅ Contraction expansion (e.g., “can’t” → “cannot”)
- ✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
- ✅ Spelling correction with autocorrect
- ✅ Lemmatisation using spaCy (en_core_web_sm)
- ✅ Filters out stopwords, punctuation, and numbers
- ✅ Retains only nouns, verbs, adjectives, and adverbs
- ✅ Returns the tokens of a text
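Contraction and abbreviation expansion amount to dictionary lookups over the tokenised text. The sketch below is hypothetical: the package delegates contraction handling to the `contractions` library, and this tiny lookup table exists only to show the shape of the transformation.

```python
# Hypothetical lookup tables, for illustration only; the real package uses
# the `contractions` library and its own abbreviation list.
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "won't": "will not"}
ABBREVIATIONS = {"asap": "as soon as possible", "fyi": "for your information"}

def expand(text: str) -> str:
    words = []
    for word in text.lower().split():   # mirrors the lowercase=True default
        word = CONTRACTIONS.get(word, word)
        word = ABBREVIATIONS.get(word, word)
        words.append(word)
    return " ".join(words)

print(expand("I can't reply asap"))
# → i cannot reply as soon as possible
```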


## πŸš€ Installation

### From PyPI:

```bash
pip install --upgrade textcleaner-partha
```

### From GitHub:

```bash
pip install git+https://github.com/partha6369/textcleaner.git
```

## 🧠 Usage

```python
from textcleaner_partha import preprocess

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)

# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(cleaned_partial)
```

```python
from textcleaner_partha import get_tokens

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
tokens = get_tokens(text)
print(tokens)

# Custom usage with optional features disabled
tokens_partial = get_tokens(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(tokens_partial)
```

## πŸ”§ Parameters

The `preprocess()` and `get_tokens()` functions accept the same parameters and offer flexible control over each text-cleaning step; `preprocess()` returns the cleaned string, while `get_tokens()` returns the list of tokens. You can selectively enable or disable operations using the parameters below:

```python
def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```

```python
def get_tokens(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```
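A common way to implement this kind of flag-driven API is to assemble the enabled steps into a list and apply them in order. The sketch below is a hypothetical illustration of that pattern using three of the flags, not the package's actual implementation.

```python
import re

# Hypothetical sketch of how boolean flags can compose a cleaning pipeline;
# not the package's actual implementation.
def pipeline(text, lowercase=True, remove_punct=False, remove_whitespace=True):
    steps = []
    if lowercase:
        steps.append(str.lower)
    if remove_punct:
        steps.append(lambda t: re.sub(r"[^\w\s]", "", t))
    if remove_whitespace:
        steps.append(lambda t: " ".join(t.split()))
    for step in steps:          # apply the enabled steps in order
        text = step(text)
    return text

print(pipeline("  Hello,   World! "))                     # → hello, world!
print(pipeline("  Hello,   World! ", remove_punct=True))  # → hello world
```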

## πŸ“¦ Dependencies

- `spacy` (>=3.0.0)
- `autocorrect` (==0.4.4)
- `contractions` (>=0.1.73)
- `beautifulsoup4` (>=4.12.0)

You can install them manually or via the included `requirements.txt`:
```bash
pip install -r requirements.txt
```

And download the required spaCy model:
```bash
python -m spacy download en_core_web_sm
```


## πŸ“„ License

MIT License Β© Dr. Partha Majumdar

            
