textcleaner-partha


| Field | Value |
| --- | --- |
| Name | textcleaner-partha |
| Version | 0.2.0 |
| Home page | https://github.com/partha6369/textcleaner |
| Summary | A lightweight and reusable text preprocessing package for NLP tasks |
| Upload time | 2025-07-12 21:38:21 |
| Maintainer | None |
| Docs URL | None |
| Author | Dr. Partha Majumdar |
| Requires Python | >=3.7 |
| License | None |
| Keywords | nlp, text preprocessing, lemmatization, contractions, emoji removal, spacy, autocorrect |
| Requirements | spacy, autocorrect, contractions, beautifulsoup4 |
| Travis CI | No |
| Coveralls test coverage | No |

# 🧹 textcleaner-partha

[![PyPI version](https://img.shields.io/pypi/v/textcleaner-partha?color=blue)](https://pypi.org/project/textcleaner-partha/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A lightweight and reusable text preprocessing package for NLP tasks.
It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.

## ✨ Features
- ✅ HTML tag and emoji removal
- ✅ Contraction expansion (e.g., “can’t” → “cannot”)
- ✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
- ✅ Spelling correction with autocorrect
- ✅ Lemmatization using spaCy (en_core_web_sm)
- ✅ Filters out stopwords, punctuation, numbers
- ✅ Retains only nouns, verbs, adjectives, and adverbs
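
To make the first few features concrete, here is a minimal standard-library sketch of HTML tag removal, emoji removal, and abbreviation expansion. It is not the package's implementation (which depends on beautifulsoup4 and other libraries); the regular expressions, the `ABBREVIATIONS` table, and every function name below are assumptions made for illustration.

```python
import html
import re

# Hypothetical abbreviation table; the package's own table is not shown in this README.
ABBREVIATIONS = {
    "asap": "as soon as possible",
    "fyi": "for your information",
}

# Naive tag matcher: anything between "<" and ">" (assumption, not a full HTML parser).
TAG_RE = re.compile(r"<[^>]+>")

# Rough emoji coverage (assumption): pictographs and emoticons, misc symbols
# and dingbats, and regional indicator letters used for flags.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

WORD_RE = re.compile(r"[A-Za-z]+")


def strip_html(text):
    """Replace tags with a space, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def strip_emoji(text):
    """Delete characters that fall in the emoji ranges above."""
    return EMOJI_RE.sub("", text)


def expand_abbreviations(text):
    """Swap whole words found in ABBREVIATIONS (case-insensitive lookup)."""
    def replace(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return WORD_RE.sub(replace, text)


def basic_clean(text):
    """Run the three steps in order and collapse runs of whitespace."""
    text = strip_html(text)
    text = strip_emoji(text)
    text = expand_abbreviations(text)
    return " ".join(text.split())


print(basic_clean("Reply asap 😞 <p>Click here</p>"))
```

Spelling correction, lemmatization, and part-of-speech filtering are left out of this sketch because they need autocorrect and spaCy.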


## 🚀 Installation

### From PyPI:

```bash
pip install textcleaner-partha
```

### From GitHub:

```bash
pip install git+https://github.com/partha6369/textcleaner.git
```

## 🧠 Usage

```python
from textcleaner_partha import preprocess

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)

# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(cleaned_partial)
```
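
When cleaning many documents at once, a small wrapper can forward the same options to every call. The sketch below is an assumption-laden convenience, not part of the package: `clean_corpus` is a hypothetical helper, and it takes the cleaning function as an argument so it can be tried without spaCy installed.

```python
from typing import Callable, Iterable, List


def clean_corpus(
    texts: Iterable[str],
    cleaner: Callable[..., str],
    **options,
) -> List[str]:
    """Apply `cleaner` to each text, forwarding keyword options unchanged.

    Empty or whitespace-only entries are returned as "" without calling
    the cleaner, so the output stays aligned with the input positions.
    """
    cleaned = []
    for text in texts:
        if not text or not text.strip():
            cleaned.append("")
            continue
        cleaned.append(cleaner(text, **options))
    return cleaned


# Intended use with this package (requires it to be installed):
#
#   from textcleaner_partha import preprocess
#   docs = clean_corpus(reviews, preprocess, correct_spelling=False)
```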

## 🔧 Parameters

The `preprocess()` function exposes one boolean flag per cleaning step. Every flag defaults to `True`; pass `False` to skip that step:

```python
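def preprocess(
    text,
    lowercase=True,           # Convert text to lowercase
    remove_html=True,         # Strip HTML tags
    remove_emoji=True,        # Remove emojis
    expand_contraction=True,  # Expand contractions, e.g. "can't" -> "cannot"
    expand_abbrev=True,       # Expand abbreviations, e.g. "asap" -> "as soon as possible"
    correct_spelling=True,    # Correct spelling with autocorrect
    lemmatise=True,           # Run spaCy lemmatisation and POS filtering
): ...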
```
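
Note that the flag is spelled `lemmatise`, while the prose above uses "lemmatization", so a misspelled keyword is an easy mistake. One way to keep option sets reusable and catch such typos early is a small builder that validates names against the signature. This is a hypothetical helper, not something the package provides; only the seven flag names and their defaults come from the signature above.

```python
# Flag names and defaults copied from the preprocess() signature.
DEFAULT_OPTIONS = {
    "lowercase": True,
    "remove_html": True,
    "remove_emoji": True,
    "expand_contraction": True,
    "expand_abbrev": True,
    "correct_spelling": True,
    "lemmatise": True,
}


def build_options(**overrides):
    """Return the defaults with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise TypeError(f"Unknown preprocess option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


# A light preset that only normalises markup and case.
light = build_options(
    expand_contraction=False,
    expand_abbrev=False,
    correct_spelling=False,
    lemmatise=False,
)
print(light)

# Intended use with this package (requires it to be installed):
#
#   from textcleaner_partha import preprocess
#   cleaned = preprocess(text, **light)
```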

## 📦 Dependencies

- spacy (>=3.0.0)
- autocorrect (==0.4.4)
- contractions (>=0.1.73)
- beautifulsoup4 (>=4.12.0)

You can install them manually or via the included requirements.txt:
```bash
pip install -r requirements.txt
```

And download the required spaCy model:
```bash
python -m spacy download en_core_web_sm
```
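
To confirm that the dependencies and the spaCy model are present before calling `preprocess()`, a quick check along these lines may help. It is a sketch, not part of the package: `check_environment` is a hypothetical name, and it assumes the model was installed as the importable package `en_core_web_sm`, which is what `spacy download` does. Note that beautifulsoup4 is imported as `bs4`.

```python
import importlib.util

# Distribution name -> importable module name.
REQUIRED_MODULES = {
    "spacy": "spacy",
    "autocorrect": "autocorrect",
    "contractions": "contractions",
    "beautifulsoup4": "bs4",
    "en_core_web_sm": "en_core_web_sm",
}


def check_environment():
    """Map each requirement to True if its module can be located."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in REQUIRED_MODULES.items()
    }


for name, available in check_environment().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```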


## 📄 License

MIT License © Dr. Partha Majumdar

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/partha6369/textcleaner",
    "name": "textcleaner-partha",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "NLP, text preprocessing, lemmatization, contractions, emoji removal, spacy, autocorrect",
    "author": "Dr. Partha Majumdar",
    "author_email": "\"Dr. Partha Majumdar\" <partha.majumdar@icloud.com>",
    "download_url": "https://files.pythonhosted.org/packages/ea/e2/e4ea2aa7e78f795ddac99580debcd8e234660d9569361485de182fc4fcae/textcleaner_partha-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A lightweight and reusable text preprocessing package for NLP tasks",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/partha6369/textcleaner"
    },
    "split_keywords": [
        "nlp",
        " text preprocessing",
        " lemmatization",
        " contractions",
        " emoji removal",
        " spacy",
        " autocorrect"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1d3447dd6469642fb11e4cbb9a6c3f2aceccfde0aa12f5ed6d3f69a578012df5",
                "md5": "5d7b6d370445b84935c5ae3ae2f21faa",
                "sha256": "e8468ce8d4cb74ecf0641f132adefc468baeac7481185519487a872d52b40b79"
            },
            "downloads": -1,
            "filename": "textcleaner_partha-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5d7b6d370445b84935c5ae3ae2f21faa",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7013,
            "upload_time": "2025-07-12T21:38:20",
            "upload_time_iso_8601": "2025-07-12T21:38:20.492505Z",
            "url": "https://files.pythonhosted.org/packages/1d/34/47dd6469642fb11e4cbb9a6c3f2aceccfde0aa12f5ed6d3f69a578012df5/textcleaner_partha-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eae2e4ea2aa7e78f795ddac99580debcd8e234660d9569361485de182fc4fcae",
                "md5": "67da07be45e5b5bba930506303e47987",
                "sha256": "c825483d9c926b36caf7b9248ce88d5e2b892aa1196ed9cd22a6e359233ec9ce"
            },
            "downloads": -1,
            "filename": "textcleaner_partha-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "67da07be45e5b5bba930506303e47987",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 7212,
            "upload_time": "2025-07-12T21:38:21",
            "upload_time_iso_8601": "2025-07-12T21:38:21.861335Z",
            "url": "https://files.pythonhosted.org/packages/ea/e2/e4ea2aa7e78f795ddac99580debcd8e234660d9569361485de182fc4fcae/textcleaner_partha-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-12 21:38:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "partha6369",
    "github_project": "textcleaner",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "autocorrect",
            "specs": [
                [
                    "==",
                    "0.4.4"
                ]
            ]
        },
        {
            "name": "contractions",
            "specs": [
                [
                    ">=",
                    "0.1.73"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.12.0"
                ]
            ]
        }
    ],
    "lcname": "textcleaner-partha"
}
        