textcleaner-partha

- Name: textcleaner-partha
- Version: 1.1.2
- Home page: https://github.com/partha6369/textcleaner
- Summary: A lightweight and reusable text preprocessing package for NLP tasks
- Upload time: 2025-08-02 16:21:04
- Author: Dr. Partha Majumdar
- Requires Python: >=3.7
- Keywords: NLP, text preprocessing, lemmatization, contractions, emoji removal, spacy, autocorrect, tokens
- Requirements: spacy, autocorrect, contractions, beautifulsoup4
# 🧹 textcleaner-partha

[![PyPI version](https://img.shields.io/pypi/v/textcleaner-partha?color=blue)](https://pypi.org/project/textcleaner-partha/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A lightweight and reusable text preprocessing package for NLP tasks.
It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.
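To make the first two cleaning steps concrete, here is a minimal stdlib-only sketch of HTML-tag and emoji stripping. This is illustrative only: the package itself relies on beautifulsoup4 for HTML parsing, and the emoji ranges below cover only the most common blocks.

```python
import re

# Illustrative only: the real package uses beautifulsoup4 for HTML parsing;
# this regex sketch just shows the idea of the first two cleaning steps.
TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\u2600-\u27BF"          # misc symbols and dingbats
    "]+"
)

def strip_html_and_emoji(text: str) -> str:
    text = TAG_RE.sub(" ", text)       # replace tags with a space
    text = EMOJI_RE.sub("", text)      # drop emoji characters
    return " ".join(text.split())      # collapse leftover whitespace

print(strip_html_and_emoji("It's raining! 😞 <p>Click here</p>"))
# → It's raining! Click here
```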

## ✨ Features
- ✅ HTML tag and emoji removal
- ✅ Stopword removal
- ✅ Contraction expansion (e.g., “can’t” → “cannot”)
- ✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
- ✅ Spelling correction with autocorrect
- ✅ Lemmatisation using spaCy (en_core_web_sm)
- ✅ Filters out stopwords, punctuation, and numbers
- ✅ Retains only nouns, verbs, adjectives, and adverbs
- ✅ Returns the tokens of a text
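Contraction and abbreviation expansion amount to dictionary lookups over the tokenised text. The sketch below is hypothetical: the package delegates contraction handling to the `contractions` library, and this tiny lookup table exists only to show the shape of the transformation.

```python
# Hypothetical lookup tables, for illustration only; the real package uses
# the `contractions` library and its own abbreviation list.
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "won't": "will not"}
ABBREVIATIONS = {"asap": "as soon as possible", "fyi": "for your information"}

def expand(text: str) -> str:
    words = []
    for word in text.lower().split():   # mirrors the lowercase=True default
        word = CONTRACTIONS.get(word, word)
        word = ABBREVIATIONS.get(word, word)
        words.append(word)
    return " ".join(words)

print(expand("I can't reply asap"))
# → i cannot reply as soon as possible
```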


## πŸš€ Installation

### From PyPI:

```bash
pip install --upgrade textcleaner-partha
```

### From GitHub:

```bash
pip install git+https://github.com/partha6369/textcleaner.git
```

## 🧠 Usage

```python
from textcleaner_partha import preprocess

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)

# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(cleaned_partial)
```

```python
from textcleaner_partha import get_tokens

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
tokens = get_tokens(text)
print(tokens)

# Custom usage with optional features disabled
tokens_partial = get_tokens(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(tokens_partial)
```

## πŸ”§ Parameters

The `preprocess()` and `get_tokens()` functions accept the same parameters and offer flexible control over each text-cleaning step; `preprocess()` returns the cleaned string, while `get_tokens()` returns the list of tokens. You can selectively enable or disable operations using the parameters below:

```python
def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```

```python
def get_tokens(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```
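A common way to implement this kind of flag-driven API is to assemble the enabled steps into a list and apply them in order. The sketch below is a hypothetical illustration of that pattern using three of the flags, not the package's actual implementation.

```python
import re

# Hypothetical sketch of how boolean flags can compose a cleaning pipeline;
# not the package's actual implementation.
def pipeline(text, lowercase=True, remove_punct=False, remove_whitespace=True):
    steps = []
    if lowercase:
        steps.append(str.lower)
    if remove_punct:
        steps.append(lambda t: re.sub(r"[^\w\s]", "", t))
    if remove_whitespace:
        steps.append(lambda t: " ".join(t.split()))
    for step in steps:          # apply the enabled steps in order
        text = step(text)
    return text

print(pipeline("  Hello,   World! "))                     # → hello, world!
print(pipeline("  Hello,   World! ", remove_punct=True))  # → hello world
```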

## πŸ“¦ Dependencies

- `spacy` (>=3.0.0)
- `autocorrect` (==0.4.4)
- `contractions` (>=0.1.73)
- `beautifulsoup4` (>=4.12.0)

You can install them manually or via the included `requirements.txt`:
```bash
pip install -r requirements.txt
```

And download the required spaCy model:
```bash
python -m spacy download en_core_web_sm
```


## πŸ“„ License

MIT License Β© Dr. Partha Majumdar

            
