# 🧹 textcleaner-partha
[PyPI](https://pypi.org/project/textcleaner-partha/)
[License](LICENSE)
A lightweight and reusable text preprocessing package for NLP tasks.
It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.
## ✨ Features

- ✅ HTML tag and emoji removal
- ✅ Stopword removal
- ✅ Contraction expansion (e.g., "can't" → "cannot")
- ✅ Abbreviation expansion (e.g., "asap" → "as soon as possible")
- ✅ Spelling correction with autocorrect
- ✅ Lemmatisation using spaCy (en_core_web_sm)
- ✅ Filtering of stopwords, punctuation, and numbers
- ✅ Retention of only nouns, verbs, adjectives, and adverbs
- ✅ Token extraction from a text via `get_tokens()`
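For intuition, contraction expansion amounts to a dictionary lookup over known short forms. The sketch below is a toy illustration with a hypothetical `CONTRACTION_MAP`; the package itself relies on the `contractions` library, which covers far more forms:

```python
import re

# Toy contraction table; the real `contractions` library covers far more forms.
CONTRACTION_MAP = {
    "can't": "cannot",
    "it's": "it is",
    "don't": "do not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTION_MAP)) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # Substitute each matched contraction with its expansion.
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I can't believe it's raining"))
# → I cannot believe it is raining
```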
## 🚀 Installation
### From PyPI:
```bash
pip install --upgrade textcleaner-partha
```
### From GitHub:
```bash
pip install git+https://github.com/partha6369/textcleaner.git
```
## 🧠 Usage
```python
from textcleaner_partha import preprocess
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)
# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,          # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,   # Skip spelling correction
    expand_contraction=False  # Skip contraction expansion
)
print(cleaned_partial)
```
```python
from textcleaner_partha import get_tokens
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
# Default usage (all features enabled)
tokens = get_tokens(text)
print(tokens)
# Custom usage with optional features disabled
tokens_partial = get_tokens(
    text,
    lemmatise=False,          # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,   # Skip spelling correction
    expand_contraction=False  # Skip contraction expansion
)
print(tokens_partial)
```
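Under the hood, the HTML-removal step can be pictured as stripping anything between angle brackets. The snippet below is a naive sketch for illustration only; robust cleaning should use an HTML parser such as BeautifulSoup (the package lists beautifulsoup4 among its requirements):

```python
import re

def strip_html(text):
    # Naive tag stripper for illustration; it does not handle comments,
    # scripts, or malformed markup the way a real HTML parser would.
    return re.sub(r"<[^>]+>", "", text)

print(strip_html("<p>Click here</p>"))
# → Click here
```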
## 🔧 Parameters

The `preprocess()` and `get_tokens()` functions offer flexible control over each text-cleaning step. You can selectively enable or disable operations using the parameters below:
```python
def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```
```python
def get_tokens(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```
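The `remove_emoji` step can likewise be pictured as a regex over emoji code-point ranges. This is a toy sketch with rough, incomplete ranges, not the package's actual implementation:

```python
import re

# Rough emoji ranges for illustration; full coverage needs many more blocks.
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def remove_emoji(text):
    # Delete every run of characters falling in the ranges above.
    return _EMOJI.sub("", text)

print(remove_emoji("raining! \U0001F61E"))
```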
## 📦 Dependencies

- spacy
- autocorrect
- contractions
- beautifulsoup4

You can install them manually or via the included `requirements.txt`:
```bash
pip install -r requirements.txt
```
And download the required spaCy model:
```bash
python -m spacy download en_core_web_sm
```
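Because spaCy models install as regular Python packages, you can check that the model is available before calling the package. A small helper sketch (the function name `has_spacy_model` is illustrative, not part of this package):

```python
import importlib.util

def has_spacy_model(model="en_core_web_sm"):
    # spaCy models are importable packages, so find_spec is enough to
    # detect whether `python -m spacy download en_core_web_sm` has been run.
    return importlib.util.find_spec(model) is not None

print(has_spacy_model())
```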
## 📄 License

MIT License © Dr. Partha Majumdar