# 🧹 textcleaner-partha
[PyPI](https://pypi.org/project/textcleaner-partha/) · [License](LICENSE)
A lightweight and reusable text preprocessing package for NLP tasks.
It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.
## ✨ Features
- ✅ HTML tag and emoji removal
- ✅ Contraction expansion (e.g., “can’t” → “cannot”)
- ✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
- ✅ Spelling correction with autocorrect
- ✅ Lemmatization using spaCy (`en_core_web_sm`)
- ✅ Filters out stopwords, punctuation, and numbers
- ✅ Retains only nouns, verbs, adjectives, and adverbs
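For a rough idea of what the first two features do, the HTML and emoji removal steps can be approximated with the standard library alone. This is only an illustrative sketch — the package declares beautifulsoup4 as a dependency, so its actual HTML handling is more robust than the regex below:

```python
import re

def strip_html(text):
    # Crudely drop anything that looks like an HTML tag.
    # (The package itself depends on beautifulsoup4, which parses
    # HTML properly instead of pattern-matching it.)
    return re.sub(r"<[^>]+>", "", text)

def strip_emoji(text):
    # Drop characters outside the Basic Multilingual Plane, where most
    # emoji live. Real emoji detection is more nuanced than this.
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)

sample = "I can't believe it's raining! \U0001F61E <p>Click here</p>"
cleaned = re.sub(r"\s+", " ", strip_emoji(strip_html(sample))).strip()
print(cleaned)  # I can't believe it's raining! Click here
```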
## 🚀 Installation
### From PyPI:
```bash
pip install textcleaner-partha
```
### From GitHub:
```bash
pip install git+https://github.com/partha6369/textcleaner.git
```
## 🧠 Usage
```python
from textcleaner_partha import preprocess
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)
# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,          # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,   # Skip spelling correction
    expand_contraction=False  # Skip contraction expansion
)
print(cleaned_partial)
```
## 🔧 Parameters
The `preprocess()` function offers flexible control over each cleaning step. You can selectively enable or disable individual operations using the parameters below:
```python
def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
```
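Each flag gates one independent stage of the pipeline. The toy function below sketches how such boolean toggles typically compose — it is a deliberately simplified stand-in, not the package's actual implementation (the single hard-coded contraction replaces the contractions library):

```python
import re

def toy_preprocess(text, lowercase=True, remove_html=True, expand_contraction=True):
    # Each flag enables exactly one transformation, applied in a fixed order.
    if remove_html:
        text = re.sub(r"<[^>]+>", "", text)     # crude tag stripper
    if expand_contraction:
        text = text.replace("can't", "cannot")  # tiny stand-in for a contractions dict
    if lowercase:
        text = text.lower()
    return text.strip()

print(toy_preprocess("I can't stop! <b>Go</b>"))
# i cannot stop! go
print(toy_preprocess("I can't stop! <b>Go</b>", expand_contraction=False))
# i can't stop! go
```

Disabling a flag simply skips that stage, leaving the rest of the pipeline untouched — the same pattern the real `preprocess()` signature suggests.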
## 📦 Dependencies
- spacy
- autocorrect
- contractions
- beautifulsoup4

You can install them manually or via the included `requirements.txt`:
```bash
pip install -r requirements.txt
```
And download the required spaCy model:
```bash
python -m spacy download en_core_web_sm
```
## 📄 License
MIT License © Dr. Partha Majumdar