| Field | Value |
| --- | --- |
| Name | kleantext |
| Version | 0.1.3 |
| home_page | https://github.com/karan9970/kleantext |
| Summary | A Python module for preprocessing text for NLP tasks |
| upload_time | 2025-02-02 12:35:39 |
| maintainer | None |
| docs_url | None |
| author | Karan |
| requires_python | >=3.7 |
| license | MIT |
| keywords | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# kleantext
A Python package for preprocessing textual data for machine learning and natural language processing tasks. It includes functionality for:
- Converting text to lowercase (optional case-sensitive mode)
- Removing HTML tags, punctuation, numbers, and special characters
- Handling emojis (removal or conversion to textual descriptions)
- Handling negations
- Removing or retaining specific patterns (hashtags, mentions, etc.)
- Removing stopwords (with customizable stopword lists)
- Stemming and lemmatization
- Correcting spelling (optional)
- Expanding contractions and slang
- Named Entity Recognition (NER) masking (e.g., replacing entities with placeholders)
- Detecting and translating text to a target language
- Profanity filtering
- Customizable text preprocessing pipeline
---
## Installation
### Option 1: Clone or Download
1. Clone the repository:
   ```bash
   git clone https://github.com/karan9970/kleantext.git
   ```
2. Navigate to the project directory:
   ```bash
   cd kleantext
   ```
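3. Install the package from the cloned source (a minimal sketch; this assumes the repository ships standard packaging metadata such as a `setup.py` or `pyproject.toml`):
   ```bash
   pip install .
   ```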
### Option 2: Install from PyPI
```bash
pip install kleantext
```
---
## Usage
### Quick Start
```python
from kleantext.preprocessor import TextPreprocessor
# Initialize the preprocessor with custom settings
preprocessor = TextPreprocessor(
    remove_stopwords=True,
    perform_spellcheck=True,
    use_stemming=False,
    use_lemmatization=True,
    custom_stopwords={"example", "test"},
    case_sensitive=False,
    detect_language=True,
    target_language="en"
)
# Input text
text = "This is an example! Isn't it great? Visit https://example.com for more ๐."
# Preprocess the text
clean_text = preprocessor.clean_text(text)
print(clean_text) # Output: "this is isnt it great visit for more"
```
---
## Features and Configuration
### 1. Case Sensitivity
Control whether the text should be converted to lowercase:
```python
preprocessor = TextPreprocessor(case_sensitive=True)
```
### 2. Removing HTML Tags
Automatically remove HTML tags like `<div>` or `<p>`.
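A minimal sketch of this behaviour, assuming tag stripping runs as part of `clean_text` with default settings (the exact output also depends on which other options are enabled):
```python
from kleantext.preprocessor import TextPreprocessor

preprocessor = TextPreprocessor()
text = "<div>Hello <p>world</p></div>"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text)  # expected (assuming default lowercasing): "hello world"
```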
### 3. Emoji Handling
Convert emojis to text or remove them entirely:
```python
import emoji
text = emoji.demojize("😊 Hello!") # Output: ":blush: Hello!"
```
### 4. Stopword Removal
Remove common stopwords, with support for custom lists:
```python
custom_stopwords = {"is", "an", "the"}
preprocessor = TextPreprocessor(custom_stopwords=custom_stopwords)
```
### 5. Slang and Contraction Expansion
Expand contractions like "can't" to "cannot":
```python
text = "I can't go"
expanded_text = preprocessor.clean_text(text)
```
### 6. Named Entity Recognition (NER) Masking
Mask entities like names, organizations, or dates using `spacy`:
```python
text = "Barack Obama was the 44th President of the USA."
masked_text = preprocessor.clean_text(text)
```
### 7. Profanity Filtering
Censor offensive words:
```python
text = "This is a badword!"
filtered_text = preprocessor.clean_text(text)
```
### 8. Language Detection and Translation
Detect the text's language and translate it:
```python
preprocessor = TextPreprocessor(detect_language=True, target_language="en")
text = "Bonjour tout le monde"
translated_text = preprocessor.clean_text(text) # Output: "Hello everyone"
```
### 9. Tokenization
Tokenize text for further NLP tasks:
```python
from nltk.tokenize import word_tokenize
tokens = word_tokenize("This is an example.")
print(tokens) # Output: ['This', 'is', 'an', 'example', '.']
```
---
## Advanced Configuration
Create a custom pipeline by enabling or disabling specific cleaning steps:
```python
pipeline = ["lowercase", "remove_html", "remove_urls", "remove_stopwords"]
preprocessor.clean_text(text, pipeline=pipeline)
```
---
## Testing
Run unit tests using:
```bash
python -m unittest discover tests
```
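For reference, a test file could look like the sketch below; the file name, test case, and expected behaviour are illustrative assumptions rather than part of the shipped suite:
```python
# tests/test_preprocessor.py (hypothetical example)
import unittest

from kleantext.preprocessor import TextPreprocessor


class TestTextPreprocessor(unittest.TestCase):
    def test_lowercases_by_default(self):
        preprocessor = TextPreprocessor(case_sensitive=False)
        cleaned = preprocessor.clean_text("Hello WORLD")
        # Assumes case_sensitive=False lowercases whatever text survives cleaning.
        self.assertEqual(cleaned, cleaned.lower())


if __name__ == "__main__":
    unittest.main()
```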
---
## License
This project is licensed under the MIT License.
---
## Contributing
Feel free to fork the repository, create a feature branch, and submit a pull request. Contributions are welcome!
---
## Snippets
### Full Preprocessing Example
```python
from kleantext.preprocessor import TextPreprocessor
# Initialize with default settings
preprocessor = TextPreprocessor(remove_stopwords=True, perform_spellcheck=False)
text = "Hello!!! This is, an example. Isn't it? ๐"
clean_text = preprocessor.clean_text(text)
print(clean_text)
```
### Profanity Filtering
```python
preprocessor = TextPreprocessor()
text = "This is a badword!"
clean_text = preprocessor.clean_text(text)
print(clean_text) # Output: "This is a [CENSORED]!"
```
## Usage Examples
### 1. Converting Text to Lowercase (Optional Case-Sensitive Mode)
```python
preprocessor = TextPreprocessor(case_sensitive=True)
text = "Hello WORLD!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "Hello WORLD!"
preprocessor = TextPreprocessor(case_sensitive=False)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "hello world"
```
---
### 2. Removing HTML Tags, Punctuation, Numbers, and Special Characters
```python
text = "This is a <b>bold</b> statement! Price: $100."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is a bold statement price"
```
---
### 3. Handling Emojis (Removal or Conversion to Textual Descriptions)
```python
text = "I love Python! ๐๐ฅ"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i love python :heart_eyes: :fire:"
```
---
### 4. Handling Negations
```python
text = "I don't like this movie."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i do not like this movie"
```
---
### 5. Removing or Retaining Specific Patterns (Hashtags, Mentions, etc.)
```python
text = "Follow @user and check #MachineLearning!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "follow user and check machinelearning"
```
---
### 6. Removing Stopwords (With Customizable Stopword Lists)
```python
preprocessor = TextPreprocessor(custom_stopwords={"example", "test"})
text = "This is an example test showing stopword removal."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is an showing stopword removal"
```
---
### 7. Stemming and Lemmatization
#### With Stemming:
```python
preprocessor = TextPreprocessor(use_stemming=True, use_lemmatization=False)
text = "running flies better than walking"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "run fli better than walk"
```
#### With Lemmatization:
```python
preprocessor = TextPreprocessor(use_stemming=False, use_lemmatization=True)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "running fly better than walking"
```
---
### 8. Correcting Spelling (Optional)
```python
preprocessor = TextPreprocessor(perform_spellcheck=True)
text = "Ths is a tst sentnce."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "This is a test sentence"
```
---
### 9. Expanding Contractions and Slang Handling
```python
from contractions import fix
text = "I'm gonna go, but I can't wait!"
text = fix(text)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i am going to go but i cannot wait"
```
---
### 10. Named Entity Recognition (NER) Masking
```python
text = "John Doe lives in New York."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "[PERSON] lives in [LOCATION]"
```
---
### 11. Detecting and Translating Text to a Target Language
```python
preprocessor = TextPreprocessor(detect_language=True, target_language="en")
text = "Bonjour! Comment รงa va?"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "Hello! How are you?"
```
---
### 12. Profanity Filtering
```python
text = "This is a f***ing great product!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is a **** great product"
```
---
### 13. Customizable Text Preprocessing Pipeline
```python
preprocessor = TextPreprocessor(
    remove_stopwords=True,
    perform_spellcheck=True,
    use_stemming=False,
    use_lemmatization=True,
    case_sensitive=False,
    detect_language=True,
    target_language="en"
)
text = "Ths is an amazng movi!! ๐๐ฅ <b>100%</b> recommended!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is an amazing movie heart_eyes fire recommended"
```
---
## Conclusion
KleanText is a robust and flexible text preprocessing library designed to clean and normalize text efficiently. You can customize the pipeline to fit your specific NLP needs. 🚀
## Raw data
```json
{
"_id": null,
"home_page": "https://github.com/karan9970/kleantext",
"name": "kleantext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": "Karan",
"author_email": "Karansd00@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e6/1d/5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37/kleantext-0.1.3.tar.gz",
"platform": null,
"description": "\r\n# kleantext\r\nA Python package for preprocessing textual data for machine learning and natural language processing tasks. It includes functionality for:\r\n\r\n- Converting text to lowercase (optional case-sensitive mode)\r\n- Removing HTML tags, punctuation, numbers, and special characters\r\n- Handling emojis (removal or conversion to textual descriptions)\r\n- Handling negations\r\n- Removing or retaining specific patterns (hashtags, mentions, etc.)\r\n- Removing stopwords (with customizable stopword lists)\r\n- Stemming and lemmatization\r\n- Correcting spelling (optional)\r\n- Expanding contractions and slangs\r\n- Named Entity Recognition (NER) masking (e.g., replacing entities with placeholders)\r\n- Detecting and translating text to a target language\r\n- Profanity filtering\r\n- Customizable text preprocessing pipeline\r\n\r\n---\r\n\r\n## Installation\r\n### Option 1: Clone or Download\r\n1. Clone the repository using:\r\n ```bash\r\n git clone https://github.com/your-username/kleantext.git\r\n ```\r\n2. Navigate to the project directory:\r\n ```bash\r\n cd kleantext\r\n ```\r\n\r\n### Option 2: Install via pip (if published)\r\n```bash\r\npip install kleantext\r\n```\r\n\r\n---\r\n\r\n## Usage\r\n### Quick Start\r\n```python\r\nfrom kleantext.preprocessor import TextPreprocessor\r\n\r\n# Initialize the preprocessor with custom settings\r\npreprocessor = TextPreprocessor(\r\n remove_stopwords=True,\r\n perform_spellcheck=True,\r\n use_stemming=False,\r\n use_lemmatization=True,\r\n custom_stopwords={\"example\", \"test\"},\r\n case_sensitive=False,\r\n detect_language=True,\r\n target_language=\"en\"\r\n)\r\n\r\n# Input text\r\ntext = \"This is an example! Isn't it great? Visit https://example.com for more \ud83d\ude0a.\"\r\n\r\n# Preprocess the text\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text) # Output: \"this is isnt it great visit for more\"\r\n```\r\n\r\n---\r\n\r\n## Features and Configuration\r\n### 1. Case Sensitivity\r\nControl whether the text should be converted to lowercase:\r\n```python\r\npreprocessor = TextPreprocessor(case_sensitive=True)\r\n```\r\n\r\n### 2. Removing HTML Tags\r\nAutomatically remove HTML tags like `<div>` or `<p>`.\r\n\r\n### 3. Emoji Handling\r\nConvert emojis to text or remove them entirely:\r\n```python\r\nimport emoji\r\ntext = emoji.demojize(\"\ud83d\ude0a Hello!\") # Output: \":blush: Hello!\"\r\n```\r\n\r\n### 4. Stopword Removal\r\nRemove common stopwords, with support for custom lists:\r\n```python\r\ncustom_stopwords = {\"is\", \"an\", \"the\"}\r\npreprocessor = TextPreprocessor(custom_stopwords=custom_stopwords)\r\n```\r\n\r\n### 5. Slang and Contraction Expansion\r\nExpand contractions like \"can't\" to \"cannot\":\r\n```python\r\ntext = \"I can't go\"\r\nexpanded_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 6. Named Entity Recognition (NER) Masking\r\nMask entities like names, organizations, or dates using `spacy`:\r\n```python\r\ntext = \"Barack Obama was the 44th President of the USA.\"\r\nmasked_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 7. Profanity Filtering\r\nCensor offensive words:\r\n```python\r\ntext = \"This is a badword!\"\r\nfiltered_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 8. 
Language Detection and Translation\r\nDetect the text's language and translate it:\r\n```python\r\npreprocessor = TextPreprocessor(detect_language=True, target_language=\"en\")\r\ntext = \"Bonjour tout le monde\"\r\ntranslated_text = preprocessor.clean_text(text) # Output: \"Hello everyone\"\r\n```\r\n\r\n### 9. Tokenization\r\nTokenize text for further NLP tasks:\r\n```python\r\nfrom nltk.tokenize import word_tokenize\r\ntokens = word_tokenize(\"This is an example.\")\r\nprint(tokens) # Output: ['This', 'is', 'an', 'example', '.']\r\n```\r\n\r\n---\r\n\r\n## Advanced Configuration\r\nCreate a custom pipeline by enabling or disabling specific cleaning steps:\r\n```python\r\npipeline = [\"lowercase\", \"remove_html\", \"remove_urls\", \"remove_stopwords\"]\r\npreprocessor.clean_text(text, pipeline=pipeline)\r\n```\r\n\r\n---\r\n\r\n## Testing\r\nRun unit tests using:\r\n```bash\r\npython -m unittest discover tests\r\n```\r\n\r\n---\r\n\r\n## License\r\nThis project is licensed under the MIT License.\r\n\r\n---\r\n\r\n## Contributing\r\nFeel free to fork the repository, create a feature branch, and submit a pull request. Contributions are welcome!\r\n\r\n---\r\n\r\n## Snippets\r\n### Full Preprocessing Example\r\n```python\r\nfrom kleantext.preprocessor import TextPreprocessor\r\n\r\n# Initialize with default settings\r\npreprocessor = TextPreprocessor(remove_stopwords=True, perform_spellcheck=False)\r\n\r\ntext = \"Hello!!! This is, an example. Isn't it? \ud83d\ude0a\"\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text)\r\n```\r\n\r\n### Profanity Filtering\r\n```python\r\npreprocessor = TextPreprocessor()\r\ntext = \"This is a badword!\"\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text) # Output: \"This is a [CENSORED]!\"\r\n```\r\n\r\n## Usage Examples\r\n\r\n### 1. Converting Text to Lowercase (Optional Case-Sensitive Mode)\r\n```python\r\npreprocessor = TextPreprocessor(case_sensitive=True)\r\ntext = \"Hello WORLD!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"Hello WORLD!\"\r\n\r\npreprocessor = TextPreprocessor(case_sensitive=False)\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"hello world\"\r\n```\r\n\r\n---\r\n\r\n### 2. Removing HTML Tags, Punctuation, Numbers, and Special Characters\r\n```python\r\ntext = \"This is a <b>bold</b> statement! Price: $100.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is a bold statement price\"\r\n```\r\n\r\n---\r\n\r\n### 3. Handling Emojis (Removal or Conversion to Textual Descriptions)\r\n```python\r\ntext = \"I love Python! \ud83d\ude0d\ud83d\udd25\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i love python :heart_eyes: :fire:\"\r\n```\r\n\r\n---\r\n\r\n### 4. Handling Negations\r\n```python\r\ntext = \"I don't like this movie.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i do not like this movie\"\r\n```\r\n\r\n---\r\n\r\n### 5. Removing or Retaining Specific Patterns (Hashtags, Mentions, etc.)\r\n```python\r\ntext = \"Follow @user and check #MachineLearning!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"follow user and check machinelearning\"\r\n```\r\n\r\n---\r\n\r\n### 6. 
Removing Stopwords (With Customizable Stopword Lists)\r\n```python\r\npreprocessor = TextPreprocessor(custom_stopwords={\"example\", \"test\"})\r\ntext = \"This is an example test showing stopword removal.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is an showing stopword removal\"\r\n```\r\n\r\n---\r\n\r\n### 7. Stemming and Lemmatization\r\n#### With Stemming:\r\n```python\r\npreprocessor = TextPreprocessor(use_stemming=True, use_lemmatization=False)\r\ntext = \"running flies better than walking\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"run fli better than walk\"\r\n```\r\n#### With Lemmatization:\r\n```python\r\npreprocessor = TextPreprocessor(use_stemming=False, use_lemmatization=True)\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"running fly better than walking\"\r\n```\r\n\r\n---\r\n\r\n### 8. Correcting Spelling (Optional)\r\n```python\r\npreprocessor = TextPreprocessor(perform_spellcheck=True)\r\ntext = \"Ths is a tst sentnce.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"This is a test sentence\"\r\n```\r\n\r\n---\r\n\r\n### 9. Expanding Contractions and Slang Handling\r\n```python\r\nfrom contractions import fix\r\n\r\ntext = \"I'm gonna go, but I can't wait!\"\r\ntext = fix(text) \r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i am going to go but i cannot wait\"\r\n```\r\n\r\n---\r\n\r\n### 10. Named Entity Recognition (NER) Masking\r\n```python\r\ntext = \"John Doe lives in New York.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"[PERSON] lives in [LOCATION]\"\r\n```\r\n\r\n---\r\n\r\n### 11. Detecting and Translating Text to a Target Language\r\n```python\r\npreprocessor = TextPreprocessor(detect_language=True, target_language=\"en\")\r\ntext = \"Bonjour! Comment \u00e7a va?\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"Hello! How are you?\"\r\n```\r\n\r\n---\r\n\r\n### 12. Profanity Filtering\r\n```python\r\ntext = \"This is a f***ing great product!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is a **** great product\"\r\n```\r\n\r\n---\r\n\r\n### 13. Customizable Text Preprocessing Pipeline\r\n```python\r\npreprocessor = TextPreprocessor(\r\n remove_stopwords=True,\r\n perform_spellcheck=True,\r\n use_stemming=False,\r\n use_lemmatization=True,\r\n case_sensitive=False,\r\n detect_language=True,\r\n target_language=\"en\"\r\n)\r\n\r\ntext = \"Ths is an amazng movi!! \ud83d\ude0d\ud83d\udd25 <b>100%</b> recommended!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is an amazing movie heart_eyes fire recommended\"\r\n```\r\n\r\n---\r\n\r\n## Conclusion\r\nKleanText is a robust and flexible text preprocessing library designed to clean and normalize text efficiently. You can customize the pipeline to fit your specific NLP needs. \ud83d\ude80\r\n\r\n\r\n\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python module for preprocessing text for NLP tasks",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/karan9970/kleantext"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "454d5d342debaf57ee834c29e88258d88b2fc68f1dbcdcd7313bb60b6e8129a6",
"md5": "1f0daeda43d09d994f589a79fc02d9ce",
"sha256": "9aec86020c8fa68a49f3bf19bd3c73423b48aa6ef3cf397bfbfbfe228ed6de83"
},
"downloads": -1,
"filename": "kleantext-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1f0daeda43d09d994f589a79fc02d9ce",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 4964,
"upload_time": "2025-02-02T12:35:37",
"upload_time_iso_8601": "2025-02-02T12:35:37.239856Z",
"url": "https://files.pythonhosted.org/packages/45/4d/5d342debaf57ee834c29e88258d88b2fc68f1dbcdcd7313bb60b6e8129a6/kleantext-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e61d5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37",
"md5": "1f50c635d1ad01dd0ae55aa72b3482ab",
"sha256": "00e00d7885b8f897299962928ac13246de3fcb68f6807366cbad5387b76d759b"
},
"downloads": -1,
"filename": "kleantext-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "1f50c635d1ad01dd0ae55aa72b3482ab",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 5624,
"upload_time": "2025-02-02T12:35:39",
"upload_time_iso_8601": "2025-02-02T12:35:39.494811Z",
"url": "https://files.pythonhosted.org/packages/e6/1d/5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37/kleantext-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-02 12:35:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "karan9970",
"github_project": "kleantext",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "kleantext"
}
```