| Field | Value |
| --- | --- |
| Name | kleantext |
| Version | 0.1.3 |
| home_page | https://github.com/karan9970/kleantext |
| Summary | A Python module for preprocessing text for NLP tasks |
| upload_time | 2025-02-02 12:35:39 |
| maintainer | None |
| docs_url | None |
| author | Karan |
| requires_python | >=3.7 |
| license | MIT |
| keywords | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# kleantext
A Python package for preprocessing textual data for machine learning and natural language processing tasks. It includes functionality for:
- Converting text to lowercase (optional case-sensitive mode)
- Removing HTML tags, punctuation, numbers, and special characters
- Handling emojis (removal or conversion to textual descriptions)
- Handling negations
- Removing or retaining specific patterns (hashtags, mentions, etc.)
- Removing stopwords (with customizable stopword lists)
- Stemming and lemmatization
- Correcting spelling (optional)
- Expanding contractions and slang
- Named Entity Recognition (NER) masking (e.g., replacing entities with placeholders)
- Detecting and translating text to a target language
- Profanity filtering
- Customizable text preprocessing pipeline
---
## Installation
### Option 1: Clone or Download
1. Clone the repository:
   ```bash
   git clone https://github.com/karan9970/kleantext.git
   ```
2. Navigate to the project directory:
   ```bash
   cd kleantext
   ```
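3. Install the package from the cloned source (a minimal sketch; this assumes the repository ships standard packaging metadata such as a `setup.py` or `pyproject.toml`):
   ```bash
   pip install .
   ```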
### Option 2: Install from PyPI
```bash
pip install kleantext
```
---
## Usage
### Quick Start
```python
from kleantext.preprocessor import TextPreprocessor
# Initialize the preprocessor with custom settings
preprocessor = TextPreprocessor(
    remove_stopwords=True,
    perform_spellcheck=True,
    use_stemming=False,
    use_lemmatization=True,
    custom_stopwords={"example", "test"},
    case_sensitive=False,
    detect_language=True,
    target_language="en"
)
# Input text
text = "This is an example! Isn't it great? Visit https://example.com for more ๐."
# Preprocess the text
clean_text = preprocessor.clean_text(text)
print(clean_text) # Output: "this is isnt it great visit for more"
```
---
## Features and Configuration
### 1. Case Sensitivity
Control whether the text should be converted to lowercase:
```python
preprocessor = TextPreprocessor(case_sensitive=True)
```
### 2. Removing HTML Tags
Automatically remove HTML tags like `<div>` or `<p>`.
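A minimal sketch of this behaviour, assuming tag stripping runs as part of `clean_text` with default settings (the exact output also depends on which other options are enabled):
```python
from kleantext.preprocessor import TextPreprocessor

preprocessor = TextPreprocessor()
text = "<div>Hello <p>world</p></div>"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text)  # expected (assuming default lowercasing): "hello world"
```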
### 3. Emoji Handling
Convert emojis to text or remove them entirely:
```python
import emoji
text = emoji.demojize("😊 Hello!") # Output: ":blush: Hello!"
```
### 4. Stopword Removal
Remove common stopwords, with support for custom lists:
```python
custom_stopwords = {"is", "an", "the"}
preprocessor = TextPreprocessor(custom_stopwords=custom_stopwords)
```
### 5. Slang and Contraction Expansion
Expand contractions like "can't" to "cannot":
```python
text = "I can't go"
expanded_text = preprocessor.clean_text(text)
```
### 6. Named Entity Recognition (NER) Masking
Mask entities like names, organizations, or dates using `spacy`:
```python
text = "Barack Obama was the 44th President of the USA."
masked_text = preprocessor.clean_text(text)
```
### 7. Profanity Filtering
Censor offensive words:
```python
text = "This is a badword!"
filtered_text = preprocessor.clean_text(text)
```
### 8. Language Detection and Translation
Detect the text's language and translate it:
```python
preprocessor = TextPreprocessor(detect_language=True, target_language="en")
text = "Bonjour tout le monde"
translated_text = preprocessor.clean_text(text) # Output: "Hello everyone"
```
### 9. Tokenization
Tokenize text for further NLP tasks:
```python
from nltk.tokenize import word_tokenize
tokens = word_tokenize("This is an example.")
print(tokens) # Output: ['This', 'is', 'an', 'example', '.']
```
---
## Advanced Configuration
Create a custom pipeline by enabling or disabling specific cleaning steps:
```python
pipeline = ["lowercase", "remove_html", "remove_urls", "remove_stopwords"]
preprocessor.clean_text(text, pipeline=pipeline)
```
---
## Testing
Run unit tests using:
```bash
python -m unittest discover tests
```
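For reference, a test file could look like the sketch below; the file name, test case, and expected behaviour are illustrative assumptions rather than part of the shipped suite:
```python
# tests/test_preprocessor.py (hypothetical example)
import unittest

from kleantext.preprocessor import TextPreprocessor


class TestTextPreprocessor(unittest.TestCase):
    def test_lowercases_by_default(self):
        preprocessor = TextPreprocessor(case_sensitive=False)
        cleaned = preprocessor.clean_text("Hello WORLD")
        # Assumes case_sensitive=False lowercases whatever text survives cleaning.
        self.assertEqual(cleaned, cleaned.lower())


if __name__ == "__main__":
    unittest.main()
```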
---
## License
This project is licensed under the MIT License.
---
## Contributing
Feel free to fork the repository, create a feature branch, and submit a pull request. Contributions are welcome!
---
## Snippets
### Full Preprocessing Example
```python
from kleantext.preprocessor import TextPreprocessor
# Initialize with default settings
preprocessor = TextPreprocessor(remove_stopwords=True, perform_spellcheck=False)
text = "Hello!!! This is, an example. Isn't it? ๐"
clean_text = preprocessor.clean_text(text)
print(clean_text)
```
### Profanity Filtering
```python
preprocessor = TextPreprocessor()
text = "This is a badword!"
clean_text = preprocessor.clean_text(text)
print(clean_text) # Output: "This is a [CENSORED]!"
```
## Usage Examples
### 1. Converting Text to Lowercase (Optional Case-Sensitive Mode)
```python
preprocessor = TextPreprocessor(case_sensitive=True)
text = "Hello WORLD!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "Hello WORLD!"
preprocessor = TextPreprocessor(case_sensitive=False)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "hello world"
```
---
### 2. Removing HTML Tags, Punctuation, Numbers, and Special Characters
```python
text = "This is a <b>bold</b> statement! Price: $100."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is a bold statement price"
```
---
### 3. Handling Emojis (Removal or Conversion to Textual Descriptions)
```python
text = "I love Python! ๐๐ฅ"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i love python :heart_eyes: :fire:"
```
---
### 4. Handling Negations
```python
text = "I don't like this movie."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i do not like this movie"
```
---
### 5. Removing or Retaining Specific Patterns (Hashtags, Mentions, etc.)
```python
text = "Follow @user and check #MachineLearning!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "follow user and check machinelearning"
```
---
### 6. Removing Stopwords (With Customizable Stopword Lists)
```python
preprocessor = TextPreprocessor(custom_stopwords={"example", "test"})
text = "This is an example test showing stopword removal."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is an showing stopword removal"
```
---
### 7. Stemming and Lemmatization
#### With Stemming:
```python
preprocessor = TextPreprocessor(use_stemming=True, use_lemmatization=False)
text = "running flies better than walking"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "run fli better than walk"
```
#### With Lemmatization:
```python
preprocessor = TextPreprocessor(use_stemming=False, use_lemmatization=True)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "running fly better than walking"
```
---
### 8. Correcting Spelling (Optional)
```python
preprocessor = TextPreprocessor(perform_spellcheck=True)
text = "Ths is a tst sentnce."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "This is a test sentence"
```
---
### 9. Expanding Contractions and Slang Handling
```python
from contractions import fix
text = "I'm gonna go, but I can't wait!"
text = fix(text)
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "i am going to go but i cannot wait"
```
---
### 10. Named Entity Recognition (NER) Masking
```python
text = "John Doe lives in New York."
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "[PERSON] lives in [LOCATION]"
```
---
### 11. Detecting and Translating Text to a Target Language
```python
preprocessor = TextPreprocessor(detect_language=True, target_language="en")
text = "Bonjour! Comment รงa va?"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "Hello! How are you?"
```
---
### 12. Profanity Filtering
```python
text = "This is a f***ing great product!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is a **** great product"
```
---
### 13. Customizable Text Preprocessing Pipeline
```python
preprocessor = TextPreprocessor(
    remove_stopwords=True,
    perform_spellcheck=True,
    use_stemming=False,
    use_lemmatization=True,
    case_sensitive=False,
    detect_language=True,
    target_language="en"
)
text = "Ths is an amazng movi!! ๐๐ฅ <b>100%</b> recommended!"
cleaned_text = preprocessor.clean_text(text)
print(cleaned_text) # Output: "this is an amazing movie heart_eyes fire recommended"
```
---
## Conclusion
KleanText is a robust and flexible text preprocessing library designed to clean and normalize text efficiently. You can customize the pipeline to fit your specific NLP needs. 🚀
## Raw data
```json
{
"_id": null,
"home_page": "https://github.com/karan9970/kleantext",
"name": "kleantext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": "Karan",
"author_email": "Karansd00@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e6/1d/5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37/kleantext-0.1.3.tar.gz",
"platform": null,
"description": "\r\n# kleantext\r\nA Python package for preprocessing textual data for machine learning and natural language processing tasks. It includes functionality for:\r\n\r\n- Converting text to lowercase (optional case-sensitive mode)\r\n- Removing HTML tags, punctuation, numbers, and special characters\r\n- Handling emojis (removal or conversion to textual descriptions)\r\n- Handling negations\r\n- Removing or retaining specific patterns (hashtags, mentions, etc.)\r\n- Removing stopwords (with customizable stopword lists)\r\n- Stemming and lemmatization\r\n- Correcting spelling (optional)\r\n- Expanding contractions and slangs\r\n- Named Entity Recognition (NER) masking (e.g., replacing entities with placeholders)\r\n- Detecting and translating text to a target language\r\n- Profanity filtering\r\n- Customizable text preprocessing pipeline\r\n\r\n---\r\n\r\n## Installation\r\n### Option 1: Clone or Download\r\n1. Clone the repository using:\r\n ```bash\r\n git clone https://github.com/your-username/kleantext.git\r\n ```\r\n2. Navigate to the project directory:\r\n ```bash\r\n cd kleantext\r\n ```\r\n\r\n### Option 2: Install via pip (if published)\r\n```bash\r\npip install kleantext\r\n```\r\n\r\n---\r\n\r\n## Usage\r\n### Quick Start\r\n```python\r\nfrom kleantext.preprocessor import TextPreprocessor\r\n\r\n# Initialize the preprocessor with custom settings\r\npreprocessor = TextPreprocessor(\r\n remove_stopwords=True,\r\n perform_spellcheck=True,\r\n use_stemming=False,\r\n use_lemmatization=True,\r\n custom_stopwords={\"example\", \"test\"},\r\n case_sensitive=False,\r\n detect_language=True,\r\n target_language=\"en\"\r\n)\r\n\r\n# Input text\r\ntext = \"This is an example! Isn't it great? Visit https://example.com for more \ud83d\ude0a.\"\r\n\r\n# Preprocess the text\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text) # Output: \"this is isnt it great visit for more\"\r\n```\r\n\r\n---\r\n\r\n## Features and Configuration\r\n### 1. Case Sensitivity\r\nControl whether the text should be converted to lowercase:\r\n```python\r\npreprocessor = TextPreprocessor(case_sensitive=True)\r\n```\r\n\r\n### 2. Removing HTML Tags\r\nAutomatically remove HTML tags like `<div>` or `<p>`.\r\n\r\n### 3. Emoji Handling\r\nConvert emojis to text or remove them entirely:\r\n```python\r\nimport emoji\r\ntext = emoji.demojize(\"\ud83d\ude0a Hello!\") # Output: \":blush: Hello!\"\r\n```\r\n\r\n### 4. Stopword Removal\r\nRemove common stopwords, with support for custom lists:\r\n```python\r\ncustom_stopwords = {\"is\", \"an\", \"the\"}\r\npreprocessor = TextPreprocessor(custom_stopwords=custom_stopwords)\r\n```\r\n\r\n### 5. Slang and Contraction Expansion\r\nExpand contractions like \"can't\" to \"cannot\":\r\n```python\r\ntext = \"I can't go\"\r\nexpanded_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 6. Named Entity Recognition (NER) Masking\r\nMask entities like names, organizations, or dates using `spacy`:\r\n```python\r\ntext = \"Barack Obama was the 44th President of the USA.\"\r\nmasked_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 7. Profanity Filtering\r\nCensor offensive words:\r\n```python\r\ntext = \"This is a badword!\"\r\nfiltered_text = preprocessor.clean_text(text)\r\n```\r\n\r\n### 8. 
Language Detection and Translation\r\nDetect the text's language and translate it:\r\n```python\r\npreprocessor = TextPreprocessor(detect_language=True, target_language=\"en\")\r\ntext = \"Bonjour tout le monde\"\r\ntranslated_text = preprocessor.clean_text(text) # Output: \"Hello everyone\"\r\n```\r\n\r\n### 9. Tokenization\r\nTokenize text for further NLP tasks:\r\n```python\r\nfrom nltk.tokenize import word_tokenize\r\ntokens = word_tokenize(\"This is an example.\")\r\nprint(tokens) # Output: ['This', 'is', 'an', 'example', '.']\r\n```\r\n\r\n---\r\n\r\n## Advanced Configuration\r\nCreate a custom pipeline by enabling or disabling specific cleaning steps:\r\n```python\r\npipeline = [\"lowercase\", \"remove_html\", \"remove_urls\", \"remove_stopwords\"]\r\npreprocessor.clean_text(text, pipeline=pipeline)\r\n```\r\n\r\n---\r\n\r\n## Testing\r\nRun unit tests using:\r\n```bash\r\npython -m unittest discover tests\r\n```\r\n\r\n---\r\n\r\n## License\r\nThis project is licensed under the MIT License.\r\n\r\n---\r\n\r\n## Contributing\r\nFeel free to fork the repository, create a feature branch, and submit a pull request. Contributions are welcome!\r\n\r\n---\r\n\r\n## Snippets\r\n### Full Preprocessing Example\r\n```python\r\nfrom kleantext.preprocessor import TextPreprocessor\r\n\r\n# Initialize with default settings\r\npreprocessor = TextPreprocessor(remove_stopwords=True, perform_spellcheck=False)\r\n\r\ntext = \"Hello!!! This is, an example. Isn't it? \ud83d\ude0a\"\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text)\r\n```\r\n\r\n### Profanity Filtering\r\n```python\r\npreprocessor = TextPreprocessor()\r\ntext = \"This is a badword!\"\r\nclean_text = preprocessor.clean_text(text)\r\nprint(clean_text) # Output: \"This is a [CENSORED]!\"\r\n```\r\n\r\n## Usage Examples\r\n\r\n### 1. Converting Text to Lowercase (Optional Case-Sensitive Mode)\r\n```python\r\npreprocessor = TextPreprocessor(case_sensitive=True)\r\ntext = \"Hello WORLD!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"Hello WORLD!\"\r\n\r\npreprocessor = TextPreprocessor(case_sensitive=False)\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"hello world\"\r\n```\r\n\r\n---\r\n\r\n### 2. Removing HTML Tags, Punctuation, Numbers, and Special Characters\r\n```python\r\ntext = \"This is a <b>bold</b> statement! Price: $100.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is a bold statement price\"\r\n```\r\n\r\n---\r\n\r\n### 3. Handling Emojis (Removal or Conversion to Textual Descriptions)\r\n```python\r\ntext = \"I love Python! \ud83d\ude0d\ud83d\udd25\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i love python :heart_eyes: :fire:\"\r\n```\r\n\r\n---\r\n\r\n### 4. Handling Negations\r\n```python\r\ntext = \"I don't like this movie.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i do not like this movie\"\r\n```\r\n\r\n---\r\n\r\n### 5. Removing or Retaining Specific Patterns (Hashtags, Mentions, etc.)\r\n```python\r\ntext = \"Follow @user and check #MachineLearning!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"follow user and check machinelearning\"\r\n```\r\n\r\n---\r\n\r\n### 6. 
Removing Stopwords (With Customizable Stopword Lists)\r\n```python\r\npreprocessor = TextPreprocessor(custom_stopwords={\"example\", \"test\"})\r\ntext = \"This is an example test showing stopword removal.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is an showing stopword removal\"\r\n```\r\n\r\n---\r\n\r\n### 7. Stemming and Lemmatization\r\n#### With Stemming:\r\n```python\r\npreprocessor = TextPreprocessor(use_stemming=True, use_lemmatization=False)\r\ntext = \"running flies better than walking\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"run fli better than walk\"\r\n```\r\n#### With Lemmatization:\r\n```python\r\npreprocessor = TextPreprocessor(use_stemming=False, use_lemmatization=True)\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"running fly better than walking\"\r\n```\r\n\r\n---\r\n\r\n### 8. Correcting Spelling (Optional)\r\n```python\r\npreprocessor = TextPreprocessor(perform_spellcheck=True)\r\ntext = \"Ths is a tst sentnce.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"This is a test sentence\"\r\n```\r\n\r\n---\r\n\r\n### 9. Expanding Contractions and Slang Handling\r\n```python\r\nfrom contractions import fix\r\n\r\ntext = \"I'm gonna go, but I can't wait!\"\r\ntext = fix(text) \r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"i am going to go but i cannot wait\"\r\n```\r\n\r\n---\r\n\r\n### 10. Named Entity Recognition (NER) Masking\r\n```python\r\ntext = \"John Doe lives in New York.\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"[PERSON] lives in [LOCATION]\"\r\n```\r\n\r\n---\r\n\r\n### 11. Detecting and Translating Text to a Target Language\r\n```python\r\npreprocessor = TextPreprocessor(detect_language=True, target_language=\"en\")\r\ntext = \"Bonjour! Comment \u00e7a va?\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"Hello! How are you?\"\r\n```\r\n\r\n---\r\n\r\n### 12. Profanity Filtering\r\n```python\r\ntext = \"This is a f***ing great product!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is a **** great product\"\r\n```\r\n\r\n---\r\n\r\n### 13. Customizable Text Preprocessing Pipeline\r\n```python\r\npreprocessor = TextPreprocessor(\r\n remove_stopwords=True,\r\n perform_spellcheck=True,\r\n use_stemming=False,\r\n use_lemmatization=True,\r\n case_sensitive=False,\r\n detect_language=True,\r\n target_language=\"en\"\r\n)\r\n\r\ntext = \"Ths is an amazng movi!! \ud83d\ude0d\ud83d\udd25 <b>100%</b> recommended!\"\r\ncleaned_text = preprocessor.clean_text(text)\r\nprint(cleaned_text) # Output: \"this is an amazing movie heart_eyes fire recommended\"\r\n```\r\n\r\n---\r\n\r\n## Conclusion\r\nKleanText is a robust and flexible text preprocessing library designed to clean and normalize text efficiently. You can customize the pipeline to fit your specific NLP needs. \ud83d\ude80\r\n\r\n\r\n\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python module for preprocessing text for NLP tasks",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/karan9970/kleantext"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "454d5d342debaf57ee834c29e88258d88b2fc68f1dbcdcd7313bb60b6e8129a6",
"md5": "1f0daeda43d09d994f589a79fc02d9ce",
"sha256": "9aec86020c8fa68a49f3bf19bd3c73423b48aa6ef3cf397bfbfbfe228ed6de83"
},
"downloads": -1,
"filename": "kleantext-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1f0daeda43d09d994f589a79fc02d9ce",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 4964,
"upload_time": "2025-02-02T12:35:37",
"upload_time_iso_8601": "2025-02-02T12:35:37.239856Z",
"url": "https://files.pythonhosted.org/packages/45/4d/5d342debaf57ee834c29e88258d88b2fc68f1dbcdcd7313bb60b6e8129a6/kleantext-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e61d5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37",
"md5": "1f50c635d1ad01dd0ae55aa72b3482ab",
"sha256": "00e00d7885b8f897299962928ac13246de3fcb68f6807366cbad5387b76d759b"
},
"downloads": -1,
"filename": "kleantext-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "1f50c635d1ad01dd0ae55aa72b3482ab",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 5624,
"upload_time": "2025-02-02T12:35:39",
"upload_time_iso_8601": "2025-02-02T12:35:39.494811Z",
"url": "https://files.pythonhosted.org/packages/e6/1d/5fa904e7b01f3a0ab7c3ba68d70c63b60d1c3179a32fb994e3c12088cd37/kleantext-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-02 12:35:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "karan9970",
"github_project": "kleantext",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "kleantext"
}
```