<div align="center">
# SqueakyCleanText
[![PyPI](https://img.shields.io/pypi/v/squeakycleantext.svg)](https://pypi.org/project/squeakycleantext/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/squeakycleantext)](https://pypistats.org/packages/squeakycleantext)
[![Python package](https://github.com/rhnfzl/SqueakyCleanText/actions/workflows/python-package.yml/badge.svg)](https://github.com/rhnfzl/SqueakyCleanText/actions/workflows/python-package.yml)
[![Python Versions](https://img.shields.io/badge/Python-3.10%20|%203.11%20|%203.12-blue)](https://pypi.org/project/squeakycleantext/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
</div>
In machine learning and natural language processing, clean and well-structured text is crucial for building effective downstream models and for managing token limits in language models.

SqueakyCleanText simplifies this by automatically addressing common text issues, leaving your data clean and well-structured with minimal effort on your part.
### Key Features
- **Encoding Issues**: Fixes text encoding problems and bad Unicode characters.
- **HTML and URLs**: Removes or replaces HTML tags and URLs with configurable tokens.
- **Contact Information**: Replaces emails, phone numbers, and other contact details with customizable tokens.
- **Named Entity Recognition (NER)**:
  - Multi-language support (English, Dutch, German, Spanish)
  - Ensemble voting technique for improved accuracy
  - Configurable confidence thresholds
  - Efficient batch processing
  - Automatic text chunking for long documents
  - GPU acceleration support
- **Text Normalization**:
  - Removes isolated letters and symbols
  - Normalizes whitespace
  - Handles currency symbols
  - Detects and replaces years
  - Standardizes numbers
- **Language Support**:
  - Automatic language detection
  - Language-specific NER models
  - Language-aware stopword removal
- **Dual Output Formats**:
  - Language Model format (preserves structure, replaces entities with tokens)
  - Statistical Model format (optimized for classical ML)
- **Performance Optimization**:
  - Batch processing support
  - Configurable batch sizes
  - Memory-efficient processing of large texts
  - GPU memory management
![Default Flow of cleaning Text](resources/sct_flow.png)
### Benefits
#### For Language Models
- Maintains text structure while anonymizing sensitive information
- Offers configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking
#### For Statistical Models
- Removes stopwords and punctuation
- Normalizes case
- Removes special symbols
- Produces output optimized for classification tasks (both formats are illustrated in the sketch below)
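
The sketch below is illustrative (the exact tokens and formatting depend on your configuration, and the comments show assumed rather than verbatim output); it contrasts the two formats for the same input:

```python
from sct import sct

sx = sct.TextCleaner()
lm_text, stat_text, language = sx.process(
    "Contact Jane Doe at jane.doe@example.com or visit https://example.com."
)

# LM format: sentence structure kept, sensitive items replaced with
# tokens such as <EMAIL> and <URL> (token names are configurable, see below)
print(lm_text)

# Statistical format: lowercased, stopwords and punctuation stripped
print(stat_text)

print(language)  # detected language, e.g. "ENGLISH"
```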
#### Advanced NER Processing
- Ensemble voting reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds give control over precision
- Batch processing handles large datasets efficiently
- Long documents are chunked and processed automatically
## Installation
```sh
pip install SqueakyCleanText
```
## Usage
### Basic Usage
```python
from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process a single text: returns (LM-format text, statistical-format text, detected language)
text = "Hey John Doe, email me at john.doe@example.com"
lm_text, stat_text, language = sx.process(text)

# Process multiple texts efficiently in batches
texts = ["Text 1", "Text 2", "Text 3"]
results = sx.process_batch(texts, batch_size=2)
```
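
Assuming `process_batch` returns one `(lm_text, stat_text, language)` tuple per input, mirroring `process`, the results can be unpacked directly (this continues the example above):

```python
# Assumption: each entry mirrors process() -> (lm_text, stat_text, language)
for lm_text, stat_text, language in results:
    print(f"[{language}] {stat_text}")
```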
### Advanced Configuration
```python
from sct import sct, config
# Customize NER settings
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
# Customize replacement tokens
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"
# Set the language explicitly to skip automatic detection
config.LANGUAGE = "ENGLISH"  # Options: ENGLISH, DUTCH, GERMAN, SPANISH
# Initialize with custom settings
sx = sct.TextCleaner()
```
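
Note that the examples set `config` values before constructing `TextCleaner`, which suggests the module-level configuration is read at initialization; apply any overrides first.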
## API
### `sct.TextCleaner`
#### `process(text: str) -> Tuple[str, str, str]`
Processes the input text and returns a tuple containing:
- Cleaned text formatted for language models.
- Cleaned text formatted for statistical models (stopwords removed).
- Detected language of the text.
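#### `process_batch(texts: List[str], batch_size: int) -> List[Tuple[str, str, str]]`
Processes a list of texts in batches and returns one tuple per input, in the same format as `process`. The signature shown here is inferred from the Basic Usage example above.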
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgements
This package took inspiration from the following repo:
- [clean-text](https://github.com/jfilter/clean-text)