Name | tidyX |
Version | 1.6.7 |
Summary | Python package to clean raw tweets for ML applications |
upload_time | 2023-10-22 02:43:54 |
author | Lucas Gómez Tobón, Jose Fernando Barrera |
license | MIT |
requirements | No requirements were recorded. |
# tidyX
`tidyX` is a Python package for cleaning and preprocessing text for machine learning applications, **especially text written in Spanish and sourced from social networks.** The library provides a complete pipeline to remove unwanted characters, normalize text, and group similar terms, making it easier to prepare corpora for NLP applications.
**For a deeper dive into the package, visit our [website](https://tidyx.readthedocs.io/en/latest/).**
## Installation
Install the package using pip:
```bash
pip install tidyX
```
Make sure you have the necessary dependencies installed. If you plan to lemmatize, you'll need `spaCy` along with the appropriate language models. For Spanish lemmatization, we recommend the `es_core_news_sm` model:
```bash
python -m spacy download es_core_news_sm
```
For English lemmatization, we suggest the `en_core_web_sm` model:
```bash
python -m spacy download en_core_web_sm
```
To see a full list of available models for different languages, visit [spaCy's documentation](https://spacy.io/models/).
## Features
- **Standardize Text Pipeline**: The `preprocess()` method provides an all-encompassing solution for quickly and effectively standardizing input strings, with a particular focus on tweets. It transforms the input to lowercase, strips accents (and emojis, if specified), and removes URLs, hashtags, and certain special characters. Additionally, it offers the option to delete stopwords in a specified language, trims extra spaces, extracts mentions, and removes 'RT' prefixes from retweets.
```python
from tidyX import TextPreprocessor as tp
# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"
# Applying the preprocess method
cleaned_text = tp.preprocess(raw_tweet)
# Printing the cleaned text
print("Cleaned Text:", cleaned_text)
```
**Output**:
```
Cleaned Text: check out this link
```
To remove English stopwords, simply add the parameters `remove_stopwords=True` and `language_stopwords="english"`:
```python
from tidyX import TextPreprocessor as tp
# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"
# Applying the preprocess method with additional parameters
cleaned_text = tp.preprocess(raw_tweet, remove_stopwords=True, language_stopwords="english")
# Printing the cleaned text
print("Cleaned Text:", cleaned_text)
```
**Output**:
```
Cleaned Text: check link
```
For a more detailed explanation of the customizable steps of the function, visit the official [preprocess() documentation](https://tidyx.readthedocs.io/en/latest/api/TextPreprocessor.html#tidyX.text_preprocessor.TextPreprocessor.preprocess).
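The kind of standardization `preprocess()` applies can be approximated with a few regular expressions. The sketch below is a minimal stdlib illustration of those steps, not tidyX's implementation; the function name `clean_tweet` is ours:

```python
import re
import unicodedata

def clean_tweet(text: str) -> str:
    """Naive sketch of tweet standardization; not tidyX's preprocess()."""
    text = re.sub(r"^RT\s+", "", text)        # drop the retweet prefix
    text = re.sub(r"@\w+:?", "", text)        # drop mentions
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"#\w+", "", text)          # drop hashtags
    text = text.lower()
    # strip accents and any remaining non-ASCII symbols (emojis, etc.)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)       # drop leftover punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("RT @user: Check out this link: https://example.com 🌍 #example 😃"))
# → check out this link
```

The real method exposes each of these steps as a configurable option rather than a fixed sequence.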
- **Stemming and Lemmatizing**: One of the foundational steps in preparing text for NLP applications is bringing words to a common base or root. This library provides both `stemmer()` and `lemmatizer()` functions to perform this task across various languages.
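  The idea of reducing words to a common root can be illustrated with a toy suffix stripper. This is a crude plain-Python sketch of the concept only; tidyX's `stemmer()` relies on proper language-aware stemming rather than a hand-written suffix list:

  ```python
  def naive_stem(word: str) -> str:
      """Toy Spanish suffix stripper; illustrates the idea, not tidyX's stemmer()."""
      # ordered longest-first so "caminando" loses "ando" rather than just "o"
      for suffix in ("aciones", "amiento", "mente", "iendo", "ando", "ar", "er", "ir", "s"):
          if word.endswith(suffix) and len(word) - len(suffix) >= 3:
              return word[: -len(suffix)]
      return word

  for w in ("caminar", "caminando", "caminas"):
      print(w, "->", naive_stem(w))
  # all three collapse to the shared root "camin"
  ```

  Collapsing inflected variants onto one root like this is what shrinks the vocabulary before building term-frequency matrices.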
- **Group similar terms**: When working with a corpus sourced from social networks, it's common to encounter texts with grammatical errors or words that aren't formally included in dictionaries. These irregularities can pose challenges when creating Term Frequency matrices for NLP algorithms. To address this, we developed the [`create_bol()`](https://tidyx.readthedocs.io/en/latest/examples/tutorial.html#create-bol) function, which allows you to create specific bags of terms to cluster related terms.
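  The intuition behind clustering misspelled variants can be sketched with stdlib fuzzy matching. The greedy grouping below is only an illustration of the underlying idea, not the `create_bol()` API:

  ```python
  from difflib import SequenceMatcher

  def group_similar(terms, threshold=0.8):
      """Greedily put each term into the first bag whose representative is similar enough."""
      bags = []  # each bag is a list of variants; bag[0] is its representative
      for term in terms:
          for bag in bags:
              if SequenceMatcher(None, term, bag[0]).ratio() >= threshold:
                  bag.append(term)
                  break
          else:
              bags.append([term])
      return bags

  # misspelled variants typical of social-network text
  print(group_similar(["gobierno", "govierno", "gobiernoo", "protesta", "protestas"]))
  # → [['gobierno', 'govierno', 'gobiernoo'], ['protesta', 'protestas']]
  ```

  Mapping each bag to a single column keeps typos from inflating the vocabulary of a Term Frequency matrix.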
- **Remove unwanted elements**: strips special characters, extra spaces, accents, emojis, URLs, and Twitter mentions, among others.
- **Dependency Parsing Visualization**: Incorporates visualization tools that enable the display of dependency parses, facilitating linguistic analysis and feature engineering.
- **Much more!**
## Contributing
Contributions to enhance `tidyX` are welcome! Feel free to open issues for bug reports, feature requests, or submit pull requests.