<h1 align="center">
<p> TextPrepro - Text Preprocessing </p>
</h1>
<p align="center">
<a href="https://pypi.org/project/textprepro">
<img src="https://img.shields.io/pypi/v/textprepro.svg?logo=pypi&logoColor=white"
alt="PyPI">
</a>
<a href="https://pypi.org/project/textprepro">
<img src="https://img.shields.io/pypi/pyversions/textprepro.svg?logo=python&logoColor=white"
alt="Python">
</a>
<a href="https://codecov.io/gh/umapornp/textprepro">
<img src="https://img.shields.io/codecov/c/github/umapornp/textprepro?logo=codecov"
alt="Codecov">
</a>
<a href="https://github.com/umapornp/textprepro/blob/master/LICENSE">
<img src="https://img.shields.io/github/license/umapornp/textprepro.svg?logo=github"
alt="License">
</a>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/umapornp/textprepro/main/assets/banner.png">
</p>
**TextPrepro** - Everything Everyway All At Once Text Preprocessing: it lets you preprocess both general and social media text with easy-to-use features, helps you gain insight from your data with analytical tools, and stands on the shoulders of well-known libraries such as NLTK, spaCy, and Gensim.
---------------------------------
### Table of Contents
* [⏳ Installation](#⏳-installation)
* [🚀 Quickstart](#🚀-quickstart)
    * [🧹 Simply preprocess with the pipeline](#🧹-simply-preprocess-with-the-pipeline)
    * [📂 Work with document or DataFrame](#📂-work-with-document-or-dataframe)
    * [🪐 Customize your own pipeline](#🪐-customize-your-own-pipeline)
* [💡 Features & Guides](#💡-features--guides)
    * [📋 For General Text](#📋-for-general-text)
    * [📱 For Social Media Text](#📱-for-social-media-text)
    * [🌐 For Web Scraping Text](#🌐-for-web-scraping-text)
    * [📈 Analytical Tools](#📈-analytical-tools)
---------------------------------
## ⏳ Installation
Simply install via `pip`:
```bash
pip install textprepro
```
or install directly from GitHub:
```bash
pip install "git+https://github.com/umapornp/textprepro"
```
---------------------------------
## 🚀 Quickstart
* ### 🧹 Simply preprocess with the pipeline
You can preprocess your textual data by using the function `preprocess_text()` with the default pipeline as follows:
```python
>>> import textprepro as pre
# Preprocess text.
>>> text = "ChatGPT is AI chatbot developed by OpenAI It is built on top of OpenAI GPT foundational large language models and has been fine-tuned an approach to transfer learning using both supervised and reinforcement learning techniques"
>>> text = pre.preprocess_text(text)
>>> text
"chatgpt ai chatbot developed openai built top openai gpt foundational large language model finetuned approach transfer learning using supervised reinforcement learning technique"
```
* ### 📂 Work with document or DataFrame
You can preprocess your document or dataframe as follows:
* If you work with a list of strings, you can use the function `preprocess_document()` to preprocess each of them.
```python
>>> import textprepro as pre
>>> document = ["Hello123", "World!@&"]
>>> document = pre.preprocess_document(document)
>>> document
["hello", "world"]
```
* If you work with a dataframe, you can use `DataFrame.apply()` with `preprocess_text()` to preprocess each row.
```python
>>> import textprepro as pre
>>> import pandas as pd
>>> document = {"text": ["Hello123", "World!@&"]}
>>> df = pd.DataFrame(document)
>>> df["clean_text"] = df["text"].apply(pre.preprocess_text)
>>> df
```
| text | clean_text |
| :-------- | :--------- |
| Hello123 | hello |
| World!@& | world |
* ### 🪐 Customize your own pipeline
You can customize your own preprocessing pipeline as follows:
```python
>>> import textprepro as pre
# Customize pipeline.
>>> pipeline = [
pre.lower,
pre.remove_punctuations,
pre.expand_contractions,
pre.lemmatize
]
>>> text = "ChatGPT is AI chatbot developed by OpenAI It is built on top of OpenAI GPT foundational large language models and has been fine-tuned an approach to transfer learning using both supervised and reinforcement learning techniques"
>>> text = pre.preprocess_text(text=text, pipeline=pipeline)
>>> text
"chatgpt is ai chatbot developed by openai it is built on top of openai gpt foundational large language model and ha been finetuned an approach to transfer learning using both supervised and reinforcement learning technique"
```
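The same custom pipeline can also be applied to a DataFrame column. A minimal sketch, reusing the `pandas` pattern from the earlier example (the example sentences and output column name are illustrative):
```python
>>> import pandas as pd
>>> import textprepro as pre

# Reuse the custom pipeline defined above.
>>> pipeline = [pre.lower, pre.remove_punctuations, pre.expand_contractions, pre.lemmatize]

>>> df = pd.DataFrame({"text": ["She can't swim!!", "He works at a school."]})

# Apply preprocess_text() with the custom pipeline to each row.
>>> df["clean_text"] = df["text"].apply(lambda t: pre.preprocess_text(t, pipeline=pipeline))
```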
---------------------------------
## 💡 Features & Guides
TextPrepro provides many easy-to-use features for preprocessing general text as well as social media text. Apart from preprocessing tools, it also provides useful analytical tools to help you gain insight from your data (e.g., word distribution graphs and word clouds).
* ### 📋 For General Text
<!-- Misspelling Correction -->
<details>
<Summary> 👇 Misspelling Correction </Summary>
Correct misspelled words:
```python
>>> import textprepro as pre
>>> text = "she loves swiming"
>>> text = pre.correct_spelling(text)
>>> text
"she loves swimming"
```
</details>
<!-- Emoji & Emoticon -->
<details>
<Summary> 👇 Emoji & Emoticon </Summary>
Remove, replace, or decode emojis (e.g., 👍, 😊, ❤️):
```python
>>> import textprepro as pre
>>> text = "very good π"
# Remove.
>>> text = pre.remove_emoji(text)
>>> text
"very good "
# Replace.
>>> text = pre.replace_emoji(text, "[EMOJI]")
>>> text
"very good [EMOJI]"
# Decode.
>>> text = pre.decode_emoji(text)
>>> text
"very good :thumbs_up:"
```
Remove, replace, or decode emoticons (e.g., :-), (>_<), (^o^)):
```python
>>> import textprepro as pre
>>> text = "thank you :)"
# Remove.
>>> text = pre.remove_emoticons(text)
>>> text
"thank you "
# Replace.
>>> text = pre.replace_emoticons(text, "[EMOTICON]")
>>> text
"thank you [EMOTICON]"
# Decode.
>>> text = pre.decode_emoticons(text)
>>> text
"thank you happy_face_or_smiley"
```
</details>
<!-- URLs -->
<details>
<Summary> 👇 URL </Summary>
Remove or replace URLs:
```python
>>> import textprepro as pre
>>> text = "my url https://www.google.com"
# Remove.
>>> text = pre.remove_urls(text)
>>> text
"my url "
# Replace.
>>> text = pre.replace_urls(text, "[URL]")
>>> text
"my url [URL]"
```
</details>
<!-- Email -->
<details>
<Summary> 👇 Email </Summary>
Remove or replace emails.
```python
>>> import textprepro as pre
>>> text = "my email name.surname@user.com"
# Remove.
>>> text = pre.remove_emails(text)
>>> text
"my email "
# Replace.
>>> text = pre.replace_emails(text, "[EMAIL]")
>>> text
"my email [EMAIL]"
```
</details>
<!-- Number & Phone Number -->
<details>
<Summary> 👇 Number & Phone Number </Summary>
Remove or replace numbers.
```python
>>> import textprepro as pre
>>> text = "my number 123"
# Remove.
>>> text = pre.remove_numbers(text)
>>> text
"my number "
# Replace (the replacement token is assumed here, as with the other replace_* helpers).
>>> text = pre.replace_numbers(text, "[NUMBER]")
>>> text
"my number [NUMBER]"
```
Remove or replace phone numbers.
```python
>>> import textprepro as pre
>>> text = "my phone number +1 (123)-456-7890"
# Remove.
>>> text = pre.remove_phone_numbers(text)
>>> text
"my phone number "
# Replace.
>>> text = pre.replace_phone_numbers(text, "[PHONE]")
>>> text
"my phone number [PHONE]"
```
</details>
<!-- Contraction -->
<details>
<Summary> 👇 Contraction </Summary>
Expand contractions (e.g., can't, shouldn't, don't).
```python
>>> import textprepro as pre
>>> text = "she can't swim"
>>> text = pre.expand_contractions(text)
>>> text
"she cannot swim"
```
</details>
<!-- Stopwords -->
<details>
<Summary> 👇 Stopword </Summary>
Remove stopwords. You can also specify the stopword source: `nltk` (default), `spacy`, `sklearn`, or `gensim`.
```python
>>> import textprepro as pre
>>> text = "her dog is so cute"
# Default stopword is NLTK.
>>> text = pre.remove_stopwords(text)
>>> text
"dog cute"
# Use stopwords from Spacy.
>>> text = pre.remove_stopwords(text, stpwords="spacy")
>>> text
"dog cute"
```
</details>
<!-- Punctuation & Special Character & Whitespace -->
<details>
<Summary> 👇 Punctuation & Special Character & Whitespace </Summary>
Remove punctuation:
```python
>>> import textprepro as pre
>>> text = "wow!!!"
>>> text = pre.remove_punctuations(text)
>>> text
"wow"
```
Remove special characters:
```python
>>> import textprepro as pre
>>> text = "hello world!! #happy"
>>> text = pre.remove_special_characters(text)
>>> text
"hello world happy"
```
Remove whitespace:
```python
>>> import textprepro as pre
>>> text = " hello world "
>>> text = pre.remove_whitespace(text)
>>> text
"hello world"
```
</details>
<!-- Non-ASCII Character (Accent Character) -->
<details>
<Summary> 👇 Non-ASCII Character (Accent Character) </Summary>
Standardize non-ASCII characters (accent characters):
```python
>>> import textprepro as pre
>>> text = "lattΓ© cafΓ©"
>>> text = pre.standardize_non_ascii(text)
>>> text
"latte cafe"
```
</details>
<!-- Stemming & Lemmatization -->
<details>
<Summary> 👇 Stemming & Lemmatization </Summary>
Stem text:
```python
>>> import textprepro as pre
>>> text = "discover the truth"
>>> text = pre.stem(text)
>>> text
"discov the truth"
```
Lemmatize text:
```python
>>> import textprepro as pre
>>> text = "he works at a school"
>>> text = pre.lemmatize(text)
>>> text
"he work at a school"
```
</details>
<!-- Lowercase & Uppercase -->
<details>
<Summary> 👇 Lowercase & Uppercase </Summary>
Convert text to lowercase & uppercase:
```python
>>> import textprepro as pre
>>> text = "Hello World"
# Lowercase
>>> text = pre.lower(text)
>>> text
"hello world"
# Uppercase
>>> text = pre.upper(text)
>>> text
"HELLO WORLD"
```
</details>
<!-- Tokenization -->
<details>
<Summary> 👇 Tokenization </Summary>
Tokenize text. You can also specify the tokenizer type: `word` or `tweet`.
```python
>>> import textprepro as pre
>>> text = "hello world @user #hashtag"
# Tokenize word.
>>> text = pre.tokenize(text, "word")
>>> text
["hello", "world", "@", "user", "#", "hashtag"]
# Tokenize tweet.
>>> text = pre.tokenize(text, "tweet")
>>> text
["hello", "world", "@user", "#hashtag"]
```
</details>
* ### 📱 For Social Media Text
<!-- Slang -->
<details>
<Summary> 👇 Slang </Summary>
Remove, replace, or expand slang terms:
```python
>>> import textprepro as pre
>>> text = "i will brb"
# Remove
>>> pre.remove_slangs(text)
"i will "
# Replace
>>> pre.replace_slangs(text, "[SLANG]")
"i will [SLANG]"
# Expand
>>> pre.expand_slangs(text)
"i will be right back"
```
</details>
<!-- Mention -->
<details>
<Summary> 👇 Mention </Summary>
Remove or replace mentions.
```python
>>> import textprepro as pre
>>> text = "@user hello world"
# Remove
>>> text = pre.remove_mentions(text)
>>> text
"hello world"
# Replace
>>> text = pre.replace_mentions(text)
>>> text
"[MENTION] hello world"
```
</details>
<!-- Hashtag -->
<details>
<Summary> 👇 Hashtag </Summary>
Remove or replace hashtags.
```python
>>> import textprepro as pre
>>> text = "hello world #twitter"
# Remove
>>> text = pre.remove_hashtags(text)
>>> text
"hello world"
# Replace
>>> text = pre.replace_hashtags(text, "[HASHTAG]")
>>> text
"hello world [HASHTAG]"
```
</details>
<!-- Retweet -->
<details>
<Summary> 👇 Retweet </Summary>
Remove retweet prefix.
```python
>>> import textprepro as pre
>>> text = "RT @user: hello world"
>>> text = pre.remove_retweet_prefix(text)
>>> text
"hello world"
```
</details>
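Because these helpers are plain functions, they can also be combined into a custom pipeline and passed to `preprocess_text()`. A minimal sketch (the pipeline below and its ordering are only an illustration; the exact result depends on each step's behavior):
```python
>>> import textprepro as pre

# Illustrative social-media pipeline built from the helpers above.
>>> social_pipeline = [
        pre.remove_retweet_prefix,
        pre.remove_urls,
        pre.remove_mentions,
        pre.remove_hashtags,
        pre.expand_slangs,
        pre.lower,
    ]

>>> text = "RT @user: omw to the beach #vacation https://t.co/abc"
>>> text = pre.preprocess_text(text=text, pipeline=social_pipeline)
```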
* ### 🌐 For Web Scraping Text
<!-- HTML Tag -->
<details>
<Summary> 👇 HTML Tag </Summary>
Remove HTML tags.
```python
>>> import textprepro as pre
>>> text = "<head> hello </head> <body> world </body>"
>>> text = pre.remove_html_tags(text)
>>> text
"hello world"
```
</details>
* ### 📈 Analytical Tools
<!-- Word Distribution -->
<details>
<Summary> 👇 Word Distribution </Summary>
Find word distribution.
```python
>>> import textprepro as pre
>>> document = "love me love my dog"
>>> word_dist = pre.find_word_distribution(document)
>>> word_dist
Counter({"love": 2, "me": 1, "my": 1, "dog": 1})
```
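If stopwords would otherwise dominate the counts, you can clean the document before computing the distribution. A minimal sketch (the exact counts depend on the stopword list in use):
```python
>>> import textprepro as pre

>>> document = "love me love my dog"

# Remove stopwords first, then count the remaining words.
>>> word_dist = pre.find_word_distribution(pre.remove_stopwords(document))
>>> word_dist
Counter({"love": 2, "dog": 1})
```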
Plot word distribution in a bar graph.
```python
>>> import textprepro as pre
>>> document = "ChatGPT is AI chatbot developed by OpenAI It is built on top of OpenAI GPT foundational large language models and has been fine-tuned an approach to transfer learning using both supervised and reinforcement learning techniques"
>>> word_dist = pre.find_word_distribution(document)
>>> pre.plot_word_distribution(word_dist)
```
<p align="center">
<img src="https://raw.githubusercontent.com/umapornp/textprepro/main/assets/word_dist.png">
</p>
</details>
<!-- Word Cloud -->
<details>
<Summary> 👇 Word Cloud </Summary>
Generate word cloud.
```python
>>> import textprepro as pre
>>> document = "ChatGPT is AI chatbot developed by OpenAI It is built on top of OpenAI GPT foundational large language models and has been fine-tuned an approach to transfer learning using both supervised and reinforcement learning techniques"
>>> pre.generate_word_cloud(document)
```
<p align="center">
<img src="https://raw.githubusercontent.com/umapornp/textprepro/main/assets/word_cloud.png">
</p>
</details>
<!-- Rare & Frequent Word -->
<details>
<Summary> 👇 Rare & Frequent Word</Summary>
Remove rare or frequent words.
```python
>>> import textprepro as pre
>>> document = "love me love my dog"
# Remove rare words.
>>> pre.remove_rare_words(document, num_words=2)
"love me love"
# Remove frequent words.
>>> pre.remove_freq_words(document, num_words=2)
"my dog"
```
</details>