# ========== processtext ==========
**processtext** is an open-source Python package for cleaning raw text data.
<p align="center">
<a href="https://pypi.org/project/processtext"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/processtext.svg?maxAge=86400" /></a>
<a href="https://pypi.org/project/processtext"><img alt="Python Versions" src="https://img.shields.io/pypi/pyversions/processtext.svg?maxAge=86400" /></a>
</p>
## Installation
processtext requires [Python 3](https://www.python.org/downloads/), [NLTK](http://www.nltk.org/install.html), and [Autocorrect](https://github.com/filyp/autocorrect) to execute.
To install using pip, use
`pip install processtext`
[![Downloads](https://static.pepy.tech/badge/processtext)](https://pepy.tech/project/processtext)
## Features
### The processtext package provides the following functions:
* **degroup_num**: Removes commas (,) between digits inside a string
* **remove_hyphen**: Replaces hyphens (-) between pieces of text with a space, e.g. 2022-2023 -> 2022 2023
* **int_to_en**: Returns a whole number as English text, e.g. 25 -> twenty-five
* **num_to_en**: Returns the English word for each digit of a number, from left to right
* **float_to_en**: Converts a floating-point number into English text
* **int_to_text**: Replaces all whole numbers inside a string with English text
* **float_to_text**: Replaces all positive rational numbers inside a string with English text
* **decontract_strings**: Expands contractions, e.g. I'm -> I am
* **remove_emoji**: Removes emoji
* **clean_text**: Deep-cleans text by stripping special symbols
* **lowercase**: Converts text to lowercase
* **autocorrect**: Corrects spelling mistakes
* **lemmatize**: Lemmatizes the input text
* **remove_sw**: Removes stop words
* **clean**: Cleans raw text and returns the cleaned text as a string
* **clean_l**: Cleans raw text and returns a list of clean words
##### The processtext.clean() and processtext.clean_l() functions can apply all, or a selected combination, of the following cleaning operations:
* Remove special symbols/characters
* Remove digits from the text
* Remove punctuation from the text
* Remove extra white spaces
* Remove or replace parts of the text with a custom regex
* Convert the entire text to lowercase
* Lemmatize the words
* Remove stop words, and choose a language for stop words
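To make the idea of a flag-driven cleaner concrete, here is a minimal standard-library sketch. It is not processtext's implementation: the function name `clean_sketch` and the order in which the steps run are assumptions made for illustration, and lemmatization and stop-word removal are left out because they need NLTK.

``` python
import re
import string


def clean_sketch(text, lowercase=True, numbers=True, punct=True,
                 extra_spaces=True, reg="", reg_replace=""):
    """Hypothetical stand-in for a flag-driven cleaner (not processtext's code)."""
    if reg:
        # Apply the custom regex first so it sees the untouched text.
        text = re.sub(reg, reg_replace, text)
    if lowercase:
        text = text.lower()
    if numbers:
        # Drop every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Delete ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if extra_spaces:
        # Collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(clean_sketch("Hello,   World 42!"))  # hello world
```

Each flag switches one step on or off, so a call such as `clean_sketch(text, numbers=False)` keeps digits while still applying the other steps.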
## Usage
* **Import the library**:
``` python
import processtext as pt
```
* **Choose a method:**
To return the text in a string format,
``` python
pt.clean("your_raw_text_here")
```
To return a list of words from the text,
``` python
pt.clean_l("your_raw_text_here")
```
To choose a specific set of cleaning operations,
``` python
pt.clean_l(
    "your_raw_text_here",
    clean_all=False,  # Execute all cleaning operations (disabled here)
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english',  # Language for stop words
)
```
## Examples
``` python
import processtext as pt
pt.degroup_num('111,222,333')
```
returns,
``` Python
'111222333'
```
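The behaviour above can be approximated with one regular expression. This is a sketch of the technique, not the package's source; `degroup_num_sketch` is a hypothetical name.

``` python
import re


def degroup_num_sketch(text):
    # Remove a comma only when it sits directly between two digits,
    # so commas used as ordinary punctuation are left alone.
    return re.sub(r"(?<=\d),(?=\d)", "", text)


print(degroup_num_sketch("111,222,333"))            # 111222333
print(degroup_num_sketch("Pay 1,500 now, please"))  # Pay 1500 now, please
```

Note that a comma-separated list of digits written without spaces, such as `1,2,3`, would also be merged by this pattern.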
``` python
import processtext as pt
pt.remove_hyphen('2022-2023')
```
returns,
``` Python
'2022 2023'
```
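As the output shows, the hyphen is replaced by a space rather than deleted. A comparable effect can be had with the standard library; the helper name `remove_hyphen_sketch` and the choice to touch only hyphens that sit between word characters are assumptions of this sketch.

``` python
import re


def remove_hyphen_sketch(text):
    # Replace a hyphen with a space only when it joins two word characters.
    return re.sub(r"(?<=\w)-(?=\w)", " ", text)


print(remove_hyphen_sketch("2022-2023"))  # 2022 2023
```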
``` python
import processtext as pt
print(pt.int_to_en(1998))
print(pt.int_to_en('9123456789'))
```
returns,
``` Python
one thousand nine hundred and ninety-eight
nine billion one hundred and twenty-three million four hundred and fifty-six thousand seven hundred and eighty-nine
```
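For readers curious how such a conversion can work, the sketch below splits the number into groups of three digits and names each group. It is an illustration under stated assumptions, not the package's code: `int_to_en_sketch` is a hypothetical name, only values from 0 up to (but not including) one trillion are handled, and "and" is inserted only after "hundred".

``` python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", " thousand", " million", " billion"]


def below_thousand(n):
    """Spell out 1..999, e.g. 998 -> 'nine hundred and ninety-eight'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " ".join(parts)


def int_to_en_sketch(number):
    n = int(number)  # accepts an int or a string of digits
    if not 0 <= n < 10 ** 12:
        raise ValueError("this sketch supports 0 <= n < 10**12 only")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n:
        n, chunk = divmod(n, 1000)  # peel off the lowest three digits
        if chunk:
            groups.append(below_thousand(chunk) + SCALES[scale])
        scale += 1
    return " ".join(reversed(groups))


print(int_to_en_sketch(1998))  # one thousand nine hundred and ninety-eight
```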
``` python
import processtext as pt
print(pt.num_to_en(12345))
print(pt.num_to_en('09876'))
```
returns,
``` Python
one two three four five
zero nine eight seven six
```
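Digit-by-digit reading is a simple lookup per character. The following is a minimal sketch of that idea (the name `num_to_en_sketch` is hypothetical, and non-digit characters are skipped by assumption):

``` python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def num_to_en_sketch(number):
    # str() lets the function accept ints as well as digit strings;
    # passing a string preserves leading zeros.
    return " ".join(DIGIT_WORDS[int(ch)] for ch in str(number)
                    if ch in "0123456789")


print(num_to_en_sketch(12345))    # one two three four five
print(num_to_en_sketch("09876"))  # zero nine eight seven six
```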
``` python
import processtext as pt
print(pt.float_to_en(12.345))
print(pt.float_to_en('456.09876'))
```
returns,
``` Python
twelve point three four five
four hundred and fifty-six point zero nine eight seven six
```
``` python
import processtext as pt
pt.int_to_text('First 100 twin primes have values between 3 & 5 and 3821 & 3823')
```
returns,
``` Python
First one hundred twin primes have values between three & five and three thousand eight hundred and twenty-one & three thousand eight hundred and twenty-three
```
``` python
import processtext as pt
pt.float_to_text('The first 10 digits of pi are 3.141592653')
```
returns,
``` Python
The first ten point zero digits of pi are three point one four one five nine two six five three
```
``` python
import processtext as pt
pt.decontract_strings("I can't believe he'll be graduating from college in just a few months.")
```
returns,
``` Python
I can not believe he will be graduating from college in just a few months.
```
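Contraction expansion is commonly done with an ordered list of regex substitutions, with irregular forms handled before the generic suffixes. The rule table below is a small assumed sample for illustration and is not taken from processtext; `decontract_sketch` is a hypothetical name.

``` python
import re

# Irregular forms come first so the generic "n't" rule does not mangle them.
RULES = [
    (r"\b([Cc])an't\b", r"\1an not"),
    (r"\b([Ww])on't\b", r"\1ill not"),
    (r"n't\b", " not"),
    (r"'ll\b", " will"),
    (r"'re\b", " are"),
    (r"'m\b", " am"),
]


def decontract_sketch(text):
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(decontract_sketch("I can't believe he'll be graduating."))
# I can not believe he will be graduating.
```

Ambiguous suffixes such as `'s` (possessive or "is") and `'d` ("had" or "would") are deliberately left out of this sketch.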
``` python
import processtext as pt
pt.remove_emoji("🌞🌊☀️ Just spent an amazing day at the beach with my friends! 🏖️👭👬 We built sandcastles 🏰, played beach volleyball 🏐, and even went for a swim 🏊♀️🏊♂️. The sun was shining ☀️ and the water was so refreshing 💦. Can't wait to do it again! 🤩👍")
```
returns,
``` Python
Just spent an amazing day at the beach with my friends! We built sandcastles , played beach volleyball , and even went for a swim . The sun was shining and the water was so refreshing . Can't wait to do it again!
```
``` python
import processtext as pt
pt.clean_text('The password must contain at least one symbol such as !,^,*,+,=,%,$,~,?,/,<>,|@, #, or %.')
```
returns,
``` Python
The password must contain at least one symbol such as or
```
``` python
import processtext as pt
pt.lowercase('THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')
```
returns,
``` Python
the quick brown fox jumped over the lazy dog.
```
``` python
import processtext as pt
pt.autocorrect("I haven't receeved the package yet, but I think it should arrive somtime tomoro.")
```
returns,
``` Python
I haven't received the package yet, but I think it should arrive sometime tomorrow.
```
``` python
import processtext as pt
pt.lemmatize('they were playing in the garden.')
```
returns,
``` Python
they be play in the garden.
```
``` python
import processtext as pt
pt.remove_sw('I went to the store and bought some milk, bread, and eggs.')
```
returns,
``` Python
went store bought milk, bread, eggs.
```
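Stop-word removal amounts to filtering tokens against a word list. The package depends on NLTK, which ships per-language stop-word lists; the sketch below instead uses a tiny hand-written set (an assumption for illustration) so that it runs with the standard library alone. `remove_sw_sketch` is a hypothetical name.

``` python
import string

# A deliberately tiny sample list; a real list is much longer.
STOP_WORDS = {"i", "a", "an", "and", "of", "some", "the", "to"}


def remove_sw_sketch(text, stop_words=STOP_WORDS):
    kept = []
    for token in text.split():
        # Compare without surrounding punctuation and ignoring case,
        # but keep the original token (punctuation included) in the output.
        if token.strip(string.punctuation).lower() not in stop_words:
            kept.append(token)
    return " ".join(kept)


print(remove_sw_sketch("I went to the store and bought some milk, bread, and eggs."))
# went store bought milk, bread, eggs.
```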
``` python
import processtext as pt
pt.clean("TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e.......... L@a/\|z+Y d==OG.", extra_spaces=True, lowercase=True, numbers=True, punct=True)
```
returns,
``` Python
'the quick brown fox jumped over the lazy dog'
```
----
``` Python
import processtext as pt
pt.clean_l('TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e.......... L@a/\|z+Y d==OG.',
extra_spaces=True, lowercase=True, numbers=True, punct=True)
```
returns,
``` Python
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```
----
``` Python
from processtext import clean
text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='********', clean_all=False)
```
returns,
``` Python
'my email id: ******** and your's: ********'
```
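The masking step in this example matches what Python's `re.sub` does with the same pattern. The sketch below uses only the standard library and makes no claim about how `clean` applies `reg` internally. It also shows that the pattern, as written, matches lowercase addresses only unless `re.IGNORECASE` is passed.

``` python
import re

EMAIL = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
print(re.sub(EMAIL, "********", text))
# my email id: ******** and your's: ********

# Uppercase letters fall outside [a-z], so this address is left untouched...
print(re.sub(EMAIL, "********", "Mail Bob@Example.COM"))
# Mail Bob@Example.COM

# ...unless matching is made case-insensitive.
print(re.sub(EMAIL, "********", "Mail Bob@Example.COM", flags=re.IGNORECASE))
# Mail ********
```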
## License
##### MIT
Raw data
{
"_id": null,
"home_page": "https://github.com/U77w41/processtext",
"name": "processtext",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "python,nlp,text,regex,text processing",
"author": "Ujjwal Chowdhury",
"author_email": "<u77w41@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ba/9a/8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9/processtext-0.1.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "An open-source python package to process text data",
"version": "0.1.7",
"project_urls": {
"Homepage": "https://github.com/U77w41/processtext"
},
"split_keywords": [
"python",
"nlp",
"text",
"regex",
"text processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "becff3b88983afaf306b5bd58c4d459a8e60c334daf68b7f635d8cb64b04322a",
"md5": "a0f3a2805097a968d821f4cba97b2114",
"sha256": "2bef682f283ae20c78a22d9946066f8b1a186a82eda8a0cf8b643ceeab7640c5"
},
"downloads": -1,
"filename": "processtext-0.1.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a0f3a2805097a968d821f4cba97b2114",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 9368,
"upload_time": "2024-02-10T10:29:52",
"upload_time_iso_8601": "2024-02-10T10:29:52.557769Z",
"url": "https://files.pythonhosted.org/packages/be/cf/f3b88983afaf306b5bd58c4d459a8e60c334daf68b7f635d8cb64b04322a/processtext-0.1.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ba9a8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9",
"md5": "ae219148ce8bd2d5ad41cc3f04b462e1",
"sha256": "9a36e4f6b2539358d36414f66adc8f783ac5effc18f7ff04988fccdfe801eef3"
},
"downloads": -1,
"filename": "processtext-0.1.7.tar.gz",
"has_sig": false,
"md5_digest": "ae219148ce8bd2d5ad41cc3f04b462e1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 11523,
"upload_time": "2024-02-10T10:29:54",
"upload_time_iso_8601": "2024-02-10T10:29:54.631438Z",
"url": "https://files.pythonhosted.org/packages/ba/9a/8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9/processtext-0.1.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-10 10:29:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "U77w41",
"github_project": "processtext",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "processtext"
}