processtext


* **Name**: processtext
* **Version**: 0.1.7
* **Home page**: https://github.com/U77w41/processtext
* **Summary**: An open-source python package to process text data
* **Upload time**: 2024-02-10 10:29:54
* **Author**: Ujjwal Chowdhury
* **Requires Python**: >=3.8
* **License**: MIT
* **Keywords**: python, nlp, text, regex, text processing
* **Requirements**: No requirements were recorded.

# processtext

**processtext** is an open-source Python package to clean raw text data.

<p align="center">
  <a href="https://pypi.org/project/processtext"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/processtext.svg?maxAge=86400" /></a>
  <a href="https://pypi.org/project/processtext"><img alt="Python Versions" src="https://img.shields.io/pypi/pyversions/processtext.svg?maxAge=86400" /></a>
</p>

## Installation

processtext requires [Python 3](https://www.python.org/downloads/) (version 3.8 or later), [NLTK](http://www.nltk.org/install.html), and [Autocorrect](https://github.com/filyp/autocorrect) to run.

To install with pip, run:

`pip install processtext`

[![Downloads](https://static.pepy.tech/badge/processtext)](https://pepy.tech/project/processtext)

## Features 

### The processtext package provides the following functions:
* **degroup_num**: Removes commas (,) between digits inside a string
* **remove_hyphen**: Replaces hyphens (-) between pieces of text with a space
* **int_to_en**: Returns a whole number as English text, e.g. 25 -> twenty-five
* **num_to_en**: Returns the English name of each digit of a number, one by one from left to right
* **float_to_en**: Returns a floating-point number as English text
* **int_to_text**: Replaces all whole numbers inside a string with English text
* **float_to_text**: Replaces all positive rational numbers inside a string with English text
* **decontract_strings**: Expands contractions, e.g. I'm -> I am
* **remove_emoji**: Removes emoji
* **clean_text**: Deep-cleans text by stripping special symbols and punctuation
* **lowercase**: Converts text to lowercase
* **autocorrect**: Corrects spelling mistakes
* **lemmatize**: Lemmatizes the input text
* **remove_sw**: Removes stop words
* **clean**: Cleans raw text and returns the cleaned text as a string
* **clean_l**: Cleans raw text and returns a list of clean words

##### The processtext.clean() and processtext.clean_l() functions can apply all, or a selected combination, of the following cleaning operations:
* Remove special symbols/characters
* Remove digits from the text
* Remove punctuation from the text
* Remove extra white spaces
* Remove or replace parts of the text with a custom regex
* Convert the entire text to lowercase
* Lemmatize the words
* Remove stop words, with a choice of language for the stop words
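The way these switches combine can be illustrated with a small standard-library sketch. This is not the package's implementation: `clean_sketch` and `clean_l_sketch` are hypothetical names, the keyword arguments only mirror the option names shown in the Usage section, and stemming, stop words, and custom regexes are left out.

```python
import re
import string


def clean_sketch(text, extra_spaces=True, lowercase=True, numbers=True, punct=True):
    """Hypothetical stand-in for a cleaning function with on/off switches."""
    if numbers:
        # Delete every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Delete ASCII punctuation and symbols such as @, #, =, / and \.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if extra_spaces:
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    return text


def clean_l_sketch(text, **options):
    """Same cleaning, but return a list of words instead of a string."""
    return clean_sketch(text, **options).split()


print(clean_sketch("TH@@#e Q!@#UicK bR0owN   f*#!@)(O000000X."))
print(clean_l_sketch("TH@@#e Q!@#UicK bR0owN   f*#!@)(O000000X."))
```

Each operation is an independent step, so switching one off (for example `numbers=False`) leaves the others unaffected.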




## Usage

* **Import the library**:

``` python
import processtext as pt
```

* **Choose a method:**

 To return the cleaned text as a string:
 
``` python
pt.clean("your_raw_text_here") 
```
 
 To return the cleaned text as a list of words:
 
``` python
pt.clean_l("your_raw_text_here") 
```
 
 To apply a specific set of cleaning operations:

``` python
pt.clean_l(
    "your_raw_text_here",
    clean_all=False,  # Execute all cleaning operations (disabled here)
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english',  # Language for stop words
)
```

## Examples


``` python
import processtext as pt
pt.degroup_num('111,222,333')
```

returns,

``` Python
'111222333'
```


``` python
import processtext as pt
pt.remove_hyphen('2022-2023')
```

returns,

``` Python
'2022 2023'
```



``` python
import processtext as pt
print(pt.int_to_en(1998))
print(pt.int_to_en('9123456789'))
```

returns,

``` Python
one thousand nine hundred and ninety-eight

nine billion one hundred and twenty-three million four hundred and fifty-six thousand seven hundred and eighty-nine
```
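
For readers curious how such a conversion can work, here is a minimal sketch that splits the number into groups of three digits and names each group. `int_to_en_sketch` is a hypothetical helper, not the package's code; it only covers non-negative integers below one quadrillion and places "and" only after a hundreds word.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion", "trillion"]


def below_thousand(n):
    """Spell out 0 < n < 1000, with 'and' after the hundreds."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " ".join(parts)


def int_to_en_sketch(number):
    """Spell out a non-negative integer given as an int or a digit string."""
    n = int(number)
    if n < 0 or n >= 10 ** 15:
        raise ValueError("sketch only covers 0 <= n < 10**15")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n:
        n, group = divmod(n, 1000)  # peel off the lowest three digits
        if group:
            words = below_thousand(group)
            if SCALES[scale]:
                words += " " + SCALES[scale]
            groups.append(words)
        scale += 1
    return " ".join(reversed(groups))


print(int_to_en_sketch(1998))
print(int_to_en_sketch("9123456789"))
```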


``` python
import processtext as pt
print(pt.num_to_en(12345))
print(pt.num_to_en('09876'))
```

returns,

``` Python
one two three four five

zero nine eight seven six
```
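
Digit-by-digit spelling needs no arithmetic at all, which is why a leading zero survives when the input is passed as a string. A minimal sketch, with `num_to_en_sketch` as a hypothetical name rather than the package's function:

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def num_to_en_sketch(number):
    """Name each digit of an int or digit string, left to right."""
    return " ".join(DIGITS[int(ch)] for ch in str(number))


print(num_to_en_sketch(12345))
print(num_to_en_sketch("09876"))
```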


``` python
import processtext as pt
print(pt.float_to_en(12.345))
print(pt.float_to_en('456.09876'))
```

returns,

``` Python
twelve point three four five

four hundred and fifty-six point zero nine eight seven six
```




``` python
import processtext as pt
pt.int_to_text('First 100 twin primes have values between 3 & 5 and 3821 & 3823')
```

returns,

``` Python
First one hundred twin primes have values between three & five and three thousand eight hundred and twenty-one & three thousand eight hundred and twenty-three
```


``` python
import processtext as pt
pt.float_to_text('The first 10 digits of pi are 3.141592653')
```

returns,

``` Python
The first ten point zero digits of pi are three point one four one five nine two six five three
```



``` python
import processtext as pt
pt.decontract_strings("I can't believe he'll be graduating from college in just a few months.")
```

returns,

``` Python
I can not believe he will be graduating from college in just a few months.
```
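
Expanding contractions is commonly done with an ordered list of substitution rules, special cases first. The rule list below is a small illustrative assumption, not the set of rules the package ships, and `decontract_sketch` is a hypothetical name.

```python
import re

# Special cases come before the generic "n't" rule so they win.
RULES = [
    (r"\b([Cc])an't\b", r"\1an not"),
    (r"\b([Ww])on't\b", r"\1ill not"),
    (r"n't\b", " not"),
    (r"'ll\b", " will"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]


def decontract_sketch(text):
    """Apply each rewrite rule in order."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(decontract_sketch("I can't believe he'll be graduating from college in just a few months."))
```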



``` python
import processtext as pt
pt.remove_emoji("🌞🌊☀️ Just spent an amazing day at the beach with my friends! 🏖️👭👬 We built sandcastles 🏰, played beach volleyball 🏐, and even went for a swim 🏊‍♀️🏊‍♂️. The sun was shining ☀️ and the water was so refreshing 💦. Can't wait to do it again! 🤩👍")
```

returns,

``` Python
 Just spent an amazing day at the beach with my friends!  We built sandcastles , played beach volleyball , and even went for a swim . The sun was shining  and the water was so refreshing . Can't wait to do it again! 
```
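
One common way to strip emoji is a character class over the relevant Unicode blocks. The ranges below are an approximation chosen for this sketch and may differ from what the package matches; `remove_emoji_sketch` is a hypothetical name.

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)


def remove_emoji_sketch(text):
    """Delete every character that falls in the ranges above."""
    return EMOJI_PATTERN.sub("", text)


print(remove_emoji_sketch("Beach day \U0001F3D6\uFE0F done \U0001F44D"))
```

As in the example output above, the spaces that surrounded each emoji are left in place.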



``` python
import processtext as pt
pt.clean_text('The password must contain at least one symbol such as !,^,*,+,=,%,$,~,?,/,<>,|@, #, or %.')
```

returns,

``` Python
The password must contain at least one symbol such as                               or   
```



``` python
import processtext as pt
pt.lowercase('THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')
```

returns,

``` Python
the quick brown fox jumped over the lazy dog.
```





``` python
import processtext as pt
pt.autocorrect("I haven't receeved the package yet, but I think it should arrive somtime tomoro.")
```

returns,

``` Python
I haven't received the package yet, but I think it should arrive sometime tomorrow.
```




``` python
import processtext as pt
pt.lemmatize('they were playing in the garden.')
```

returns,

``` Python
they be play in the garden.
```



``` python
import processtext as pt
pt.remove_sw('I went to the store and bought some milk, bread, and eggs.')
```

returns,

``` Python
went store bought milk, bread, eggs.
```
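
Stop-word removal amounts to filtering tokens against a word list. The sketch below uses a tiny hand-written list purely for illustration; the package presumably draws on NLTK's much larger per-language lists, and `remove_sw_sketch` is a hypothetical name.

```python
# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"i", "to", "the", "and", "some", "a", "an", "of", "in"}


def remove_sw_sketch(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose bare lowercase form is a stop word."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,!?;:") not in stop_words
    ]
    return " ".join(kept)


print(remove_sw_sketch("I went to the store and bought some milk, bread, and eggs."))
```

Punctuation attached to a kept word stays attached, which matches the commas and final period in the output above.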
 


``` python
import processtext as pt
pt.clean(r"TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.", extra_spaces=True, lowercase=True, numbers=True, punct=True)
```

returns,

``` Python
'the quick brown fox jumped over the lazy dog'
```

----

``` Python
import processtext as pt
pt.clean_l(r'TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.', 
           extra_spaces=True, lowercase=True, numbers=True, punct=True)
```

returns,

``` Python
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

----

``` Python
from processtext import clean
text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='********', clean_all=False)

```

returns,

``` Python
'my email id: ******** and your's: ********'
```
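
The masking in this last example has the same effect as a plain `re.sub` call with the same pattern and replacement. The sketch below shows that equivalence using only the standard library; `mask_sketch` is a hypothetical helper, and how the package applies `reg` and `reg_replace` internally is an assumption.

```python
import re

EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def mask_sketch(text, reg, reg_replace):
    """Replace every match of reg in text with reg_replace."""
    return re.sub(reg, reg_replace, text)


text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
masked = mask_sketch(text, EMAIL_PATTERN, "********")
print(masked)
```

The pattern only matches lowercase addresses; passing `flags=re.IGNORECASE` to `re.sub` would be one way to cover mixed case.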

## License

##### MIT


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/U77w41/processtext",
    "name": "processtext",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "python,nlp,text,regex,text processing",
    "author": "Ujjwal Chowdhury",
    "author_email": "<u77w41@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ba/9a/8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9/processtext-0.1.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An open-source python package to process text data",
    "version": "0.1.7",
    "project_urls": {
        "Homepage": "https://github.com/U77w41/processtext"
    },
    "split_keywords": [
        "python",
        "nlp",
        "text",
        "regex",
        "text processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "becff3b88983afaf306b5bd58c4d459a8e60c334daf68b7f635d8cb64b04322a",
                "md5": "a0f3a2805097a968d821f4cba97b2114",
                "sha256": "2bef682f283ae20c78a22d9946066f8b1a186a82eda8a0cf8b643ceeab7640c5"
            },
            "downloads": -1,
            "filename": "processtext-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a0f3a2805097a968d821f4cba97b2114",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 9368,
            "upload_time": "2024-02-10T10:29:52",
            "upload_time_iso_8601": "2024-02-10T10:29:52.557769Z",
            "url": "https://files.pythonhosted.org/packages/be/cf/f3b88983afaf306b5bd58c4d459a8e60c334daf68b7f635d8cb64b04322a/processtext-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ba9a8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9",
                "md5": "ae219148ce8bd2d5ad41cc3f04b462e1",
                "sha256": "9a36e4f6b2539358d36414f66adc8f783ac5effc18f7ff04988fccdfe801eef3"
            },
            "downloads": -1,
            "filename": "processtext-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "ae219148ce8bd2d5ad41cc3f04b462e1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 11523,
            "upload_time": "2024-02-10T10:29:54",
            "upload_time_iso_8601": "2024-02-10T10:29:54.631438Z",
            "url": "https://files.pythonhosted.org/packages/ba/9a/8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9/processtext-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-10 10:29:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "U77w41",
    "github_project": "processtext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "processtext"
}