* **Name**: cleantext
* **Version**: 1.1.4
* **Home page**: https://github.com/prasanthg3/cleantext
* **Summary**: An open-source python package to clean raw text data
* **Upload time**: 2021-12-29 22:08:33
* **Author**: Prasanth Gudiwada
* **License**: MIT
* **Requirements**: nltk

# cleantext

[![Downloads](https://static.pepy.tech/personalized-badge/cleantext?period=month&units=international_system&left_color=grey&right_color=green&left_text=Downloads/month)](https://pepy.tech/project/cleantext)

**cleantext** is an open-source Python package for cleaning raw text data. The source code is available [on GitHub](https://github.com/prasanthg3/cleantext).



## Features 

cleantext provides two main functions:
* **clean**: to clean raw text and return the cleaned text
* **clean_words**: to clean raw text and return a list of clean words

cleantext can apply all, or a selected combination, of the following cleaning operations:
* Remove extra white spaces
* Convert the entire text to lowercase
* Remove digits from the text
* Remove punctuation from the text
* Remove or replace parts of the text that match a custom regex
* Remove stop words, with a choice of stop-word language
  (Stop words are the most common words in a language that carry little meaning on their own, such as is, am, the, this, and are.)
* Stem the words
  (Stemming reduces inflected forms of a word to a common stem. For example, stemming run, runs, and running yields run, run, run.)
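
To make the operations listed above more concrete, here is a minimal standard-library sketch of four of them (lowercasing, digit removal, punctuation removal, and white-space collapsing). It is not cleantext's implementation: the function name `basic_clean` and the order of the steps are assumptions chosen purely for illustration.

```python
import re
import string


def basic_clean(text: str) -> str:
    """Illustrative only: approximates four of the operations listed above."""
    text = text.lower()                       # convert to uniform lowercase
    text = re.sub(r"\d+", "", text)           # remove digits
    # remove every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra white space
    return text


print(basic_clean("  Hello,   World 42!  "))
```

The library itself may handle edge cases (for example, non-ASCII punctuation) differently from this sketch.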

## Installation

cleantext requires [Python 3](https://www.python.org/downloads/) and [NLTK](http://www.nltk.org/install.html) to execute. 

To install with pip, run:

`pip install cleantext`
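
If it is unclear whether the requirements are met in a given environment, a quick check along the following lines may help. This is only a sketch using the standard library; the helper name `check_requirements` is hypothetical and not part of cleantext.

```python
import importlib.util
import sys


def check_requirements(modules=("nltk", "cleantext")) -> dict:
    """Report whether Python 3 is running and which modules can be found."""
    report = {"python3": sys.version_info.major >= 3}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_requirements())
```

A `False` entry for `nltk` or `cleantext` suggests that the corresponding package still needs to be installed.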

## Usage

* **Import the library**:

``` python
import cleantext
```

* **Choose a method:**

 To return the cleaned text as a string:
 
``` python
cleantext.clean("your_raw_text_here") 
```
 
 To return a list of words from the text:
 
``` python
cleantext.clean_words("your_raw_text_here") 
```
 
 To choose a specific set of cleaning operations:

``` python
cleantext.clean_words("your_raw_text_here",
    clean_all=False,  # Set to True to execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english'  # Language for stop words
)
```

## Examples

``` python
import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)
```

returns,

``` Python
'this is a sample text to clean'
```

----

``` Python
import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133')
```

returns,

``` Python
['sampl', 'text', 'clean']
```
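
The output above no longer contains "this", "is", "a", or "to" (stop words), and "sample" has been stemmed to "sampl". The following toy sketch mimics that behaviour with a hard-coded stop-word set and a deliberately naive suffix stripper. Both are assumptions made for illustration: cleantext relies on NLTK, whose stop-word lists and stemming rules are far more complete.

```python
# Tiny illustrative subset; real stop-word lists are much longer.
STOP_WORDS = {"this", "is", "a", "to", "the", "am", "are"}


def toy_stem(word: str) -> str:
    """Naive suffix stripping; a real stemmer applies many more rules."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # undo consonant doubling, e.g. "running" -> "runn" -> "run"
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    for suffix in ("s", "e"):
        if word.endswith(suffix) and len(word) > 3:
            return word[:-1]
    return word


def toy_clean_words(text: str) -> list:
    """Lowercase, split on white space, drop stop words, stem the rest."""
    words = text.lower().split()
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


print(toy_clean_words("This is a sample text to clean"))
```

The toy stemmer will mis-stem many real words; it only shows why a stemmed token such as "sampl" need not be a dictionary word.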

----

``` Python
from cleantext import clean
text = "my id, name1@dom1.com and your, name2@dom2.in"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)

```

returns,

``` Python
"my id, email and your, email"
```
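
For comparison, the same substitution can be reproduced with Python's built-in `re` module. This is a sketch of the regex step alone, not of how cleantext applies `reg` and `reg_replace` internally; the helper name `replace_pattern` is hypothetical.

```python
import re

# Same pattern as in the example above
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def replace_pattern(text: str, pattern: str, replacement: str = "") -> str:
    """Replace every match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)


text = "my id, name1@dom1.com and your, name2@dom2.in"
print(replace_pattern(text, EMAIL_PATTERN, "email"))  # my id, email and your, email
```

Note that the pattern only lists lowercase letters, so with plain `re.sub` an address containing uppercase characters would need `flags=re.IGNORECASE` or a wider character class.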

## License

##### MIT

For questions, bug reports, and suggestions, please open an issue on the [GitHub issue tracker](https://github.com/prasanthg3/cleantext/issues).



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/prasanthg3/cleantext",
    "name": "cleantext",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Prasanth Gudiwada",
    "author_email": "prasanth.gudiwada@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/9e/39/883774dadb46a8ea348ddbdc9dfdb9aaa1a104825e65ee9ebe9a375f46e0/cleantext-1.1.4.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An open-source python package to clean raw text data",
    "version": "1.1.4",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "90047d93770255bb806a85916528a017",
                "sha256": "138a658a8084796793910c876140002435ffc7ce51a9abf28d2a6b059a7a4d13"
            },
            "downloads": -1,
            "filename": "cleantext-1.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "90047d93770255bb806a85916528a017",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 4869,
            "upload_time": "2021-12-29T22:08:32",
            "upload_time_iso_8601": "2021-12-29T22:08:32.003500Z",
            "url": "https://files.pythonhosted.org/packages/df/d0/bd954cf316c1d3a605a9bc29d2cf2bbd388b82d2626b60ab92e8d18457a3/cleantext-1.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "f41366f4393aba6490e635c51936453c",
                "sha256": "854003de912406d8d821623774b307dc6f0626fd9fac0bdc5d24864ee3f37578"
            },
            "downloads": -1,
            "filename": "cleantext-1.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "f41366f4393aba6490e635c51936453c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4242,
            "upload_time": "2021-12-29T22:08:33",
            "upload_time_iso_8601": "2021-12-29T22:08:33.399531Z",
            "url": "https://files.pythonhosted.org/packages/9e/39/883774dadb46a8ea348ddbdc9dfdb9aaa1a104825e65ee9ebe9a375f46e0/cleantext-1.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2021-12-29 22:08:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "prasanthg3",
    "github_project": "cleantext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "nltk",
            "specs": [
                [
                    "~=",
                    "3.6.5"
                ]
            ]
        }
    ],
    "lcname": "cleantext"
}
        