cleanphi


| Field | Value |
| --- | --- |
| Name | cleanphi |
| Version | 0.2.0 |
| home_page | https://github.com/enginestein/Phi |
| Summary | Natural language processing framework to clean sentences and texts. |
| upload_time | 2023-06-30 09:25:04 |
| maintainer | |
| docs_url | None |
| author | Arya Praneil Pritesh |
| requires_python | >=3.6 |
| license | |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# CleanPhi

CleanPhi is a Python framework for cleaning text in natural language processing workflows. It removes or normalizes unwanted elements such as extraneous characters and malformed Unicode, and it exposes these steps as options of a single `clean` function so they can be combined as needed.

```python
from CleanPhi import clean

clean("some input",
    unicode=True,                         # fix various Unicode errors
    to_ascii=True,                        # transliterate to the closest ASCII representation
    to_lower=True,                        # lowercase the text
    no_line_breaks=False,                 # fully strip line breaks instead of only normalizing them
    remove_url=False,                     # replace all URLs with a special token
    remove_email=False,                   # replace all email addresses with a special token
    remove_ph=False,                      # replace all phone numbers with a special token
    remove_nums=False,                    # replace all numbers with a special token
    remove_digits=False,                  # replace all digits with a special token
    remove_currency=False,                # replace all currency symbols with a special token
    remove_punct=False,                   # remove punctuation
    replace_with_punct="",                # instead of removing punctuation, replace it with this string
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                             # set to 'de' for German special handling
)
```
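
To make the pairing between the `remove_*` flags and the `replace_with_*` tokens concrete, here is a minimal standalone sketch using only the standard library. It is not CleanPhi's implementation: the function `replace_tokens` and both regular expressions are hypothetical and deliberately simplistic, and only the keyword names are borrowed from the call above.

```python
import re

# Deliberately simple patterns for illustration only (assumptions, not
# CleanPhi's own expressions); robust URL/email matching is much harder.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(text, remove_url=False, remove_email=False,
                   replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical stand-in: each flag enables one substitution,
    and the matching replace_with_* argument supplies the token."""
    if remove_url:
        text = URL_RE.sub(replace_with_url, text)
    if remove_email:
        text = EMAIL_RE.sub(replace_with_email, text)
    return text


sample = "Mail me at jane@example.com or visit https://example.com/docs today"
print(replace_tokens(sample, remove_url=True, remove_email=True))
# prints: Mail me at <EMAIL> or visit <URL> today
```

With both flags left at `False`, the sketch returns the text unchanged, which mirrors the defaults shown in the call above.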

Choose the arguments you need and call the **clean** function in your code. Individual helpers such as `remove_whitespace` can also be called directly:

```python
import CleanPhi
text = "Hello, world!  Hello...\t \tworld?\n\nHello:\r\n\n\nWorld. "
proc_text = "Hello, world! Hello... world?\nHello:\nWorld."
assert CleanPhi.remove_whitespace(text, no_line_breaks=False) == proc_text
assert CleanPhi.remove_whitespace(" dd\nd  ", no_line_breaks=True) == "dd d"
```
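
The behavior asserted above can be reproduced with a few lines of standard-library code, which may help when reasoning about what `no_line_breaks` changes. This is a sketch under assumptions, not CleanPhi's source: the name `normalize_whitespace` and the two patterns are hypothetical.

```python
import re

INLINE_WS = re.compile(r"[ \t\f\v]+")  # runs of spaces/tabs within a line
ANY_WS = re.compile(r"\s+")            # any whitespace, line breaks included


def normalize_whitespace(text, no_line_breaks=False):
    """Hypothetical re-implementation of the behavior asserted above."""
    if no_line_breaks:
        # Collapse everything, including line breaks, to single spaces.
        return ANY_WS.sub(" ", text).strip()
    # Otherwise clean each line separately and drop lines that end up empty,
    # so repeated line breaks collapse to a single "\n".
    lines = (INLINE_WS.sub(" ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


text = "Hello, world!  Hello...\t \tworld?\n\nHello:\r\n\n\nWorld. "
print(repr(normalize_whitespace(text)))
# prints: 'Hello, world! Hello... world?\nHello:\nWorld.'
```

Using `str.splitlines` means `\r\n` is treated as a single line break, so Windows-style and Unix-style input normalize to the same result.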

### Install CleanPhi (requires Python >= 3.6)

```powershell
pip install CleanPhi
```

### Use CleanPhi with scikit-learn

```python
from CleanPhi.scikit import PhiTransformer

cleaner = PhiTransformer(remove_punct=False, to_lower=False)
cleaner.transform(['Clean text.', 'Natural language processing!'])
```
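
For readers unfamiliar with the transformer convention, the sketch below shows the `fit`/`transform` shape that such a class typically follows. It is a hypothetical stand-in written without scikit-learn, not the code of `PhiTransformer`: the class name `SketchTransformer` and its cleaning logic are assumptions, and only the `to_lower` and `remove_punct` keyword names come from the example above.

```python
import string


class SketchTransformer:
    """Hypothetical duck-typed transformer with fit/transform methods."""

    def __init__(self, to_lower=True, remove_punct=False):
        self.to_lower = to_lower
        self.remove_punct = remove_punct

    def fit(self, X, y=None):
        # Text cleaning is stateless, so there is nothing to learn.
        # Returning self allows chaining, e.g. fit(X).transform(X).
        return self

    def transform(self, X):
        # Apply the cleaning step to every document in the iterable.
        return [self._clean_one(text) for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def _clean_one(self, text):
        if self.to_lower:
            text = text.lower()
        if self.remove_punct:
            # Drop ASCII punctuation characters only.
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text


docs = ["Clean text.", "Natural language processing!"]
print(SketchTransformer(to_lower=True, remove_punct=True).transform(docs))
# prints: ['clean text', 'natural language processing']
```

With both options set to `False`, as in the `PhiTransformer` call above, this sketch returns the input strings unchanged.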

# Version 0.2.0 

- Bugs fixed

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/enginestein/Phi",
    "name": "cleanphi",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Arya Praneil Pritesh",
    "author_email": "aryapraneil@gmail.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Natural language processing framework to clean sentences and texts.",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/enginestein/Phi"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "21b1b992d859fe94d65240c38bd571de5c559b3a816743db2ff87404d515e31b",
                "md5": "36f94c20beaa94e0a8afdfddbc9268ea",
                "sha256": "cf47a645558bae2b3b0034c2ef6aa0875b9581e0e74d92d972b2707759aea7fc"
            },
            "downloads": -1,
            "filename": "cleanphi-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "36f94c20beaa94e0a8afdfddbc9268ea",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 19227,
            "upload_time": "2023-06-30T09:25:04",
            "upload_time_iso_8601": "2023-06-30T09:25:04.245803Z",
            "url": "https://files.pythonhosted.org/packages/21/b1/b992d859fe94d65240c38bd571de5c559b3a816743db2ff87404d515e31b/cleanphi-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-30 09:25:04",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "enginestein",
    "github_project": "Phi",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cleanphi"
}
        