# CleanPhi
CleanPhi is a Python library for cleaning text. It fixes Unicode errors, transliterates text to ASCII, normalizes whitespace and line breaks, and can remove punctuation or replace URLs, email addresses, phone numbers, numbers, digits, and currency symbols with placeholder tokens. All of this is exposed through a single `clean` function whose keyword arguments switch each step on or off:
```python
from CleanPhi import clean
clean(
    "some input",
    unicode=True,            # fix various Unicode errors
    to_ascii=True,           # transliterate to the closest ASCII representation
    to_lower=True,           # lowercase the text
    no_line_breaks=False,    # fully strip line breaks instead of only normalizing them
    remove_url=False,        # replace all URLs with a special token
    remove_email=False,      # replace all email addresses with a special token
    remove_ph=False,         # replace all phone numbers with a special token
    remove_nums=False,       # replace all numbers with a special token
    remove_digits=False,     # replace all digits with a special token
    remove_currency=False,   # replace all currency symbols with a special token
    remove_punct=False,      # remove punctuation
    replace_with_punct="",   # replace punctuation with this string instead of removing it
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en",               # set to 'de' for German special handling
)
```
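To make the `remove_*` / `replace_with_*` pairing concrete, here is a minimal standard-library sketch of token replacement for URLs and email addresses. It is not CleanPhi's implementation: the function name `replace_tokens_sketch` and both regular expressions are assumptions chosen for illustration, and the patterns are deliberately simplistic.

```python
import re

# Simplified patterns, assumed for illustration only; real-world URL and
# email matching needs far more careful expressions.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens_sketch(text, replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical helper: swap URLs and email addresses for placeholder tokens."""
    text = URL_RE.sub(replace_with_url, text)
    text = EMAIL_RE.sub(replace_with_email, text)
    return text


sample = "Mail bob@example.com or visit https://example.com/docs today"
print(replace_tokens_sketch(sample))
```

The placeholder strings are ordinary keyword arguments, which mirrors how `replace_with_url` and `replace_with_email` appear in the `clean` signature above.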
Pass only the arguments you need to the **clean** function. Individual helpers such as `remove_whitespace` can also be called directly:
```python
import CleanPhi
text = "Hello, world! Hello...\t \tworld?\n\nHello:\r\n\n\nWorld. "
proc_text = "Hello, world! Hello... world?\nHello:\nWorld."
assert CleanPhi.remove_whitespace(text, no_line_breaks=False) == proc_text
assert CleanPhi.remove_whitespace(" dd\nd ", no_line_breaks=True) == "dd d"
```
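The whitespace behavior asserted above can be approximated with two regular expressions. The sketch below is an assumption about one way to get the same results, not the library's actual code; `normalize_whitespace_sketch` is a hypothetical name.

```python
import re


def normalize_whitespace_sketch(text: str, no_line_breaks: bool = False) -> str:
    """Hypothetical stand-in that mimics the results shown for remove_whitespace."""
    if no_line_breaks:
        # Every run of whitespace, line breaks included, becomes one space.
        text = re.sub(r"\s+", " ", text)
    else:
        # Collapse each run of line breaks (plus adjacent whitespace) to one "\n".
        text = re.sub(r"\s*[\r\n]+\s*", "\n", text)
        # Collapse remaining runs of spaces and tabs to a single space.
        text = re.sub(r"[^\S\n]+", " ", text)
    # Drop leading and trailing whitespace.
    return text.strip()


text = "Hello, world! Hello...\t \tworld?\n\nHello:\r\n\n\nWorld. "
print(repr(normalize_whitespace_sketch(text)))
print(repr(normalize_whitespace_sketch(" dd\nd ", no_line_breaks=True)))
```

With `no_line_breaks=False`, paragraph boundaries survive as single `\n` characters; with `no_line_breaks=True`, the output is always one line.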
### Install CleanPhi (Python >= 3.6)
```powershell
pip install CleanPhi
```
### Use CleanPhi with scikit-learn
```python
from CleanPhi.scikit import PhiTransformer
cleaner = PhiTransformer(remove_punct=False, to_lower=False)
cleaner.transform(['Clean text.', 'Natural language processing!'])
```
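For readers unfamiliar with the transformer convention, the sketch below shows the `fit` / `transform` shape that scikit-learn pipelines expect, using only plain Python. The class `LowercaseTransformerSketch` is hypothetical and handles a single option; it illustrates the interface and makes no claim about how `PhiTransformer` is written.

```python
class LowercaseTransformerSketch:
    """Hypothetical, dependency-free example of the fit/transform interface."""

    def __init__(self, to_lower=True):
        # Options are stored unchanged in __init__, as the convention expects.
        self.to_lower = to_lower

    def fit(self, X, y=None):
        # Text cleaning is stateless, so there is nothing to learn from X.
        return self

    def transform(self, X):
        # Return a new list and leave the input untouched.
        return [s.lower() if self.to_lower else s for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


docs = ["Clean text.", "Natural language processing!"]
print(LowercaseTransformerSketch().fit_transform(docs))
print(LowercaseTransformerSketch(to_lower=False).transform(docs))
```

Because cleaning needs no fitted state, `fit` simply returns `self`, which is why `transform` can be called directly on a fresh instance as in the example above.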
# Version 0.2.0
- Bugs fixed
Raw data
{
"_id": null,
"home_page": "https://github.com/enginestein/Phi",
"name": "cleanphi",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Arya Praneil Pritesh",
"author_email": "aryapraneil@gmail.com",
"download_url": "",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Natural language processing framework to clean sentences and texts.",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/enginestein/Phi"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "21b1b992d859fe94d65240c38bd571de5c559b3a816743db2ff87404d515e31b",
"md5": "36f94c20beaa94e0a8afdfddbc9268ea",
"sha256": "cf47a645558bae2b3b0034c2ef6aa0875b9581e0e74d92d972b2707759aea7fc"
},
"downloads": -1,
"filename": "cleanphi-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "36f94c20beaa94e0a8afdfddbc9268ea",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 19227,
"upload_time": "2023-06-30T09:25:04",
"upload_time_iso_8601": "2023-06-30T09:25:04.245803Z",
"url": "https://files.pythonhosted.org/packages/21/b1/b992d859fe94d65240c38bd571de5c559b3a816743db2ff87404d515e31b/cleanphi-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-30 09:25:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "enginestein",
"github_project": "Phi",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "cleanphi"
}