### 3 functions to normalize strings, repair bad encoding, replace non-printable characters
#### The functions use Numba under the hood, so the first run is very slow (compile time), but subsequent runs are much faster.
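To see the compile-time cost for yourself, you could time the first call against later calls. The helper below is a minimal sketch using only the standard library; `time_calls` is a hypothetical name and not part of charchef, and the workload in the demo is a stand-in.

```python
import time
from typing import Any, Callable, List


def time_calls(fn: Callable[..., Any], *args: Any, repeats: int = 2, **kwargs: Any) -> List[float]:
    """Call fn `repeats` times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; with charchef installed, one of its functions
    # could be passed here instead to compare the first and later calls.
    timings = time_calls(sorted, list(range(100_000)), repeats=3)
    for i, seconds in enumerate(timings, start=1):
        print(f"call {i}: {seconds:.6f} s")
```

With a JIT-compiled function, the first entry in the returned list would be expected to be much larger than the rest.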
##### pip install charchef
```python
from charchef import aa_convert_utf8_to_ascii_,aa_repair_bad_conversion_to_utf8,aa_replace_non_printable_chars
text = r"""ąćęłńóśźż ĄĆĘŁŃÓŚŹ\x00Ż Junto à Estação de Carcavelos; Bragança Situado
en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R.
Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)
àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy & John " £682m
\u00FF\u00FF\u00F0\u00f0\x95\xFF SmörgÃ¥s Non ti suscita niente la parola pietÃ\xa0 RosŽ RUF MICH ZURÃœCK.
aqu\195\173 09. Bát Nhã Tâm Kinh criança Koç University Technische Universität Dresden Universität
für Musik und darstellende Kunst Wien Technische Universität Wien Ã\x89cole Nationale Supérieure
des Beaux-Arts Paris Universidad Simón BolÃ\xadvar (USB) 240 Åland Islands 2014.0
MARIEHAMN 11437.0 1 240 Åland Islands 2010.0 MARIEHAMN 5829.5 1 240
Albania 2011.0 Durrës 113249.0 240 Albania 2011.0 TIRANA
418495.0 240 Albania 2011.0 Durrës 56511.0 "Tutu Au Mic' – dumbéa"
""".splitlines()
bigc1 = aa_convert_utf8_to_ascii_(
str_=text,
preprocessing_functions=(
"8x_3_lower_case_escaped",
"8x_3_upper_case_escaped",
"8u_4_upper_case_escaped",
"8u_4_lower_case_escaped",
"8x_69_upper_case_escaped",
"8x_69_lower_case_escaped",
"8n_escaped",
"8wrong_chars",
"8zerox_unescaped_lower",
"8zerox_unescaped_upper",
"8html_entity",
),
preprocessing_function_non_printable=(
"substitute_allcontrols_s",
"substitute_allcontrols",
"substitute_allcontrols2",
"substitute_allcontrols2_s",
"substitute_allcontrols3",
"substitute_allcontrols3_s",
),
respect_german_letters=True,
)
bigc2 = aa_repair_bad_conversion_to_utf8(
str_=text,
functions=(
"8x_3_lower_case_escaped",
"8x_3_upper_case_escaped",
"8u_4_upper_case_escaped",
"8u_4_lower_case_escaped",
"8x_69_upper_case_escaped",
"8x_69_lower_case_escaped",
"8n_escaped",
"8wrong_chars",
"8zerox_unescaped_lower",
"8zerox_unescaped_upper",
"8html_entity",
),
)
bigc3 = aa_replace_non_printable_chars(
str_="\x00rsi\\x00d\x00ad \x0aSimón BolÃ\xadvar",
functions=(
"substitute_allcontrols_s",
"substitute_allcontrols",
"substitute_allcontrols2",
"substitute_allcontrols2_s",
),
removex0a=False,
)
bigc1 # replaces all accents, special characters ...
Out[3]:
['acelnoszz ACELNOSZZ Junto a Estacao de Carcavelos; Braganca Situado ',
'en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet. Cartao MOBI.E R. ',
'Conselheiro Emidio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
' aaaaaeaa eeeee iiiii oooooeo uuuueu yyy Suzy & John " PS682m ',
' yydd*y Smorgas Non ti suscita niente la parola pieti RosZ RUF MICH ZURUCK.',
' aqui 09. Bat Nha Tam Kinh crianca Koc University Technische Universitat Dresden Universitat ',
' fur Musik und darstellende Kunst Wien Technische Universitat Wien Ecole Nationale Superieure ',
' des Beaux-Arts Paris Universidad Simon Bolivar (USB) 240 Sland Islands 2014.0 ',
' MARIEHAMN 11437.0 1 240 Sland Islands 2010.0 MARIEHAMN 5829.5 1 240 ',
' Albania 2011.0 Durres 113249.0 240 Albania 2011.0 TIRANA ',
' 418495.0 240 Albania 2011.0 Durres 56511.0 "Tutu Au Mic\' - dumbea"',
' ']
bigc2 # Repairs messed up Unicode
Out[4]:
['ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ Junto à Estação de Carcavelos; Bragança Situado ',
'en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R. ',
'Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
' àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy & John " £682m ',
' ÿÿðð•ÿ Smörgås Non ti suscita niente la parola pietí RosŽ RUF MICH ZURÜCK.',
' aquí 09. Bát Nhã Tâm Kinh criança Koç University Technische Universität Dresden Universität ',
' für Musik und darstellende Kunst Wien Technische Universität Wien École Nationale Supérieure ',
' des Beaux-Arts Paris Universidad Simón Bolívar (USB) 240 Šland Islands 2014.0 ',
' MARIEHAMN 11437.0 1 240 Šland Islands 2010.0 MARIEHAMN 5829.5 1 240 ',
' Albania 2011.0 Durrës 113249.0 240 Albania 2011.0 TIRANA ',
' 418495.0 240 Albania 2011.0 Durrës 56511.0 "Tutu Au Mic\' – dumbéa"',
' ']
bigc3 # Removes non-printable characters
Out[5]: ['rsidad Simón BolÃ\xadvar']
```
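For a rough idea of what the three operations mean, here is a minimal sketch using only Python's standard library. It is not charchef's implementation, and the helper names (`fold_to_ascii`, `repair_mojibake`, `strip_control_chars`) are hypothetical; they only approximate the general techniques of accent folding, undoing a wrong Latin-1 decode, and dropping control characters.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Drop accents by decomposing characters (NFKD) and discarding non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Returns the input unchanged when the round trip is not possible.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def strip_control_chars(text: str) -> str:
    """Remove characters whose Unicode category is Cc (control characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


if __name__ == "__main__":
    print(fold_to_ascii("Estação de Carcavelos"))
    print(repair_mojibake("SmÃ¶rgÃ¥s"))
    print(strip_control_chars("\x00Sim\x00ón"))
```

This sketch is far cruder than the package output shown above: NFKD folding silently drops letters without a decomposition (such as `ł`), the mojibake repair only covers the Latin-1 case, and `strip_control_chars` also removes newlines and tabs.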
Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/charchef",
"name": "charchef",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "unicode,normalize,decode,encode",
"author": "Johannes Fischer",
"author_email": "<aulasparticularesdealemaosp@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/7c/d1/81dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14/charchef-0.12.tar.gz",
"platform": null,
"description": "\n### 3 functions to normalize strings, repair bad encoding, replace non-printable characters \n\n\n\n\n\n#### The function use numba under the hood - that means the first run is very slow, (compile time), but then the speed-up is tremendous.\n\n\n\n\n\n##### pip install charchef\n\n\n\n```python\n\nfrom charchef import aa_convert_utf8_to_ascii_,aa_repair_bad_conversion_to_utf8,aa_replace_non_printable_chars\n\ntext = r\"\"\"\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c \u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\\x00\u017b Junto \u00e0 Esta\u00e7\u00e3o de Carcavelos; Bragan\u00e7a Situado \n\nen el n\u00facleo de Es Cal\u00f3 de Sant Agust\u00ed frente al Hostal Rafalet. Cart\u00e3o MOBI.E R. \n\nConselheiro Em\u00eddio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)\n\n \u00e0\u00e1\u00e2\u00e3\u00e4\u00e5a \u00e8\u00e9\u00ea\u00ebe \u00ec\u00ed\u00ee\u00efi \u00f2\u00f3\u00f4\u00f5\u00f6o \u00f9\u00fa\u00fb\u00fcu \u00fd\u00ffy Suzy & John " £682m \n\n \\u00FF\\u00FF\\u00F0\\u00f0\\x95\\xFF Sm\u00c3\u00b6rg\u00c3\u00a5s Non ti suscita niente la parola piet\u00c3\\xa0 Ros\u00c5\u00bd RUF MICH ZUR\u00c3\u0153CK.\n\n aqu\\195\\173 09. 
B\u00c3\u00a1t Nh\u00c3\u00a3 T\u00c3\u00a2m Kinh crian\u00c3\u00a7a Ko\u00c3\u00a7 University Technische Universit\u00c3\u00a4t Dresden Universit\u00c3\u00a4t \n\n f\u00c3\u00bcr Musik und darstellende Kunst Wien Technische Universit\u00c3\u00a4t Wien \u00c3\\x89cole Nationale Sup\u00c3\u00a9rieure \n\n des Beaux-Arts Paris Universidad Sim\u00c3\u00b3n Bol\u00c3\\xadvar (USB) 240 \u00c5land Islands 2014.0 \n\n MARIEHAMN 11437.0 1 240 \u00c5land Islands 2010.0 MARIEHAMN 5829.5 1 240 \n\n Albania 2011.0 Durr\u00ebs 113249.0 240 Albania 2011.0 TIRANA \n\n 418495.0 240 Albania 2011.0 Durr\u00ebs 56511.0 \"Tutu Au Mic' \u2013 dumb\u00e9a\"\n\n \"\"\".splitlines()\n\nbigc1 = aa_convert_utf8_to_ascii_(\n\n str_=text,\n\n preprocessing_functions=(\n\n \"8x_3_lower_case_escaped\",\n\n \"8x_3_upper_case_escaped\",\n\n \"8u_4_upper_case_escaped\",\n\n \"8u_4_lower_case_escaped\",\n\n \"8x_69_upper_case_escaped\",\n\n \"8x_69_lower_case_escaped\",\n\n \"8n_escaped\",\n\n \"8wrong_chars\",\n\n \"8zerox_unescaped_lower\",\n\n \"8zerox_unescaped_upper\",\n\n \"8html_entity\",\n\n ),\n\n preprocessing_function_non_printable=(\n\n \"substitute_allcontrols_s\",\n\n \"substitute_allcontrols\",\n\n \"substitute_allcontrols2\",\n\n \"substitute_allcontrols2_s\",\n\n \"substitute_allcontrols3\",\n\n \"substitute_allcontrols3_s\",\n\n ),\n\n respect_german_letters=True,\n\n )\n\n\n\nbigc2 = aa_repair_bad_conversion_to_utf8(\n\n str_=text,\n\n functions=(\n\n \"8x_3_lower_case_escaped\",\n\n \"8x_3_upper_case_escaped\",\n\n \"8u_4_upper_case_escaped\",\n\n \"8u_4_lower_case_escaped\",\n\n \"8x_69_upper_case_escaped\",\n\n \"8x_69_lower_case_escaped\",\n\n \"8n_escaped\",\n\n \"8wrong_chars\",\n\n \"8zerox_unescaped_lower\",\n\n \"8zerox_unescaped_upper\",\n\n \"8html_entity\",\n\n ),\n\n )\n\n\n\nbigc3 = aa_replace_non_printable_chars(\n\n str_=\"\\x00rsi\\\\x00d\\x00ad \\x0aSim\u00c3\u00b3n Bol\u00c3\\xadvar\",\n\n functions=(\n\n \"substitute_allcontrols_s\",\n\n 
\"substitute_allcontrols\",\n\n \"substitute_allcontrols2\",\n\n \"substitute_allcontrols2_s\",\n\n ),\n\n removex0a=False,\n\n )\n\n\t\n\n\t\n\nbigc1 # replaces all accents, special characters ... \n\nOut[3]: \n\n['acelnoszz ACELNOSZZ Junto a Estacao de Carcavelos; Braganca Situado ',\n\n 'en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet. Cartao MOBI.E R. ',\n\n 'Conselheiro Emidio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',\n\n ' aaaaaeaa eeeee iiiii oooooeo uuuueu yyy Suzy & John \" PS682m ',\n\n ' yydd*y Smorgas Non ti suscita niente la parola pieti RosZ RUF MICH ZURUCK.',\n\n ' aqui 09. Bat Nha Tam Kinh crianca Koc University Technische Universitat Dresden Universitat ',\n\n ' fur Musik und darstellende Kunst Wien Technische Universitat Wien Ecole Nationale Superieure ',\n\n ' des Beaux-Arts Paris Universidad Simon Bolivar (USB) 240 Sland Islands 2014.0 ',\n\n ' MARIEHAMN 11437.0 1 240 Sland Islands 2010.0 MARIEHAMN 5829.5 1 240 ',\n\n ' Albania 2011.0 Durres 113249.0 240 Albania 2011.0 TIRANA ',\n\n ' 418495.0 240 Albania 2011.0 Durres 56511.0 \"Tutu Au Mic\\' - dumbea\"',\n\n ' ']\n\n \n\n \n\nbigc2 # Repairs messed up Unicode\n\nOut[4]: \n\n['\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c \u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b Junto \u00e0 Esta\u00e7\u00e3o de Carcavelos; Bragan\u00e7a Situado ',\n\n 'en el n\u00facleo de Es Cal\u00f3 de Sant Agust\u00ed frente al Hostal Rafalet. Cart\u00e3o MOBI.E R. ',\n\n 'Conselheiro Em\u00eddio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',\n\n ' \u00e0\u00e1\u00e2\u00e3\u00e4\u00e5a \u00e8\u00e9\u00ea\u00ebe \u00ec\u00ed\u00ee\u00efi \u00f2\u00f3\u00f4\u00f5\u00f6o \u00f9\u00fa\u00fb\u00fcu \u00fd\u00ffy Suzy & John \" \u00a3682m ',\n\n ' \u00ff\u00ff\u00f0\u00f0\u2022\u00ff Sm\u00f6rg\u00e5s Non ti suscita niente la parola piet\u00ed Ros\u017d RUF MICH ZUR\u00dcCK.',\n\n ' aqu\u00ed 09. 
B\u00e1t Nh\u00e3 T\u00e2m Kinh crian\u00e7a Ko\u00e7 University Technische Universit\u00e4t Dresden Universit\u00e4t ',\n\n ' f\u00fcr Musik und darstellende Kunst Wien Technische Universit\u00e4t Wien \u00c9cole Nationale Sup\u00e9rieure ',\n\n ' des Beaux-Arts Paris Universidad Sim\u00f3n Bol\u00edvar (USB) 240 \u0160land Islands 2014.0 ',\n\n ' MARIEHAMN 11437.0 1 240 \u0160land Islands 2010.0 MARIEHAMN 5829.5 1 240 ',\n\n ' Albania 2011.0 Durr\u00ebs 113249.0 240 Albania 2011.0 TIRANA ',\n\n ' 418495.0 240 Albania 2011.0 Durr\u00ebs 56511.0 \"Tutu Au Mic\\' \u2013 dumb\u00e9a\"',\n\n ' ']\n\n \n\n \n\nbigc3 # Removes non-printable characters\n\nOut[5]: ['rsidad Sim\u00c3\u00b3n Bol\u00c3\\xadvar']\n\n```\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "3 functions to normalize strings, repair bad encoding, replace non-printable characters",
"version": "0.12",
"split_keywords": [
"unicode",
"normalize",
"decode",
"encode"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "582f7a0963be7d291195cf12bdcfaa5afd9a03533d2e7473a9f7ef9e6ac621ae",
"md5": "65effffe7d78b0971e66d56ec4f911cb",
"sha256": "cf099de64396704575654cfe79ddd29436078ef2fd32076f13bfbffd132a6b3d"
},
"downloads": -1,
"filename": "charchef-0.12-py3-none-any.whl",
"has_sig": false,
"md5_digest": "65effffe7d78b0971e66d56ec4f911cb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 218078,
"upload_time": "2023-03-01T05:35:21",
"upload_time_iso_8601": "2023-03-01T05:35:21.937166Z",
"url": "https://files.pythonhosted.org/packages/58/2f/7a0963be7d291195cf12bdcfaa5afd9a03533d2e7473a9f7ef9e6ac621ae/charchef-0.12-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7cd181dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14",
"md5": "7b5f8fb92df755b4f798af06ee2b7733",
"sha256": "d1b83d836d586f6383c7ea2e3578e3b2c97b6060db6df5595070d17c5f908018"
},
"downloads": -1,
"filename": "charchef-0.12.tar.gz",
"has_sig": false,
"md5_digest": "7b5f8fb92df755b4f798af06ee2b7733",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 213578,
"upload_time": "2023-03-01T05:35:24",
"upload_time_iso_8601": "2023-03-01T05:35:24.258589Z",
"url": "https://files.pythonhosted.org/packages/7c/d1/81dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14/charchef-0.12.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-01 05:35:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "hansalemaos",
"github_project": "charchef",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "charchef"
}