charchef


Name: charchef
Version: 0.12
Home page: https://github.com/hansalemaos/charchef
Summary: 3 functions to normalize strings, repair bad encoding, replace non-printable characters
Upload time: 2023-03-01 05:35:24
Author: Johannes Fischer
License: MIT
Keywords: unicode, normalize, decode, encode
Requirements: No requirements were recorded.
            
### 3 functions to normalize strings, repair bad encoding, replace non-printable characters 





#### The functions use Numba under the hood, so the first call is very slow (compile time), but subsequent calls are much faster.
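To see the warm-up cost for yourself, you can time the same call several times and compare the first duration with the rest. The helper below is a minimal sketch: `time_calls` is a hypothetical name, not part of charchef, and the demo uses a stand-in workload so it runs without charchef installed.

```python
import time
from typing import Any, Callable, List


def time_calls(fn: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> List[float]:
    """Call fn(*args, **kwargs) `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; it has no compile step, so all durations will be similar.
durations = time_calls(str.upper, "charchef", repeats=3)

# With charchef installed, the same helper could wrap one of its functions,
# for example (keyword arguments taken from the README example):
# time_calls(aa_replace_non_printable_chars, str_="\x00abc",
#            functions=("substitute_allcontrols",), removex0a=False)
```

With a Numba-backed function, the first entry of `durations` would be expected to be noticeably larger than the later ones.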





##### pip install charchef



```python

from charchef import (
    aa_convert_utf8_to_ascii_,
    aa_repair_bad_conversion_to_utf8,
    aa_replace_non_printable_chars,
)

# Sample input: valid Unicode mixed with escaped bytes, HTML entities and mojibake.
text = r"""ąćęłńóśźż ĄĆĘŁŃÓŚŹ\x00Ż Junto à Estação de Carcavelos; Bragança Situado 
en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R. 
Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)
 àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy &amp; John &quot; &pound;682m 
 \u00FF\u00FF\u00F0\u00f0\x95\xFF SmÃ¶rgÃ¥s Non ti suscita niente la parola pietÃ\xa0 RosÅ½ RUF MICH ZURÃœCK.
 aqu\195\173 09. BÃ¡t NhÃ£ TÃ¢m Kinh crianÃ§a KoÃ§ University Technische UniversitÃ¤t Dresden UniversitÃ¤t 
 fÃ¼r Musik und darstellende Kunst Wien Technische UniversitÃ¤t Wien Ã\x89cole Nationale SupÃ©rieure 
 des Beaux-Arts Paris Universidad SimÃ³n BolÃ\xadvar (USB) 240         Åland Islands   2014.0    
 MARIEHAMN   11437.0 1 240         Åland Islands   2010.0      MARIEHAMN   5829.5  1 240        
 Albania         2011.0      Durrës      113249.0 240         Albania         2011.0      TIRANA 
 418495.0 240         Albania         2011.0      Durrës      56511.0 "Tutu Au Mic' – dumbéa"
    """.splitlines()

# Convert every line to plain ASCII (accents and special characters are replaced).
bigc1 = aa_convert_utf8_to_ascii_(
    str_=text,
    preprocessing_functions=(
        "8x_3_lower_case_escaped",
        "8x_3_upper_case_escaped",
        "8u_4_upper_case_escaped",
        "8u_4_lower_case_escaped",
        "8x_69_upper_case_escaped",
        "8x_69_lower_case_escaped",
        "8n_escaped",
        "8wrong_chars",
        "8zerox_unescaped_lower",
        "8zerox_unescaped_upper",
        "8html_entity",
    ),
    preprocessing_function_non_printable=(
        "substitute_allcontrols_s",
        "substitute_allcontrols",
        "substitute_allcontrols2",
        "substitute_allcontrols2_s",
        "substitute_allcontrols3",
        "substitute_allcontrols3_s",
    ),
    respect_german_letters=True,
)



# Repair badly converted text so that it becomes proper Unicode again.
bigc2 = aa_repair_bad_conversion_to_utf8(
    str_=text,
    functions=(
        "8x_3_lower_case_escaped",
        "8x_3_upper_case_escaped",
        "8u_4_upper_case_escaped",
        "8u_4_lower_case_escaped",
        "8x_69_upper_case_escaped",
        "8x_69_lower_case_escaped",
        "8n_escaped",
        "8wrong_chars",
        "8zerox_unescaped_lower",
        "8zerox_unescaped_upper",
        "8html_entity",
    ),
)



# Remove non-printable characters; the mojibake itself is left untouched.
bigc3 = aa_replace_non_printable_chars(
    str_="\x00rsi\\x00d\x00ad \x0aSimÃ³n BolÃ\xadvar",
    functions=(
        "substitute_allcontrols_s",
        "substitute_allcontrols",
        "substitute_allcontrols2",
        "substitute_allcontrols2_s",
    ),
    removex0a=False,
)

	

	

bigc1  # replaces all accents and special characters
Out[3]: 
['acelnoszz ACELNOSZZ Junto a Estacao de Carcavelos; Braganca Situado ',
 'en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet. Cartao MOBI.E R. ',
 'Conselheiro Emidio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
 ' aaaaaeaa eeeee iiiii oooooeo uuuueu yyy Suzy & John " PS682m ',
 ' yydd*y Smorgas Non ti suscita niente la parola pieti RosZ RUF MICH ZURUCK.',
 ' aqui 09. Bat Nha Tam Kinh crianca Koc University Technische Universitat Dresden Universitat ',
 ' fur Musik und darstellende Kunst Wien Technische Universitat Wien Ecole Nationale Superieure ',
 ' des Beaux-Arts Paris Universidad Simon Bolivar (USB) 240         Sland Islands   2014.0    ',
 ' MARIEHAMN   11437.0 1 240         Sland Islands   2010.0      MARIEHAMN   5829.5  1 240        ',
 ' Albania         2011.0      Durres      113249.0 240         Albania         2011.0      TIRANA ',
 ' 418495.0 240         Albania         2011.0      Durres      56511.0 "Tutu Au Mic\' - dumbea"',
 '    ']

bigc2  # repairs messed-up Unicode
Out[4]: 
['ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ Junto à Estação de Carcavelos; Bragança Situado ',
 'en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet. Cartão MOBI.E R. ',
 'Conselheiro Emídio Navarro (frente ao ISEL) R. Conselheiro Emidio Navarro (frente ao ISEL)',
 ' àáâãäåa èéêëe ìíîïi òóôõöo ùúûüu ýÿy Suzy & John " £682m ',
 ' ÿÿðð•ÿ Smörgås Non ti suscita niente la parola pietí RosŽ RUF MICH ZURÜCK.',
 ' aquí 09. Bát Nhã Tâm Kinh criança Koç University Technische Universität Dresden Universität ',
 ' für Musik und darstellende Kunst Wien Technische Universität Wien École Nationale Supérieure ',
 ' des Beaux-Arts Paris Universidad Simón Bolívar (USB) 240         Šland Islands   2014.0    ',
 ' MARIEHAMN   11437.0 1 240         Šland Islands   2010.0      MARIEHAMN   5829.5  1 240        ',
 ' Albania         2011.0      Durrës      113249.0 240         Albania         2011.0      TIRANA ',
 ' 418495.0 240         Albania         2011.0      Durrës      56511.0 "Tutu Au Mic\' – dumbéa"',
 '    ']

bigc3  # removes non-printable characters
Out[5]: ['rsidad SimÃ³n BolÃ\xadvar']

```
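For readers unfamiliar with the kinds of damage shown above, the following standard-library sketch illustrates the underlying ideas. It is not charchef's implementation, and the helper names (`repair_mojibake`, `strip_accents`, `drop_control_chars`) are made up for illustration. It assumes the mojibake comes from UTF-8 bytes that were decoded as cp1252, which is only one of several possible causes.

```python
import html
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as cp1252 (assumed cause)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean cp1252/UTF-8 mix-up, so leave the text untouched.
        return text


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_control_chars(text: str) -> str:
    """Remove characters in Unicode category Cc (control characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


repaired = repair_mojibake("SmÃ¶rgÃ¥s")                           # -> 'Smörgås'
unescaped = html.unescape("Suzy &amp; John &quot; &pound;682m")  # -> 'Suzy & John " £682m'
ascii_only = strip_accents("Estação de Carcavelos")              # -> 'Estacao de Carcavelos'
cleaned = drop_control_chars("\x00rsi\x00dad")                   # -> 'rsidad'
```

This sketch is deliberately narrower than the output shown above: for example, `strip_accents` leaves `ł` unchanged because it has no Unicode decomposition, whereas the `bigc1` result maps it to `l`, and escaped sequences such as a literal `\x00` or `\u00FF` in the text are not handled at all.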

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hansalemaos/charchef",
    "name": "charchef",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "unicode,normalize,decode,encode",
    "author": "Johannes Fischer",
    "author_email": "<aulasparticularesdealemaosp@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/7c/d1/81dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14/charchef-0.12.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "3 functions to normalize strings, repair bad encoding, replace non-printable characters",
    "version": "0.12",
    "split_keywords": [
        "unicode",
        "normalize",
        "decode",
        "encode"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "582f7a0963be7d291195cf12bdcfaa5afd9a03533d2e7473a9f7ef9e6ac621ae",
                "md5": "65effffe7d78b0971e66d56ec4f911cb",
                "sha256": "cf099de64396704575654cfe79ddd29436078ef2fd32076f13bfbffd132a6b3d"
            },
            "downloads": -1,
            "filename": "charchef-0.12-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "65effffe7d78b0971e66d56ec4f911cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 218078,
            "upload_time": "2023-03-01T05:35:21",
            "upload_time_iso_8601": "2023-03-01T05:35:21.937166Z",
            "url": "https://files.pythonhosted.org/packages/58/2f/7a0963be7d291195cf12bdcfaa5afd9a03533d2e7473a9f7ef9e6ac621ae/charchef-0.12-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7cd181dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14",
                "md5": "7b5f8fb92df755b4f798af06ee2b7733",
                "sha256": "d1b83d836d586f6383c7ea2e3578e3b2c97b6060db6df5595070d17c5f908018"
            },
            "downloads": -1,
            "filename": "charchef-0.12.tar.gz",
            "has_sig": false,
            "md5_digest": "7b5f8fb92df755b4f798af06ee2b7733",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 213578,
            "upload_time": "2023-03-01T05:35:24",
            "upload_time_iso_8601": "2023-03-01T05:35:24.258589Z",
            "url": "https://files.pythonhosted.org/packages/7c/d1/81dd6fde1e5ef23b90b77a248b774469b4ccf715568a42841db9bfe8db14/charchef-0.12.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-01 05:35:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "hansalemaos",
    "github_project": "charchef",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "charchef"
}
        