# textnorm
A simple package for normalizing whitespace and Unicode composition forms in Python 3 strings.
The package provides two functions, as follows. Extended use examples may be found in ```tests/tests.py```.
## normalize_space
This function takes a Python 3 string argument (```v```) and returns a string in which each continuous sequence of one or more whitespace characters found in ```v``` has been collapsed into a single space character. By default, leading and trailing whitespace is removed.
Basic usage is as simple as:
```python
from textnorm import normalize_space
s = ' There was an\tOld \tMan in a tree,\t\t'
n = normalize_space(s)
print('"{}"'.format(n))
```
which produces output like:
```
"There was an Old Man in a tree,"
```
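For intuition only, the default behavior shown above resembles a common standard-library idiom. The sketch below is not textnorm's implementation, and the name `collapse_whitespace` is hypothetical:

```python
def collapse_whitespace(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    # str.split() with no arguments splits on runs of whitespace and
    # discards empty strings at the start and end.
    return ' '.join(text.split())


s = ' There was an\tOld \tMan in a tree,\t\t'
print('"{}"'.format(collapse_whitespace(s)))
```

The idiom has no counterpart to the ```preserve``` and ```trim``` arguments described below.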
By default, characters like newlines (```\n```) are treated like any other whitespace character, so code like this:
```python
s = """I’m now arrived—thanks to the gods!—
Thro’ pathways rough and muddy,
A certain sign that makin roads
Is no this people’s study: """
print('"{}"'.format(normalize_space(s)))
```
yields a result like this:
```
"I’m now arrived—thanks to the gods!— Thro’ pathways rough and muddy, A certain sign that makin roads Is no this people’s study:"
```
An optional keyword argument (```preserve```) may be used to designate a list of one or more whitespace characters that are to be left untouched. So, it is possible to preserve the newlines in the string from the preceding example while normalizing the rest of the whitespace:
```python
print('"{}"'.format(normalize_space(s, preserve=['\n'])))
```
which produces:
```
"I’m now arrived—thanks to the gods!—
Thro’ pathways rough and muddy,
A certain sign that makin roads
Is no this people’s study:"
```
Another optional keyword argument (```trim```) adjusts the handling of whitespace at the beginning and end of the input string. Leading and trailing characters listed in the ```preserve``` argument are always protected; otherwise, ```trim=True``` (the default) ensures a result with no leading or trailing whitespace. If ```trim``` is set to ```False``` and the input string has leading or trailing whitespace, each such run is replaced in the result by either a single space character **or** the sequence of preserved whitespace characters copied from the original. For example:
```python
s = '\t\n orange '
print('"{}"'.format(normalize_space(s, trim=False)))
```
produces
```
" orange "
```
and
```python
s = '\t\n orange '
print('"{}"'.format(normalize_space(s, preserve=['\n'], trim=False)))
```
produces
```
"
orange "
```
but
```python
s = '\t\n orange '
print('"{}"'.format(normalize_space(s, trim=True))) # default
```
produces
```
"orange"
```
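The interaction of ```preserve``` and ```trim``` can be approximated with a regular expression. This is a minimal sketch under the assumption that the behavior is exactly as the examples in this section show; it is not textnorm's code, and the name `collapse_preserving` is hypothetical:

```python
import re


def collapse_preserving(text, preserve=(), trim=True):
    """Hypothetical re-creation of the documented preserve/trim behavior."""
    keep = set(preserve)

    def squash(match):
        run = match.group(0)
        # Preserved characters survive; other whitespace in the run is dropped.
        kept = ''.join(ch for ch in run if ch in keep)
        if kept:
            return kept
        # A run without preserved characters becomes one space, unless it
        # touches either end of the string and trimming is requested.
        at_edge = match.start() == 0 or match.end() == len(text)
        return '' if (trim and at_edge) else ' '

    return re.sub(r'\s+', squash, text)


s = '\t\n orange '
print(repr(collapse_preserving(s, trim=False)))
print(repr(collapse_preserving(s, preserve=['\n'], trim=False)))
print(repr(collapse_preserving(s)))
```

Using `repr` rather than quotation marks makes the preserved newline visible as `\n` in the printed result.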
## normalize_unicode
The second function wraps ```unicodedata.normalize``` from the standard library and adds some minor functionality. Its primary purpose is the same as that of ```unicodedata.normalize```: to return the specified normal form ('NFC', 'NFD', 'NFKC', or 'NFKD') of the Unicode string passed to the function.
Explaining Unicode normalization is beyond the scope of this readme file; however, I can recommend the following for additional reading:
- [J.K. Tauber's article on "Python, Unicode and Ancient Greek."](https://jktauber.com/articles/python-unicode-ancient-greek/)
- [_The Python Standard Library_, Section 6.5 "unicodedata — Unicode Database"](https://docs.python.org/3.6/library/unicodedata.html)
- ["Unicode Equivalence" in _Wikipedia_](https://en.wikipedia.org/wiki/Unicode_equivalence)
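Without attempting a full explanation, a quick way to see that the four forms differ is to compare string lengths after normalizing with the standard library directly. The sample string here is an illustrative choice, not one taken from the textnorm tests:

```python
import unicodedata

# Precomposed e-acute (U+00E9) followed by the "fi" ligature (U+FB01).
sample = '\u00e9\ufb01'

lengths = {
    form: len(unicodedata.normalize(form, sample))
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
}
print(lengths)
```

The decomposed forms split the accent from its base letter, and the compatibility forms additionally replace the ligature with the two letters "f" and "i".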
Basic usage of ```textnorm.normalize_unicode``` looks like this:
```python
from textnorm import normalize_unicode
s_nfc = 'μ\u03adγα βιβλ\u03afον μ\u03adγα κακ\u03ccν' # NB: "composed" forms of accented characters
n_nfd = normalize_unicode(s_nfc, 'NFD')
n_nfc = normalize_unicode(n_nfd, 'NFC')
print(s_nfc == n_nfd)
print(s_nfc == n_nfc)
print('original NFC: "{}"'.format(s_nfc))
print('normalized NFD: "{}"'.format(n_nfd))
print('round-tripped NFC: "{}"'.format(n_nfc))
```
which produces output like this:
```
False
True
original NFC: "μέγα βιβλίον μέγα κακόν"
normalized NFD: "μέγα βιβλίον μέγα κακόν"
round-tripped NFC: "μέγα βιβλίον μέγα κακόν"
```
Even though modern software and fonts are pretty good at making the NFC and NFD forms appear identical, if you examine the underlying encoding you can see that the differences are real.
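One way to examine the underlying code points uses only the standard library. The helper name `code_points` is hypothetical:

```python
import unicodedata

composed = '\u03bc\u03ad\u03b3\u03b1'  # "mega" with a precomposed accented epsilon
decomposed = unicodedata.normalize('NFD', composed)


def code_points(text):
    """List each character of text in U+XXXX notation."""
    return ' '.join('U+{:04X}'.format(ord(ch)) for ch in text)


print(len(composed), code_points(composed))
print(len(decomposed), code_points(decomposed))
```

The decomposed string is one character longer because the accented epsilon (U+03AD) becomes a plain epsilon (U+03B5) followed by a combining acute accent (U+0301).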
### added functionality: compatibility checking
Beyond the functionality provided by ```unicodedata.normalize```, ```textnorm.normalize_unicode``` accepts a ```check_compatible``` argument that, if ```True```, triggers a comparison of the "canonical" and "compatibility" forms of the string (i.e., it compares 'NFC' with 'NFKC' and 'NFD' with 'NFKD'). If the two forms differ, the function raises ```ValueError``` with a diagnostic message. A calling program that catches this exception can then implement double-checking or supervision.
The lunate sigma ('Ϲ' == '\u03f9') provides a good way to demonstrate this behavior since the canonical forms (NFC and NFD) preserve the character, but the "compatibility" forms (NFKC and NFKD) convert it to conventional sigma ('Σ' == '\u03a3'). First, we'll conduct the conversion without a test:
```python
s = '\u03f9υρβανή' # i.e. Ϲυρβανή
n = normalize_unicode(s, 'NFKC')
print(n)
```
and we get
```
Συρβανή
```
But, if we activate the test:
```python
s = '\u03f9υρβανή' # i.e. Ϲυρβανή
n = normalize_unicode(s, 'NFKC', check_compatible=True)
```
we'll be treated to the informative traceback:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/paregorios/Documents/files/T/textnorm/textnorm/__init__.py", line 93, in normalize_unicode
    raise ValueError(msg)
ValueError: Unicode normalization may have changed the string "Ϲυρβανή" in an undesireable way or may have failed to do so in a manner desired. The NFKC normalized form "Συρβανή" (b'\\N{GREEK CAPITAL LETTER SIGMA}\\N{GREEK SMALL LETTER UPSILON}\\N{GREEK SMALL LETTER RHO}\\N{GREEK SMALL LETTER BETA}\\N{GREEK SMALL LETTER ALPHA}\\N{GREEK SMALL LETTER NU}\\N{GREEK SMALL LETTER ETA WITH TONOS}') does not match the corresponding NFC form "Ϲυρβανή" (b'\\N{GREEK CAPITAL LUNATE SIGMA SYMBOL}\\N{GREEK SMALL LETTER UPSILON}\\N{GREEK SMALL LETTER RHO}\\N{GREEK SMALL LETTER BETA}\\N{GREEK SMALL LETTER ALPHA}\\N{GREEK SMALL LETTER NU}\\N{GREEK SMALL LETTER ETA WITH TONOS}').
```
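The following standard-library sketch shows the kind of comparison involved and one way a caller might respond to the exception. It is an approximation under stated assumptions, not textnorm's implementation: the names `COUNTERPART`, `describe`, `checked_normalize`, and `normalize_or_flag` are hypothetical, and the error message is simplified.

```python
import unicodedata

# Hypothetical mapping: each form paired with its canonical or
# compatibility counterpart.
COUNTERPART = {'NFC': 'NFKC', 'NFKC': 'NFC', 'NFD': 'NFKD', 'NFKD': 'NFD'}


def describe(text):
    """Spell out non-ASCII characters by Unicode name, as bytes."""
    return text.encode('ascii', 'namereplace')


def checked_normalize(text, form):
    """Normalize text, raising ValueError if the counterpart form differs."""
    result = unicodedata.normalize(form, text)
    other_form = COUNTERPART[form]
    other = unicodedata.normalize(other_form, text)
    if result != other:
        raise ValueError('{} form {!r} does not match {} form {!r}'.format(
            form, describe(result), other_form, describe(other)))
    return result


def normalize_or_flag(text, form='NFKC'):
    """Return (normalized text, needs_review) instead of raising."""
    try:
        return checked_normalize(text, form), False
    except ValueError:
        # Fall back to the canonical composed form and flag it for review.
        return unicodedata.normalize('NFC', text), True


print(normalize_or_flag('\u03f9'))  # lunate sigma: flagged for review
print(normalize_or_flag('abc'))     # plain ASCII: passes the check
```

The `namereplace` error handler used in `describe` yields byte strings of the same `\N{...}` style seen in the traceback above.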
## etc
Pull requests and new tickets on the issue tracker are welcome.
*This README has been created with thanks and apologies to https://www.poets.org and to the ghosts of Robert Burns and Edward Lear.*
Raw data
{
"_id": null,
"home_page": "https://github.com/isawnyu/textnorm",
"name": "textnorm",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "\"text normalization\",\"whitespace normalization\",\"unicode normalization\"",
"author": "Tom Elliott",
"author_email": "tom.elliott@nyu.edu",
"download_url": "https://files.pythonhosted.org/packages/b6/1b/0bd32ea2e1dbd6aae113ea7fb6a38454dbb0267d12b25316b241171b1141/textnorm-1.2.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "LICENSE.txt",
"summary": "A simple package for normalizing whitespace and Unicode composition forms in Python 3 strings",
"version": "1.2",
"project_urls": {
"Homepage": "https://github.com/isawnyu/textnorm"
},
"split_keywords": [
"\"text normalization\"",
"\"whitespace normalization\"",
"\"unicode normalization\""
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b04e0fc054e1e9c6d563fd72151427ee4dd37ed5913ca2deb9087f0978f6d472",
"md5": "7f2fea6afb51b762f7c374d85a57934d",
"sha256": "b826ef97e9f631158d71ef4e24ca25e3a85ac62043e1819f069adbfc68f15995"
},
"downloads": -1,
"filename": "textnorm-1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7f2fea6afb51b762f7c374d85a57934d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 17529,
"upload_time": "2020-05-07T09:34:04",
"upload_time_iso_8601": "2020-05-07T09:34:04.371482Z",
"url": "https://files.pythonhosted.org/packages/b0/4e/0fc054e1e9c6d563fd72151427ee4dd37ed5913ca2deb9087f0978f6d472/textnorm-1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "b61b0bd32ea2e1dbd6aae113ea7fb6a38454dbb0267d12b25316b241171b1141",
"md5": "f05497c9acc35b0c914bc038c7f8bfb4",
"sha256": "c04b398d137a6909e4d7f39ee6935d027927b40d083ef11d1f1209f2793662b1"
},
"downloads": -1,
"filename": "textnorm-1.2.tar.gz",
"has_sig": false,
"md5_digest": "f05497c9acc35b0c914bc038c7f8bfb4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19072,
"upload_time": "2020-05-07T09:34:06",
"upload_time_iso_8601": "2020-05-07T09:34:06.522935Z",
"url": "https://files.pythonhosted.org/packages/b6/1b/0bd32ea2e1dbd6aae113ea7fb6a38454dbb0267d12b25316b241171b1141/textnorm-1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-05-07 09:34:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "isawnyu",
"github_project": "textnorm",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "textnorm"
}