neologdn
===========
|downloads| |pyversion| |version| |license|
neologdn is a Japanese text normalizer for `mecab-neologd <https://github.com/neologd/mecab-ipadic-neologd>`_.
Normalization follows the neologd rules:
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Contributions are welcome!
NOTE: Installing this module requires a C++11 compiler.
Installation
------------
::

    $ pip install neologdn
Usage
-----
.. code:: python

    import neologdn
    neologdn.normalize("ﾊﾝｶｸｶﾅ")
    # => 'ハンカクカナ'
    neologdn.normalize("全角記号！？＠＃")
    # => '全角記号!?@#'
    neologdn.normalize("全角記号例外「・」")
    # => '全角記号例外「・」'
    neologdn.normalize("長音短縮ウェーーーーイ")
    # => '長音短縮ウェーイ'
    neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
    # => 'チルダ削除ウェイ'
    neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
    # => 'いろんなハイフン-'
    neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
    # => 'PRML副読本'
    neologdn.normalize(" Natural Language Processing ")
    # => 'Natural Language Processing'
    neologdn.normalize("かわいいいいいいいいい", repeat=6)
    # => 'かわいいいいいい'
    neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
    # => '無駄ァ'
    neologdn.normalize("1995〜2001年", tilde="normalize")
    # => '1995~2001年'
    neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="remove")
    # => '19952001年'
    neologdn.normalize("1995〜2001年")  # Default parameter
    # => '19952001年'
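
To make the effect of the ``repeat`` and ``tilde`` options easier to see, they can be approximated in pure Python. This is only a rough sketch and not neologdn's actual (compiled) implementation: the helper names ``shorten_repeat_sketch`` and ``convert_tilde_sketch`` are hypothetical, and the set of tilde characters is assumed from the usage example rather than taken from the library.

.. code:: python

    import re

    # Tilde-like characters copied from the usage example (an assumption;
    # not necessarily the full set that neologdn handles).
    TILDES = "~∼∾〜〰～"

    # Replacement character for each tilde mode; "ignore" is handled separately.
    _TILDE_REPLACEMENTS = {
        "normalize": "~",
        "normalize_zenkaku": "〜",
        "remove": "",
    }


    def convert_tilde_sketch(text, tilde="remove"):
        """Approximate the tilde option: map every tilde-like character."""
        if tilde == "ignore":
            return text
        replacement = _TILDE_REPLACEMENTS[tilde]
        table = {ord(ch): replacement for ch in TILDES}
        return text.translate(table)


    def shorten_repeat_sketch(text, repeat):
        """Approximate the repeat option: cap consecutive repeats of a substring."""
        if repeat < 1:
            raise ValueError("repeat must be a positive integer")
        # (.+?) is the shortest repeated unit; \1{n,} means it occurs
        # more than n times in a row in total.
        pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
        return pattern.sub(lambda m: m.group(1) * repeat, text)


    shorten_repeat_sketch("かわいいいいいいいいい", 6)
    # => 'かわいいいいいい'
    shorten_repeat_sketch("無駄無駄無駄無駄ァ", 1)
    # => '無駄ァ'
    convert_tilde_sketch("1995〜2001年", tilde="normalize")
    # => '1995~2001年'

The sketch reproduces the outputs listed above for these inputs, but it may differ from neologdn on other text; use ``neologdn.normalize`` for real work.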
Benchmark
----------
.. code:: python

    # Sample code from
    # https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
    import normalize_neologd

    %timeit normalize(normalize_neologd.normalize_neologd)
    # => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    import neologdn
    %timeit normalize(neologdn.normalize)
    # => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
neologdn is about 1.43 times faster than the sample code.
Details are described in the notebook below:
https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
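
``%timeit`` is an IPython magic, so the snippet above runs only inside IPython or Jupyter. In a plain Python script, a comparable measurement could be sketched with the standard ``timeit`` module. The corpus and the ``normalize`` driver below are hypothetical stand-ins (the real ones are in the notebook), and ``unicodedata.normalize`` is used purely as a placeholder normalizer so that the sketch runs without neologdn installed.

.. code:: python

    import timeit
    import unicodedata

    # Hypothetical corpus; the real benchmark data is in the linked notebook.
    SENTENCES = ["ﾊﾝｶｸｶﾅ　ＰＲＭＬ　副読本"] * 1000


    def normalize(normalizer):
        """Apply `normalizer` to every sentence (stand-in for the notebook's driver)."""
        return [normalizer(sentence) for sentence in SENTENCES]


    def nfkc(text):
        """Placeholder normalizer based on Unicode NFKC (not neologdn)."""
        return unicodedata.normalize("NFKC", text)


    def best_time(normalizer, repeat=3, number=5):
        """Return the best average time in seconds for one pass over the corpus."""
        timings = timeit.repeat(lambda: normalize(normalizer),
                                repeat=repeat, number=number)
        return min(timings) / number


    seconds = best_time(nfkc)
    print(f"{seconds:.6f} s per pass")

To time neologdn itself, pass ``neologdn.normalize`` in place of ``nfkc``.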
License
-------
Apache Software License.
Contribution
------------
Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

.. |downloads| image:: https://static.pepy.tech/personalized-badge/neologdn?period=total&units=international_system&left_color=black&right_color=orange&left_text=Downloads
    :target: https://pepy.tech/project/neologdn

.. |version| image:: https://img.shields.io/pypi/v/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: latest version

.. |pyversion| image:: https://img.shields.io/pypi/pyversions/neologdn.svg

.. |license| image:: https://img.shields.io/pypi/l/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: license

CHANGES
========
0.5.2 (2023-08-03)
----------------------------
- Support Python 3.10 and 3.11 (Many thanks @polm)
0.5.1 (2021-05-02)
----------------------------
- Improve performance of shorten_repeat function (Many thanks @yskn67)
- Add tilde option to normalize function
0.4 (2018-12-06)
----------------------------
- Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ
0.3.2 (2018-05-17)
----------------------------
- Add option to suppress removal of spaces between Japanese characters
0.2.2 (2018-03-10)
----------------------------
- Fix bug (daku-ten & handaku-ten)
- Support macOS 10.13 (Many thanks @r9y9)
0.2.1 (2017-01-23)
----------------------------
- Fix bug (check whether the character preceding a daku-ten character is in the maps) (Many thanks @unnonouno)
0.2 (2016-04-12)
----------------------------
- Add lengthened expression (repeating character) threshold
0.1.2 (2016-03-29)
----------------------------
- Fix installation bug
0.1.1.1 (2016-03-19)
----------------------------
- Support Windows
- Explicitly specify -std=c++11 in the build (Many thanks @id774)
0.1.1 (2015-10-10)
----------------------------
Initial release.
Raw data
{
"_id": null,
"home_page": "http://github.com/ikegami-yukino/neologdn",
"name": "neologdn",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "japanese,MeCab",
"author": "Yukino Ikegami",
"author_email": "yknikgm@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/25/74/a0a015e7ce8da5d12be013f3f0cf7ce85c83b9308f4b7419b70a981e41d9/neologdn-0.5.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache Software License",
"summary": "Japanese text normalizer for mecab-neologd",
"version": "0.5.2",
"project_urls": {
"Homepage": "http://github.com/ikegami-yukino/neologdn"
},
"split_keywords": [
"japanese",
"mecab"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2574a0a015e7ce8da5d12be013f3f0cf7ce85c83b9308f4b7419b70a981e41d9",
"md5": "baa609fd1e44fc83e68147e89f042f70",
"sha256": "2f56b2ffddfe7f8613d52b9f6366c224af2bb217c47c1e80e227a348345cce52"
},
"downloads": -1,
"filename": "neologdn-0.5.2.tar.gz",
"has_sig": false,
"md5_digest": "baa609fd1e44fc83e68147e89f042f70",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 86170,
"upload_time": "2023-08-03T12:57:00",
"upload_time_iso_8601": "2023-08-03T12:57:00.886233Z",
"url": "https://files.pythonhosted.org/packages/25/74/a0a015e7ce8da5d12be013f3f0cf7ce85c83b9308f4b7419b70a981e41d9/neologdn-0.5.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-03 12:57:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ikegami-yukino",
"github_project": "neologdn",
"travis_ci": true,
"coveralls": false,
"github_actions": true,
"lcname": "neologdn"
}