:Name: neologdn
:Version: 0.5.2
:Home page: http://github.com/ikegami-yukino/neologdn
:Summary: Japanese text normalizer for mecab-neologd
:Upload time: 2023-08-03 12:57:00
:Author: Yukino Ikegami
:License: Apache Software License
:Keywords: japanese, mecab

neologdn
===========

|downloads| |pyversion| |version| |license|

neologdn is a Japanese text normalizer for `mecab-neologd <https://github.com/neologd/mecab-ipadic-neologd>`_.

The normalization is based on neologd's rules:
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja


Contributions are welcome!

NOTE: Installing this module requires a C++11 compiler.

Installation
------------

::

 $ pip install neologdn

Usage
-----

.. code:: python

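    import neologdn

    # Half-width katakana are converted to full-width katakana
    neologdn.normalize("ﾊﾝｶｸｶﾅ")
    # => 'ハンカクカナ'
    # Full-width symbols are converted to half-width symbols
    neologdn.normalize("全角記号！？＠＃")
    # => '全角記号!?@#'
    # Exceptions such as 「・」 are left as they are
    neologdn.normalize("全角記号例外「・」")
    # => '全角記号例外「・」'
    # Repeated prolonged sound marks are shortened to one
    neologdn.normalize("長音短縮ウェーーーーイ")
    # => '長音短縮ウェーイ'
    # Tilde-like characters are removed by default
    neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
    # => 'チルダ削除ウェイ'
    # Various hyphen-like characters are unified into a single '-'
    neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
    # => 'いろんなハイフン-'
    # Full-width alphabets become half-width; spaces around Japanese characters are removed
    neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
    # => 'PRML副読本'
    # Spaces between Latin words are kept; leading and trailing spaces are stripped
    neologdn.normalize(" Natural Language Processing ")
    # => 'Natural Language Processing'
    # repeat limits how many times a contiguous substring may repeat
    neologdn.normalize("かわいいいいいいいいい", repeat=6)
    # => 'かわいいいいいい'
    neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
    # => '無駄ァ'
    # tilde selects how tilde-like characters are handled
    neologdn.normalize("1995〜2001年", tilde="normalize")
    # => '1995~2001年'
    neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
    # => '1995〜2001年'
    neologdn.normalize("1995〜2001年", tilde="remove")
    # => '19952001年'
    neologdn.normalize("1995〜2001年")  # Default parameter
    # => '19952001年'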
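
The behaviour of the ``repeat`` and ``tilde`` options shown above can be approximated in pure Python. The following is a minimal sketch for illustration only, not neologdn's actual implementation. The helper names ``shorten_repeat_sketch`` and ``convert_tilde_sketch`` are hypothetical, and the sketch assumes that tilde handling is a simple per-character mapping. It is only checked against the documented examples for these two options.

.. code:: python

    import re

    # Tilde-like characters taken from the usage example above.
    TILDES = "~∼∾〜〰～"


    def shorten_repeat_sketch(text, repeat):
        """Cap any contiguous repeated substring at `repeat` copies."""
        if repeat < 1:
            # Assumption: a non-positive threshold means "do nothing".
            return text
        # Match a unit followed by at least `repeat` more copies of itself,
        # i.e. more than `repeat` copies in total.
        pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
        return pattern.sub(lambda m: m.group(1) * repeat, text)


    def convert_tilde_sketch(text, tilde="remove"):
        """Remove or convert tilde-like characters according to `tilde`."""
        if tilde == "ignore":
            return text
        replacement = {
            "remove": "",
            "normalize": "~",           # half-width tilde
            "normalize_zenkaku": "〜",  # full-width wave dash
        }[tilde]
        return "".join(replacement if ch in TILDES else ch for ch in text)


    print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", 1))
    # 無駄ァ
    print(convert_tilde_sketch("1995〜2001年", "normalize"))
    # 1995~2001年

The regular expression relies on a backreference, so it is much slower than a compiled implementation on long inputs; it is meant only to make the semantics of the options concrete.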


Benchmark
----------

.. code:: python

    # Sample code from
    # https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
    import normalize_neologd

    %timeit normalize(normalize_neologd.normalize_neologd)
    # => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    import neologdn
    %timeit normalize(neologdn.normalize)
    # => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


neologdn is about 1.43 times faster than the sample code.

Details are described in the notebook below:
https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
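
Outside IPython, where the ``%timeit`` magic is unavailable, a comparable measurement can be taken with the standard ``timeit`` module. The sketch below is an illustration under stated assumptions rather than the notebook's code: ``time_normalizer``, ``speedup`` and the sample corpus are hypothetical, and ``str.strip`` stands in for a real normalizer so that the snippet runs without neologdn installed. Unlike ``%timeit``, which reports the mean and standard deviation, it reports the best trial.

.. code:: python

    import timeit


    def time_normalizer(func, texts, number=3, repeat=7):
        """Return the best time in seconds for one pass of `func` over `texts`."""
        def run():
            for text in texts:
                func(text)

        # Each trial calls run() `number` times and returns the total time.
        timings = timeit.repeat(run, number=number, repeat=repeat)
        return min(timings) / number


    def speedup(baseline_seconds, candidate_seconds):
        """Return how many times faster the candidate is than the baseline."""
        return baseline_seconds / candidate_seconds


    # Hypothetical corpus; replace it with real Japanese text.
    texts = ["  sample text  "] * 1000
    elapsed = time_normalizer(str.strip, texts)
    print(f"{elapsed:.6f} s per pass")

    # Ratio of the two timings reported above.
    print(round(speedup(9.55, 6.66), 2))
    # 1.43

Passing ``neologdn.normalize`` as ``func`` would time the library itself on the same corpus.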


License
-------

Apache Software License.


Contribution
------------

Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

.. |downloads| image:: https://static.pepy.tech/personalized-badge/neologdn?period=total&units=international_system&left_color=black&right_color=orange&left_text=Downloads
 :target: https://pepy.tech/project/neologdn

.. |version| image:: https://img.shields.io/pypi/v/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: latest version

.. |pyversion| image:: https://img.shields.io/pypi/pyversions/neologdn.svg

.. |license| image:: https://img.shields.io/pypi/l/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: license



CHANGES
========

0.5.2 (2023-08-03)
----------------------------

- Support Python 3.10 and 3.11 (Many thanks @polm)

0.5.1 (2021-05-02)
----------------------------

- Improve performance of shorten_repeat function (Many thanks @yskn67)
- Add tilde option to normalize function

0.4 (2018-12-06)
----------------------------

- Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ

0.3.2 (2018-05-17)
----------------------------

- Add option to suppress removal of spaces between Japanese characters

0.2.2 (2018-03-10)
----------------------------

- Fix bug (daku-ten & handaku-ten)
- Support macOS 10.13 (Many thanks @r9y9)

0.2.1 (2017-01-23)
----------------------------

- Fix bug (check whether the character preceding a daku-ten character is in the maps) (Many thanks @unnonouno)

0.2 (2016-04-12)
----------------------------

- Add lengthened expression (repeating character) threshold

0.1.2 (2016-03-29)
----------------------------

- Fix installation bug

0.1.1.1 (2016-03-19)
----------------------------

- Support Windows
- Explicitly specify -std=c++11 in the build (Many thanks @id774)

0.1.1 (2015-10-10)
----------------------------

Initial release.

            
