natto-py

:Name: natto-py
:Version: 1.0.1
:Home page: https://github.com/buruzaemon/natto-py
:Summary: A Tasty Python Binding with MeCab (FFI-based, no SWIG or compiler necessary)
:Upload time: 2022-09-07 04:25:11
:Author: Brooke M. Fujita
:License: BSD
:Keywords: mecab 和布蕪 納豆 japanese morphological analyzer nlp 形態素解析 自然言語処理 ffi binding バインディング

natto-py
========

What is natto-py?
-----------------
A package leveraging FFI (foreign function interface), ``natto-py`` combines
the Python_ programming language with MeCab_, the part-of-speech and
morphological analyzer for the Japanese language. No compiler is necessary, as
it is **not** a C extension. ``natto-py`` will run on Mac OS, Windows and
\*nix.

You can learn more about `natto-py at GitHub`_.

If you are still using `Python 2 after sunset`_, please stick with version
``natto-py==0.9.2``.

|version| |pyversions| |license| |github-actions| |readthedocs|

Requirements
------------
``natto-py`` requires the following:

- An existing installation of `MeCab 0.996`_
- A system dictionary, like `IPA`_, `Juman`_ or `Unidic`_
- `cffi 0.8.6`_ or greater

The following Python 3 versions are supported:

- `Python 3.7`_
- `Python 3.8`_
- `Python 3.9`_
- `Python 3.10`_

For Python 2, please use version ``0.9.2``.

Installation
------------
Install ``natto-py`` as you would any other Python package:

.. code-block:: bash

    $ pip install natto-py

This will automatically install the ``cffi`` package, which ``natto-py`` uses
to bind to the ``mecab`` library.

Automatic Configuration
-----------------------
As long as the ``mecab`` (and ``mecab-config`` for \*nix and Mac OS)
executables are on your ``PATH``, ``natto-py`` does not require any explicit
configuration.

- On \*nix and Mac OS, it queries ``mecab-config`` to discover the path to the ``libmecab.so`` or ``libmecab.dylib``, respectively.
- On Windows, it queries the Windows Registry to locate the MeCab installation folder.
- In order to convert character encodings to/from Unicode, ``natto-py`` will examine the charset of the ``mecab`` system dictionary.
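
Before relying on automatic configuration, it can help to confirm that the
executables mentioned above are actually visible on your ``PATH``. The
following is a minimal sketch using only the standard library; the helper name
``find_mecab_tools`` is hypothetical and is not part of ``natto-py``:

.. code-block:: python

    import shutil

    def find_mecab_tools(names=("mecab", "mecab-config")):
        """Map each executable name to its full path, or None if not on PATH."""
        return {name: shutil.which(name) for name in names}

    # Report which of the MeCab executables can be found
    for name, path in find_mecab_tools().items():
        print("{}: {}".format(name, path or "not found on PATH"))

If either entry is reported as not found on \*nix or Mac OS, explicit
configuration as described in the next section is probably needed.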

Explicit configuration via MECAB_PATH and MECAB_CHARSET
-------------------------------------------------------
If ``natto-py`` for some reason cannot locate the ``mecab`` library,
or if it cannot determine the correct charset used internally by
``mecab``, then you will need to set the ``MECAB_PATH`` and ``MECAB_CHARSET``
environment variables.

- Set the ``MECAB_PATH`` environment variable to the exact name/path to your ``mecab`` library.
- Set the ``MECAB_CHARSET`` environment variable to the ``charset`` character encoding used by your system dictionary.

e.g., for Mac OS:

.. code-block:: bash

    export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
    export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux:

.. code-block:: bash

    export MECAB_PATH=/usr/local/lib/libmecab.so
    export MECAB_CHARSET=euc-jp

e.g., on Windows:

.. code-block:: bat

    set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
    set MECAB_CHARSET=shift-jis

e.g., from within a Python program:

.. code-block:: python

    import os

    os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
    os.environ['MECAB_CHARSET'] = 'utf-16'
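
If you would rather not overwrite values that are already set in the
environment, one possible approach is to apply defaults only when the
variables are missing. This is a sketch under that assumption; the helper
``configure_mecab`` is hypothetical, and it presumably needs to run before
``MeCab()`` is instantiated:

.. code-block:: python

    import os

    def configure_mecab(libpath, charset, env=None):
        """Set MECAB_PATH and MECAB_CHARSET unless they are already defined."""
        env = os.environ if env is None else env
        env.setdefault("MECAB_PATH", libpath)
        env.setdefault("MECAB_CHARSET", charset)
        return env["MECAB_PATH"], env["MECAB_CHARSET"]

    # Example with a plain dict standing in for os.environ
    settings = {}
    print(configure_mecab("/usr/local/lib/libmecab.so", "utf8", env=settings))

Passing a plain ``dict`` as ``env`` keeps the example free of side effects;
omit it to act on the real process environment.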

Usage
-----
Here's a very quick guide to using ``natto-py``.

Instantiate a reference to the ``mecab`` library, and display some details:

.. code-block:: python

    from natto import MeCab

    nm = MeCab()
    print(nm)

    # displays details about the MeCab instance
    <natto.mecab.MeCab
     model=<cdata 'mecab_model_t *' 0x801c16300>,
     tagger=<cdata 'mecab_t *' 0x801c17470>,
     lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
     libpath="/usr/local/lib/libmecab.so",
     options={},
     dicts=[<natto.dictionary.DictionaryInfo
             dictionary='mecab_dictionary_info_t *' 0x801c19540>,
             filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
             charset=utf8,
             type=0],
     version=0.996>

----

Display details about the ``mecab`` system dictionary used:

.. code-block:: python

    sysdic = nm.dicts[0]
    print(sysdic)

    # displays the MeCab system dictionary info
    <natto.dictionary.DictionaryInfo
     dictionary='mecab_dictionary_info_t *' 0x801c19540>,
     filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
     charset=utf8,
     type=0>

----

Parse Japanese text and send the MeCab result as a single string to
``stdout``:

.. code-block:: python

    print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

    # MeCab result as a single string
    ピンチ    名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
    の      助詞,連体化,*,*,*,*,の,ノ,ノ
    時      名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
    に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
    は      助詞,係助詞,*,*,*,*,は,ハ,ワ
    必ず    副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
    ヒーロー  名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
    が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
    現れる  動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
    。      記号,句点,*,*,*,*,。,。,。
    EOS
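
If you do work with the single-string result, it has to be split apart by
hand. The sketch below assumes MeCab's default output format, in which the
surface and the feature string are separated by a tab and the features by
commas; ``parse_mecab_output`` is a hypothetical helper, not part of
``natto-py``:

.. code-block:: python

    def parse_mecab_output(result):
        """Split MeCab's default output into (surface, [features]) pairs."""
        rows = []
        for line in result.splitlines():
            # skip blank lines and the end-of-sentence marker
            if not line or line == "EOS":
                continue
            surface, _, feature = line.partition("\t")
            rows.append((surface, feature.split(",")))
        return rows

    sample = (
        "ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n"
        "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
        "EOS"
    )
    for surface, features in parse_mecab_output(sample):
        print(surface, features[0])

This only holds for the default format; once ``--node-format`` is used, the
layout of each line is whatever you specified.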

----

Next, try parsing the text with MeCab node parsing. A generator yielding the
MeCabNode instances lets you efficiently iterate over the output without first
materializing each and every resulting MeCabNode instance. The MeCabNode
instances yielded allow access to more detailed information about each
morpheme.

Here we use a `Python with-statement`_ to automatically clean up after we
finish node parsing with the MeCab tagger. This is the recommended approach
for using ``natto-py`` in a production environment:

.. code-block:: python

    # Use a Python with-statement to ensure mecab_destroy is invoked
    #
    with MeCab() as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos():
                print('{}\t{}'.format(n.surface, n.cost))

    # surface and cost of each node
    ピンチ    3348
    の        3722
    時        5176
    に        5083
    は        5305
    必ず    7525
    ヒーロー   11363
    が       10508
    現れる   10841
    。        7127
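
Because node parsing yields nodes lazily, downstream helpers can be written
against any iterable of node-like objects. The sketch below collects the
surfaces of all non-EOS nodes; ``StubNode`` is a hypothetical stand-in that
mimics only the ``surface`` attribute and ``is_eos()`` method used above, so
the example runs without MeCab installed:

.. code-block:: python

    class StubNode:
        """Hypothetical stand-in for a MeCabNode, for illustration only."""

        def __init__(self, surface, eos=False):
            self.surface = surface
            self._eos = eos

        def is_eos(self):
            return self._eos

    def surfaces(nodes):
        """Return the surface of every node that is not end-of-sentence."""
        return [n.surface for n in nodes if not n.is_eos()]

    stub_nodes = [StubNode('ピンチ'), StubNode('の'), StubNode('時'), StubNode('', eos=True)]
    print(surfaces(stub_nodes))

With a real tagger, the same function could presumably be fed the generator
returned by ``nm.parse(text, as_nodes=True)`` inside the ``with`` block.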

----

MeCab output formatting is extremely flexible, and using it is highly
recommended for any serious natural language processing task. Rather than
parsing the MeCab output as a single, large string, use MeCab's
``--node-format`` option (short form ``-F``) to customize the node's
``feature`` attribute. For example, you might extract the following for each
morpheme:

- morpheme surface
- part-of-speech
- part-of-speech ID
- pronunciation

It is good practice when using ``--node-format`` to also specify node 
formatting in the case where the morpheme cannot be found in the dictionary,
by using ``--unk-format`` (short form ``-U``).

This example formats the node ``feature`` to capture the items above as a
comma-separated value:

.. code-block:: python

    # MeCab options used:
    #
    # -F    ... short-form of --node-format
    # %m    ... morpheme surface
    # %f[0] ... part-of-speech
    # %h    ... part-of-speech id (ipadic)
    # %f[8] ... pronunciation
    # 
    # -U    ... short-form of --unk-format
    #           output ?,?,?,? for morphemes not in dictionary
    #
    with MeCab(r'-F%m,%f[0],%h,%f[8]\n -U?,?,?,?\n') as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # only normal nodes, ignore any end-of-sentence and unknown nodes
            if n.is_nor():
                print(n.feature)

    # formatted feature of each normal node
    ピンチ,名詞,38,ピンチ
    の,助詞,24,ノ
    時,名詞,66,トキ
    に,助詞,13,ニ
    は,助詞,16,ワ
    必ず,副詞,35,カナラズ
    ヒーロー,名詞,38,ヒーロー
    が,助詞,13,ガ
    現れる,動詞,31,アラワレル
    。,記号,7,。
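
Once the feature is a predictable comma-separated value, it is straightforward
to turn each one into a structured record. This is a minimal sketch assuming
the four-field format used above; ``Morpheme`` and ``parse_feature`` are
hypothetical names introduced for illustration:

.. code-block:: python

    from collections import namedtuple

    Morpheme = namedtuple("Morpheme", ["surface", "pos", "pos_id", "pronunciation"])

    def parse_feature(feature):
        """Turn a 'surface,pos,pos_id,pronunciation' string into a Morpheme."""
        surface, pos, pos_id, pronunciation = feature.strip().split(",")
        return Morpheme(surface, pos, int(pos_id), pronunciation)

    print(parse_feature("ピンチ,名詞,38,ピンチ"))

Note that this simple split raises ``ValueError`` if a field itself contains
a comma; choosing a different delimiter in ``--node-format`` would avoid
that.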


----

`Partial parsing`_ (制約付き解析) allows you to pass hints to MeCab on
how to tokenize morphemes when parsing. Most useful are boundary constraint
parsing and feature constraint parsing.

With boundary constraint parsing, you can specify either a compiled ``re``
regular expression object or a string to tell MeCab where the boundaries of
a morpheme should be. Use the ``boundary_constraints`` keyword. For hints on
tokenization, please see `Regular expression operations`_ and `re.finditer`_
in particular.

This example uses the ``-F`` node-format option to customize the resulting
``MeCabNode`` feature attribute to extract:

- ``%m`` - morpheme surface
- ``%f[0]`` - node part-of-speech
- ``%s`` - node ``stat`` status value, 1 is ``unknown``

Note that any morpheme captured by a boundary constraint will have a node ``stat`` status of 1 (unknown):

.. code-block:: python

    import re

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '俺は努力したよっ？ お前の10倍、いや100倍1000倍したよっ！'

        # capture 10倍, 100倍 and 1000倍 as single morphemes
        pattern = re.compile('10+倍')

        for n in nm.parse(text, boundary_constraints=pattern, as_nodes=True):
            print(n.feature)

    # formatted feature of each node
    俺, 名詞, 0
    は, 助詞, 0
    努力, 名詞, 0
    し, 動詞, 0
    たよっ, 動詞, 0
    ？, 記号, 0
    お前, 名詞, 0
    の, 助詞, 0
    10倍, 名詞, 1
    、, 記号, 0
    いや, 接続詞, 0
    100倍, 名詞, 1
    1000倍, 名詞, 1
    し, 動詞, 0
    たよっ, 動詞, 0
    ！, 記号, 0
    EOS
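
To preview which spans a pattern would match before passing it as a boundary
constraint, you can run ``re.finditer`` on the text directly. This sketch uses
only the standard library and the text and pattern from the example above;
``preview_boundaries`` is a hypothetical helper:

.. code-block:: python

    import re

    text = '俺は努力したよっ？ お前の10倍、いや100倍1000倍したよっ！'
    pattern = re.compile('10+倍')

    def preview_boundaries(pattern, text):
        """List (start, end, matched text) for each non-overlapping match."""
        return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

    for start, end, token in preview_boundaries(pattern, text):
        print(start, end, token)

The matched strings should correspond to the morphemes reported with ``stat``
status 1 in the output above.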

With feature constraint parsing, you can provide instructions to MeCab
on what feature to use for a matching morpheme. Use the 
``feature_constraints`` keyword to pass in a ``tuple`` containing elements
that themselves are ``tuple`` instances with a specific morpheme (str) 
and a corresponding feature (str), in order of constraint precedence:

.. code-block:: python

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '心の中で3回唱え、 ヒーロー見参！ヒーロー見参！ヒーロー見参！'
        features = (('ヒーロー見参', '感動詞'),)

        for n in nm.parse(text, feature_constraints=features, as_nodes=True):
            print(n.feature)

    # formatted feature of each node
    心, 名詞, 0
    の, 助詞, 0
    中, 名詞, 0
    で, 助詞, 0
    3, 名詞, 1
    回, 名詞, 0
    唱え, 動詞, 0
    、, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    ヒーロー見参, 感動詞, 1
    ！, 記号, 0
    EOS
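
The tuple-of-tuples shape is easy to get wrong, especially the trailing comma
for a single constraint. One way to guard against that is to build it from an
ordered list of pairs. This is a sketch only; ``make_feature_constraints`` is
a hypothetical helper and not part of ``natto-py``:

.. code-block:: python

    def make_feature_constraints(pairs):
        """Build a tuple of (morpheme, feature) tuples, preserving order."""
        constraints = []
        for morpheme, feature in pairs:
            if not isinstance(morpheme, str) or not isinstance(feature, str):
                raise TypeError("morpheme and feature must both be str")
            constraints.append((morpheme, feature))
        return tuple(constraints)

    features = make_feature_constraints([('ヒーロー見参', '感動詞')])
    print(features)

The result can then be passed as the ``feature_constraints`` keyword shown
above.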


----

Learn More
----------
- Examples and more detailed information about ``natto-py`` can be found on the `project Wiki`_.
- Working code in Jupyter notebook form can be found under this `project's notebooks directory`_.
- `API documentation on Read the Docs`_.

Contributing to natto-py
------------------------
- Use git_ and `check out the latest code at GitHub`_ to make sure the
  feature hasn't been implemented or the bug hasn't been fixed yet.
- `Browse the issue tracker`_ to make sure someone hasn't already requested it
  and/or contributed it.
- Fork the project.
- Start a feature/bugfix branch.
- Commit and push until you are happy with your contribution.
- Make sure to add tests for it. This is important so I don't break it in a
  future version unintentionally.
- Please try not to mess with the ``setup.py``, ``CHANGELOG``, or version
  files. If you must have your own version, that is fine, but please isolate
  it in its own commit so I can cherry-pick around it.
- This project uses the following packages for development:

  - Sphinx_ for document generation
  - twine_ for secure uploads during release
  - unittest_ for unit tests, as it is very natural and easy to use
  - PyYAML_ for data loading during tests

Changelog
---------
Please see the ``CHANGELOG`` for the release history.

Copyright
---------
Copyright |copy| 2022, Brooke M. Fujita. All rights reserved. Please see
the ``LICENSE`` file for further details.

.. |version| image:: https://badge.fury.io/py/natto-py.svg
    :target: https://pypi.org/project/natto-py/ 
.. |pyversions| image:: https://img.shields.io/pypi/pyversions/natto-py.svg?style=flat
.. |github-actions| image:: https://github.com/buruzaemon/natto-py/actions/workflows/automated-test-actions.yml/badge.svg
.. |license| image:: https://img.shields.io/badge/license-BSD-blue.svg
    :target: https://raw.githubusercontent.com/buruzaemon/natto-py/master/LICENSE 
.. |readthedocs| image:: https://readthedocs.org/projects/natto-py/badge/?version=master
    :target: http://natto-py.readthedocs.org/en/master/?badge=master
    :alt: Documentation Status
.. _Python: http://www.python.org/
.. _MeCab: http://taku910.github.io/mecab/
.. _Python 2 after sunset: https://www.python.org/doc/sunset-python-2/
.. _IPA: http://taku910.github.io/mecab/#download
.. _Juman: http://taku910.github.io/mecab/#download
.. _Unidic: http://taku910.github.io/mecab/#download
.. _natto-py at GitHub: https://github.com/buruzaemon/natto-py
.. _MeCab 0.996: http://taku910.github.io/mecab/#download
.. _cffi 0.8.6: https://bitbucket.org/cffi/cffi
.. _Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html 
.. _Python 3.8: https://docs.python.org/3.8/whatsnew/3.8.html 
.. _Python 3.9: https://docs.python.org/3.9/whatsnew/3.9.html 
.. _Python 3.10: https://docs.python.org/3/whatsnew/3.10.html 
.. _NLTK3's lead: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
.. _Python with-statement: https://www.python.org/dev/peps/pep-0343/
.. _Partial parsing: http://taku910.github.io/mecab/partial.html
.. _Regular expression operations: https://docs.python.org/3/library/re.html
.. _re.finditer: https://docs.python.org/3/library/re.html#re.finditer
.. _project Wiki: https://github.com/buruzaemon/natto-py/wiki 
.. _project's notebooks directory: https://github.com/buruzaemon/natto-py/tree/master/notebooks
.. _API documentation on Read the Docs: http://natto-py.readthedocs.org/en/master/
.. _git: http://git-scm.com/downloads
.. _check out the latest code at GitHub: https://github.com/buruzaemon/natto-py
.. _Browse the issue tracker: https://github.com/buruzaemon/natto-py/issues
.. _Sphinx: http://sphinx-doc.org/
.. _twine: https://github.com/pypa/twine
.. _unittest: http://pythontesting.net/framework/unittest/unittest-introduction/
.. _PyYAML: https://github.com/yaml/pyyaml 
.. |copy| unicode:: 0xA9 .. copyright sign


Brooke M. Fujita