JapaneseTokenizer

Name: JapaneseTokenizer
Version: 1.3.4
Home page: https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers
Author: Kensuke Mitsuzawa
License: MIT
Keywords: mecab
Upload time: 2017-09-21 08:10:52
Requirements: No requirements were recorded.
|Build Status|\ |MIT License|

What's this?
============

This is a simple Python wrapper for Japanese tokenizers.

This project aims to make calling a tokenizer and splitting a sentence
into tokens as easy as possible.

It also provides a common interface to the various tokenization tools,
so it is easy to compare the output of different tokenizers.

This project is also available on
`GitHub <https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers>`__.

If you find any bugs, please report them via GitHub issues. Pull
requests are also welcome!

Requirements
============

-  Python 2.7
-  Python 3.5

Features
========

-  a simple, common interface shared by the supported tokenizers
-  a simple, common interface for filtering tokens with stopwords or
   part-of-speech conditions
-  a simple interface for adding a user dictionary (MeCab only; see the
   sketch below)
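
For the user-dictionary feature, a minimal sketch is shown below. The
``pathUserDictCsv`` argument name and the CSV path are assumptions for
illustration only; check the ``MecabWrapper`` signature in your installed
version before relying on them.

::

    import JapaneseTokenizer

    # assumed argument name for a MeCab user-dictionary CSV; the path is a placeholder
    mecab_wrapper = JapaneseTokenizer.MecabWrapper(
        dictType='ipadic',
        pathUserDictCsv='/path/to/user_dict.csv')
    print(mecab_wrapper.tokenize('ユーザー辞書の単語を含む文').convert_list_object())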

Supported Tokenizers
--------------------

Mecab
~~~~~

`Mecab <http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html?sess=3f6a4f9896295ef2480fa2482de521f6>`__
is an open-source tokenizer that supports various languages (provided
you have a dictionary for the language).

See the `English
documentation <https://github.com/jordwest/mecab-docs-en>`__ for details.

Juman
~~~~~

`Juman <http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN>`__ is a
tokenizer system developed by Kurohashi laboratory, Kyoto University,
Japan.

Juman handles ambiguous writing styles in Japanese well, and copes with
newly coined words thanks to its huge web-based dictionary.

Juman also provides semantic information about words.

Juman++
~~~~~~~

`Juman++ <http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN++>`__ is a
tokenizer system developed by Kurohashi laboratory, Kyoto University,
Japan.

Juman++ is the successor to Juman. It adopts an RNN model for
tokenization.

Juman++ handles ambiguous writing styles in Japanese well, and copes
with newly coined words thanks to its huge web-based dictionary.

Like Juman, it also provides semantic information about words.

Kytea
~~~~~

`Kytea <http://www.phontron.com/kytea/>`__ is a tokenizer developed by
Graham Neubig.

Kytea uses a different algorithm from those of MeCab and Juman.

Setting up
==========

Tokenizers auto-install
-----------------------

::

    make install

mecab-neologd dictionary auto-install
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    make install_neologd

Tokenizers manual-install
-------------------------

MeCab
~~~~~

See `here <https://github.com/jordwest/mecab-docs-en>`__ to install the
MeCab system.

Mecab Neologd dictionary
~~~~~~~~~~~~~~~~~~~~~~~~

The mecab-neologd dictionary is an extension of the ipadic dictionary,
which is MeCab's basic dictionary.

With the mecab-neologd dictionary, newly coined words, such as the names
of movie actors or companies, are parsed as single tokens.

See `here <https://github.com/neologd/mecab-ipadic-neologd>`__ to
install the mecab-neologd dictionary.

Juman
~~~~~

::

    wget -O juman7.0.1.tar.bz2 "http://nlp.ist.i.kyoto-u.ac.jp/DLcounter/lime.cgi?down=http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/juman/juman-7.01.tar.bz2&name=juman-7.01.tar.bz2"
    bzip2 -dc juman7.0.1.tar.bz2  | tar xvf -
    cd juman-7.01
    ./configure
    make   
    [sudo] make install

Juman++
-------

-  GCC version must be >= 5

::

    wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.02.tar.xz
    tar xJvf jumanpp-1.02.tar.xz
    cd jumanpp-1.02/
    ./configure
    make
    [sudo] make install

Kytea
-----

Install the Kytea system:

::

    wget http://www.phontron.com/kytea/download/kytea-0.4.7.tar.gz
    tar -xvf kytea-0.4.7.tar.gz
    cd kytea-0.4.7
    ./configure
    make
    make install

Kytea has a `Python wrapper <https://github.com/chezou/Mykytea-python>`__
thanks to Michiaki Ariga. Install the Kytea Python wrapper:

::

    pip install kytea

Install
-------

::

    [sudo] python setup.py install

Note
~~~~

During installation, you may see warning messages if ``pyknp`` or
``kytea`` fails to install.

If you see these messages, try re-installing those packages manually.
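
Both packages are published on PyPI, so a plain pip install is usually
enough, for example:

::

    pip install pyknp
    pip install kytea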

Usage
=====

Tokenization example (for Python 3.x; for Python 2.x example code,
please see
`here <https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/examples/examples.py>`__):

::

    import JapaneseTokenizer
    input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
    # ipadic is a well-maintained dictionary #
    mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
    print(mecab_wrapper.tokenize(input_sentence).convert_list_object())

    # neologd is a dictionary automatically generated from a huge web corpus #
    mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
    print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())
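
Because every wrapper shares the same interface, another tokenizer can be
swapped in with the same calls. The class names below (``JumanWrapper``,
``KyteaWrapper``) are assumptions based on the ``MecabWrapper`` naming
pattern; check ``examples/examples.py`` in the repository for the exact
names and constructor arguments.

::

    # continuing from the example above (input_sentence is already defined);
    # wrapper class names are assumed, verify against examples/examples.py
    juman_wrapper = JapaneseTokenizer.JumanWrapper()
    print(juman_wrapper.tokenize(input_sentence).convert_list_object())

    kytea_wrapper = JapaneseTokenizer.KyteaWrapper()
    print(kytea_wrapper.tokenize(input_sentence).convert_list_object())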

Filtering example
-----------------

::

    import JapaneseTokenizer
    # mecab_wrapper and input_sentence come from the tokenization example above
    # filter tokens by stopwords and a part-of-speech condition #
    print(mecab_wrapper.tokenize(input_sentence).filter(stopwords=['テレビ朝日'], pos_condition=[('名詞', '固有名詞')]).convert_list_object())

Part-of-speech structure
------------------------

MeCab, Juman, and Kytea each use a different part-of-speech (POS) tag
system.

You can check the POS tables
`here <http://www.unixuser.org/~euske/doc/postag/>`__.
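
As a rough sketch with MeCab's ipadic tags, ``('名詞', '固有名詞')``
targets proper nouns, while a coarser tuple such as ``('名詞',)`` is
intended to match any noun. Whether one-element tuples are accepted is an
assumption here, so verify it against the ``filter`` method of your
installed version.

::

    # sketch: coarser POS filtering with ipadic tags (MeCab);
    # acceptance of one-element tuples is an assumption
    print(mecab_wrapper.tokenize(input_sentence)
          .filter(pos_condition=[('名詞',)])
          .convert_list_object())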

Similar Package
===============

natto-py
--------

natto-py is a sophisticated package for tokenization. It supports the
following features:

-  easy interface for tokenization
-  importing additional dictionary
-  partial parsing mode

LICENSE
=======

MIT license

CHANGES
=======

0.6(2016-03-05)
---------------

-  first release to PyPI

0.7(2016-03-06)
---------------

-  Juman support (Python 2.x only)
-  Kytea support (Python 2.x only)

0.8(2016-04-03)
---------------

-  fixed a bug when the interface calls JUMAN
-  fixed the version number of jctconv

0.9 (2016-04-05)
----------------

-  Kytea support also for Python 3.x (thanks to @chezou)

1.0 (2016-06-19)
----------------

-  Juman support also for Python 3.x

1.2.5 (2016-12-28)
------------------

-  Fixed bugs in Juman server mode in Python 3.x
-  Added Juman++ support
-  Added support for the ``filter`` method with chained expressions

1.2.6 (2017-01-12)
------------------

-  Introduced a parameter for the text normalization function

   -  All ``\n`` strings are converted into ``。``, because a ``\n``
      string in the input text causes tokenization errors, especially
      in server mode.

1.2.8 (2017-02-22)
------------------

-  Added a Makefile for installing tokenizers.
-  Tests now run on Travis CI.

1.3.0 (2017-02-23)
------------------

-  Introduced a de-normalization function after the tokenization process
   (full-width alphanumerics -> half-width alphanumerics)
-  Detects the path to mecab-config automatically
-  Fixed a bug when initializing the Juman object in Python 2

after 1.3.0
-----------

Change logs after 1.3.0 are in the GitHub releases.

.. |Build Status| image:: https://travis-ci.org/Kensuke-Mitsuzawa/JapaneseTokenizers.svg?branch=travis
   :target: https://travis-ci.org/Kensuke-Mitsuzawa/JapaneseTokenizers
.. |MIT License| image:: http://img.shields.io/badge/license-MIT-blue.svg?style=flat
   :target: LICENSE

            
