JapaneseTokenizer
=================

-  Name: JapaneseTokenizer
-  Version: 1.2.7
-  Home page: https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers
-  Author: Kensuke Mitsuzawa
-  License: MIT
-  Keywords: mecab
-  Upload time: 2017-01-13 00:37:10

What's this?
============

This is a simple wrapper for Japanese tokenizers (a.k.a. morphological
analyzers).

This project aims to make calling a tokenizer and splitting a sentence
into tokens as easy as possible.

It supports several tokenization tools, so you can compare results
among them.

This project is also available on
`GitHub <https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers>`__.

If you find any bugs, please report them in the GitHub issues. Pull
requests are also welcome!

Requirements
============

-  Python 2.7
-  Python 3.5

Features
========

-  You can get a set of tokens from an input sentence
-  You can filter tokens with a part-of-speech condition or stopwords
-  You can add an extension dictionary such as the mecab-neologd
   dictionary
-  You can define your own user dictionary, which forces MeCab to treat
   its entries as single tokens

Supported Tokenization tool
---------------------------

MeCab
~~~~~

`MeCab <http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html?sess=3f6a4f9896295ef2480fa2482de521f6>`__
is an open-source tokenizer that works for various languages (if you
have a dictionary for them).

See the `English
documentation <https://github.com/jordwest/mecab-docs-en>`__ for details.

Juman
~~~~~

`Juman <http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN>`__ is a
tokenizer developed by the Kurohashi laboratory at Kyoto University,
Japan.

Juman handles ambiguous writing styles in Japanese well, and copes with
newly coined words thanks to its large Web-based dictionary.

Juman also gives you the semantic meaning of words.

Juman++
~~~~~~~

`Juman++ <http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN++>`__ is a
tokenizer developed by the Kurohashi laboratory at Kyoto University,
Japan.

Juman++ is the successor of Juman. It adopts an RNN model for
tokenization.

Juman++ handles ambiguous writing styles in Japanese well, and copes
with newly coined words thanks to its large Web-based dictionary.

Juman++ also gives you the semantic meaning of words.
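
The package provides wrapper classes for these tokenizers too. Here is
a minimal sketch; the class names ``JumanWrapper`` and
``JumanppWrapper`` and their default constructor arguments are
assumptions, modeled on the ``MecabWrapper`` interface shown in the
Usage section below:

::

    # -*- coding: utf-8 -*-
    # A hedged sketch: class names and default constructor arguments are
    # assumptions based on the MecabWrapper interface shown in the Usage section.
    from JapaneseTokenizer import JumanWrapper, JumanppWrapper

    sentence = u'テヘランはイランの首都である。'
    # tokenize with Juman and Juman++ and compare the results
    juman_tokens = JumanWrapper().tokenize(sentence=sentence, return_list=True)
    jumanpp_tokens = JumanppWrapper().tokenize(sentence=sentence, return_list=True)
    print(juman_tokens)
    print(jumanpp_tokens)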

Kytea
~~~~~

`Kytea <http://www.phontron.com/kytea/>`__ is a tokenizer developed by
Graham Neubig.

Kytea uses a different algorithm from those of MeCab and Juman.

Setting up
==========

MeCab
-----

See `here <https://github.com/jordwest/mecab-docs-en>`__ for how to
install the MeCab system.

Mecab Neologd dictionary
------------------------

The mecab-neologd dictionary is an extension of the ipadic dictionary,
which is the basic dictionary of MeCab.

With the mecab-neologd dictionary, newly coined words are parsed as
single tokens.

Such words include, for example, movie actor names or company names.

See `here <https://github.com/neologd/mecab-ipadic-neologd>`__ to
install the mecab-neologd dictionary.
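
Once the dictionary is installed, you can ask the wrapper to use it.
A minimal sketch, assuming ``MecabWrapper`` is importable from the
package top level and that ``mecab-config`` lives in the default
``/usr/local/bin`` used in the Usage section below:

::

    # -*- coding: utf-8 -*-
    # dictType='neologd' is one of the documented dictionary options;
    # the import and the mecab-config path are assumptions.
    from JapaneseTokenizer import MecabWrapper

    mecab_neologd = MecabWrapper(dictType='neologd', path_mecab_config='/usr/local/bin')
    # a newly coined word such as a game title should come out as one token
    tokens = mecab_neologd.tokenize(sentence=u'艦隊これくしょんは人気のゲームです。')
    print(tokens)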

Juman
-----

::

    wget -O juman7.0.1.tar.bz2 "http://nlp.ist.i.kyoto-u.ac.jp/DLcounter/lime.cgi?down=http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/juman/juman-7.01.tar.bz2&name=juman-7.01.tar.bz2"
    bzip2 -dc juman7.0.1.tar.bz2  | tar xvf -
    cd juman-7.01
    ./configure
    make   
    [sudo] make install

Juman++
-------

-  GCC version must be >= 5

::

    wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
    tar xJvf jumanpp-1.01.tar.xz
    cd jumanpp-1.01/
    ./configure
    make
    [sudo] make install

Kytea
-----

Install the Kytea system:

::

    wget http://www.phontron.com/kytea/download/kytea-0.4.7.tar.gz
    tar -xvf kytea-0.4.7.tar.gz
    cd kytea-0.4.7
    ./configure
    make
    make install

Kytea has a `Python wrapper <https://github.com/chezou/Mykytea-python>`__
thanks to Michiaki Ariga. Install the Kytea Python wrapper:

::

    pip install kytea
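
After installing the wrapper, you can check the binding directly.
A minimal sketch, assuming the module name ``Mykytea`` from the wrapper
above; the model path is a placeholder, so point it at wherever your
Kytea model is installed:

::

    # -*- coding: utf-8 -*-
    # the model path below is an assumption; adjust it to your installation
    import Mykytea

    mk = Mykytea.Mykytea('-model /usr/local/share/kytea/model.bin')
    # getWS returns the word segmentation of the sentence
    for word in mk.getWS(u'テヘランはイランの首都である。'):
        print(word)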

Part-of-speech structure
========================

MeCab and Juman use different part-of-speech (POS) tag systems.

Keep this in mind when you use them.

You can check the POS tag tables
`here <http://www.unixuser.org/~euske/doc/postag/>`__.
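
For example, with the IPA (ChaSen) tag set that MeCab uses, a POS
condition for the ``filter`` method in the Usage section is written as
``(top-level POS, sub-POS)`` tuples; Juman uses its own tag names, so
the same condition has to be rewritten when you switch wrappers. A
small illustration with IPA-style tags taken from the table linked
above:

::

    # -*- coding: utf-8 -*-
    # IPA (ChaSen) style POS tuples, as listed in the table linked above
    pos_condition_ipa = [
        (u'名詞', u'固有名詞'),  # proper nouns
        (u'名詞', u'一般'),      # common nouns
        (u'動詞', u'自立'),      # independent verbs
    ]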

Install
-------

::

    [sudo] python setup.py install

Usage
=====

Tokenization example (for Python 2.x; to see example code for Python
3.x, please see
`here <https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/examples/examples.py>`__):

::

    # -*- coding: utf-8 -*-
    # the import below assumes the package exposes MecabWrapper at the top level;
    # otherwise see examples/examples.py in the repository for the exact import path
    from JapaneseTokenizer import MecabWrapper

    # input is `unicode` type (in Python 2.x)
    sentence = u'テヘラン(ペルシア語: تهران  ; Tehrān Tehran.ogg 発音[ヘルプ/ファイル]/teɦˈrɔːn/、英語:Tehran)は、西アジア、イランの首都でありかつテヘラン州の州都。人口12,223,598人。都市圏人口は13,413,348人に達する。'

    # make a MecabWrapper object
    # path where the `mecab-config` command exists. You can check it with `which mecab-config`
    # the default value is '/usr/local/bin'
    path_mecab_config = '/usr/local/bin'

    # you can choose from "neologd", "all", "ipadic", "user", ""
    # "ipadic" and "" are equivalent
    dictType = ""

    mecab_wrapper = MecabWrapper(dictType=dictType, path_mecab_config=path_mecab_config)

    # tokenize the sentence. The returned object is a list of tuples
    tokenized_obj = mecab_wrapper.tokenize(sentence=sentence)
    assert isinstance(tokenized_obj, list)

    # the returned object is a "TokenizedSenetence" object if you pass return_list=False
    tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)

Filtering example:

::

    # TokenizedSenetence and FilteredObject are classes defined in the
    # JapaneseTokenizer package (the exact import path depends on the version)
    stopwords = [u'テヘラン']
    assert isinstance(tokenized_obj, TokenizedSenetence)
    # the returned object is a "FilteredObject" instance
    filtered_obj = mecab_wrapper.filter(
        parsed_sentence=tokenized_obj,
        stopwords=stopwords
    )
    assert isinstance(filtered_obj, FilteredObject)

    # the POS condition is a list of tuples
    # you can choose POS tags from the "ChaSen 品詞体系 (IPA品詞体系)" table on this page:
    # http://www.unixuser.org/~euske/doc/postag/#chasen
    pos_condition = [(u'名詞', u'固有名詞'), (u'動詞', u'自立')]
    filtered_obj = mecab_wrapper.filter(
        parsed_sentence=tokenized_obj,
        pos_condition=pos_condition
    )
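
The stopword filter and the POS filter can also be combined in a single
call; this is a short sketch reusing the objects from the examples
above and assuming both keyword arguments may be passed together:

::

    # keep proper nouns and independent verbs, then drop the stopword テヘラン
    filtered_obj = mecab_wrapper.filter(
        parsed_sentence=tokenized_obj,
        pos_condition=[(u'名詞', u'固有名詞'), (u'動詞', u'自立')],
        stopwords=[u'テヘラン']
    )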

Similar Package
===============

natto-py
--------

natto-py is a sophisticated package for tokenization. It supports the
following features:

-  an easy interface for tokenization
-  importing additional dictionaries
-  a partial parsing mode

CHANGES
=======

0.6 (2016-03-05)
----------------

-  First release to PyPI

0.7 (2016-03-06)
----------------

-  Juman support (only for Python 2.x)
-  Kytea support (only for Python 2.x)

0.8 (2016-04-03)
----------------

-  Fixed a bug when the interface calls JUMAN
-  Fixed the version number of jctconv

0.9 (2016-04-05)
----------------

-  Kytea support also for Python 3.x (thanks to @chezou)

1.0 (2016-06-19)
----------------

-  Juman support also for Python 3.x

1.2.5 (2016-12-28)
------------------

-  Fixed bugs in Juman server mode for Python 3.x
-  Added Juman++ support
-  Added support for the ``filter`` method in chain expressions

1.2.6 (2017-01-12)
------------------

-  Introduced a parameter for the text normalization function

   -  All ``\n`` strings are converted into ``。``, because ``\n`` in
      the input text causes tokenization errors, especially in
      server mode.
