thai-segmenter


:Name: thai-segmenter
:Version: 0.4.2
:Home page: https://github.com/Querela/thai-segmenter
:Summary: Thai tokenizer, POS-tagger and sentence segmenter.
:Upload time: 2023-08-21 11:13:48
:Author: Erik Körner
:Requires Python: >=3.4
:License: MIT license
:Keywords: thai, nlp, sentence segmentation, tokenize, pos-tag, longlexto, orchid

========
Overview
========




This package provides utilities for Thai sentence segmentation, word tokenization and POS tagging.
Because of how sentence segmentation is performed, prior tokenization and POS tagging are required and are therefore also provided by this package.

Besides functions for sentence segmentation, tokenization, and tokenization with POS tagging of single sentence strings,
there are also functions for working with large amounts of data in a streaming fashion.
They are also accessible through a commandline script ``thai-segmenter`` that accepts files or standard input/output.
Options allow working with meta-headers or tab-separated data files.

The main functionality for sentence segmentation was extracted, reformatted and slightly rewritten from another project, 
`Question Generation Thai <https://github.com/myscloud/Question-Generation-Thai>`_.

**LongLexTo** is used as a state-of-the-art word/lexeme tokenizer. An implementation was packaged in the above project, but there are also (*original?*) versions on `GitHub <https://github.com/telember/lexto>`_ and on its `homepage <http://www.sansarn.com/lexto/>`_. To make it easier to use for bulk processing in Python, it has been rewritten from Java in pure Python.
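
As a rough illustration of dictionary-based longest matching, which is (assuming the name is descriptive) the general idea behind tokenizers such as LongLexTo, the following sketch greedily picks the longest dictionary entry at each position. It is a simplified, hypothetical example: ``longest_match_tokenize`` and the toy dictionary are not part of this package, and the backtracking and unknown-word handling that a real tokenizer needs are left out.

.. code-block:: python

    def longest_match_tokenize(text, dictionary):
        """Greedy longest-match segmentation (simplified, no backtracking)."""
        max_len = max((len(word) for word in dictionary), default=1)
        tokens = []
        pos = 0
        while pos < len(text):
            # try the longest candidate first, then shrink it character by character
            for end in range(min(len(text), pos + max_len), pos, -1):
                if text[pos:end] in dictionary:
                    break
            else:
                # no dictionary entry starts here: emit one unknown character
                end = pos + 1
            tokens.append(text[pos:end])
            pos = end
        return tokens


    # toy dictionary (an assumption for illustration only)
    toy_dictionary = {"the", "them", "me", "then", "a", "theme"}
    print(longest_match_tokenize("themethen", toy_dictionary))

A greedy strategy like this can pick a long word that leaves an unsegmentable remainder, which is why practical tokenizers add backtracking or scoring on top of it.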

For POS tagging, a Viterbi model with the annotated ORCHID corpus is used (see the `paper <https://www.researchgate.net/profile/Virach_Sornlertlamvanich/publication/2630580_Building_a_Thai_part-of-speech_tagged_corpus_ORCHID/links/02e7e514db19a98619000000/Building-a-Thai-part-of-speech-tagged-corpus-ORCHID.pdf>`_).
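
To give an idea of what Viterbi decoding does, here is a minimal sketch of the algorithm on a toy hidden Markov model. Everything in it is an assumption for illustration: the function ``viterbi``, the two tags and all probabilities are made up and are neither the ORCHID tag set nor this package's implementation.

.. code-block:: python

    import math


    def viterbi(tokens, tags, start_p, trans_p, emit_p, unseen=1e-6):
        """Return the most probable tag sequence for tokens under a toy HMM."""
        if not tokens:
            return []
        # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
        best = [{}]
        for tag in tags:
            prob = start_p[tag] * emit_p[tag].get(tokens[0], unseen)
            best[0][tag] = (math.log(prob), None)
        for i in range(1, len(tokens)):
            best.append({})
            for tag in tags:
                emit = math.log(emit_p[tag].get(tokens[i], unseen))
                best[i][tag] = max(
                    (best[i - 1][prev][0] + math.log(trans_p[prev][tag]) + emit, prev)
                    for prev in tags
                )
        # backtrack from the best final tag
        tag = max(tags, key=lambda t: best[-1][t][0])
        path = [tag]
        for i in range(len(tokens) - 1, 0, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))


    # made-up model parameters (not derived from any corpus)
    TAGS = ["NOUN", "VERB"]
    START_P = {"NOUN": 0.7, "VERB": 0.3}
    TRANS_P = {
        "NOUN": {"NOUN": 0.3, "VERB": 0.7},
        "VERB": {"NOUN": 0.8, "VERB": 0.2},
    }
    EMIT_P = {
        "NOUN": {"dogs": 0.6, "bark": 0.1, "cats": 0.3},
        "VERB": {"bark": 0.7, "chase": 0.3},
    }

    print(viterbi(["dogs", "bark"], TAGS, START_P, TRANS_P, EMIT_P))

Working in log space avoids numeric underflow on long sentences, and the small ``unseen`` probability is a crude stand-in for proper smoothing of words that were never observed with a tag.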

* Free software: MIT license


Installation
============

::

    pip install thai-segmenter


Documentation
=============

To use the project:

.. code-block:: python

    # the tasks can also be imported from ``thai_segmenter.tasks``
    from thai_segmenter import sentence_segment, tokenize, tokenize_and_postag

    # placeholder input, replace it with real Thai text
    text = """foo bar 1234"""

    # [A] Sentence Segmentation
    sentences = sentence_segment(text)
    for sentence in sentences:
        print(str(sentence))

    # [B] Lexeme Tokenization
    tokens = tokenize(text)
    for token in tokens:
        print(token, end=" ", flush=True)
    print()

    # [C] POS Tagging
    sentence_info = tokenize_and_postag(text)
    for token, pos in sentence_info.pos:
        print("{}|{}".format(token, pos), end=" ", flush=True)
    print()


See more possibilities in ``tasks.py`` or ``cli.py``.

Streaming larger sequences can be achieved like this:

.. code-block:: python

    # Streaming
    sentences = ["sent1\n", "sent2\n", "sent3\n"]  # or any iterable (like File)
    from thai_segmenter import line_sentence_segmenter
    sentences_segmented = line_sentence_segmenter(sentences)
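
Because the streaming functions accept any iterable of lines, an open file can be passed directly, so the input never has to be loaded into memory at once. The sketch below shows this pattern with a stand-in segmenter so that it runs without the package installed. ``fake_line_segmenter`` and ``process_stream`` are hypothetical names, and it is an assumption that ``line_sentence_segmenter`` can be consumed lazily in the same way.

.. code-block:: python

    import io


    def fake_line_segmenter(lines):
        """Stand-in for a streaming segmenter: yields one result per non-blank line."""
        for line in lines:
            line = line.strip()
            if line:
                # placeholder "segmentation": split on whitespace
                yield line.split()


    def process_stream(in_stream, out_stream, segmenter=fake_line_segmenter):
        """Lazily read lines, segment them and write one result per output line."""
        count = 0
        for segments in segmenter(in_stream):
            out_stream.write(" | ".join(segments) + "\n")
            count += 1
        return count


    # in-memory streams stand in for open files here
    source = io.StringIO("sent1 a\n\nsent2 b\n")
    sink = io.StringIO()
    print(process_stream(source, sink))
    print(sink.getvalue(), end="")

With the package installed, one would pass ``line_sentence_segmenter`` as ``segmenter`` and adapt the output formatting to whatever objects it yields.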


Commandline tool
----------------

This project also provides a nifty commandline tool ``thai-segmenter`` that does most of the work for you:

.. code-block:: text

    usage: thai-segmenter [-h] {clean,sentseg,tokenize,tokpos} ...

    Thai Segmentation utilities.

    optional arguments:
      -h, --help            show this help message and exit

    Tasks:
      {clean,sentseg,tokenize,tokpos}
        clean               Clean input from non-thai and blank lines.
        sentseg             Sentence segmentize input lines.
        tokenize            Tokenize input lines.
        tokpos              Tokenize and POS-tag input lines.


You can run sentence segmentation like this::

    thai-segmenter sentseg -i input.txt -o output.txt

or even pipe data::

    cat input.txt | thai-segmenter sentseg > output.txt

Use ``-h``/``--help`` to get more information about possible control flow options.
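
The commandline tool can also be driven from a Python script. The following is a minimal sketch that assumes ``thai-segmenter`` is installed and on the ``PATH``; it only uses the ``sentseg`` task and the ``-i``/``-o`` options shown above, and the helper names ``build_sentseg_command`` and ``run_sentseg`` are hypothetical.

.. code-block:: python

    import subprocess


    def build_sentseg_command(input_path, output_path):
        """Build the argument list for ``thai-segmenter sentseg -i ... -o ...``."""
        return [
            "thai-segmenter",
            "sentseg",
            "-i",
            str(input_path),
            "-o",
            str(output_path),
        ]


    def run_sentseg(input_path, output_path):
        """Run sentence segmentation; raises CalledProcessError on a non-zero exit."""
        subprocess.run(build_sentseg_command(input_path, output_path), check=True)


    # only builds and shows the command; call run_sentseg() to execute it
    print(build_sentseg_command("input.txt", "output.txt"))

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.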


You can run it somewhat interactively with::

    thai-segmenter tokpos --stats

Standard input and output are then used. Lines terminated with ``Enter`` are processed and printed immediately. Stop with the key combination ``Ctrl`` + ``D``; the ``--stats`` parameter then outputs some statistics.


WebApp
------

The project also provides a demo WebApp (using ``Flask`` and ``gevent``) that can be installed with::

    pip install -e .[webapp]

and then simply run (in the foreground)::

    thai-segmenter-webapp

Consider running it in a ``screen`` session.

.. code-block:: bash

    # create the screen detached and then attach
    screen -dmS thai-senseg-webapp
    screen -r thai-senseg-webapp

    # in the screen:
    thai-segmenter-webapp

    # and detach with keys [Ctrl]+[A], then [D]

*Please note that it is only a demo webapp to test and visualize how the sentence segmenter works.*


Development
===========

To install the package for development::

    git clone https://github.com/Querela/thai-segmenter.git
    cd thai-segmenter/
    pip install -e .[dev]


After changing the source, run auto code formatting with::

    isort <file>.py
    black <file>.py

And check it afterwards with::

    flake8 <file>.py

The ``setup.py`` also contains the ``flake8`` subcommand as well as an extended ``clean`` command.


Tests
-----

To run all the tests, run::

    tox

You can also optionally run ``pytest`` alone::

    pytest

Or with::

    python setup.py test


Note, to combine the coverage data from all the tox environments run:

.. list-table::
    :widths: 10 90
    :stub-columns: 1

    - - Windows
      - ::

            set PYTEST_ADDOPTS=--cov-append
            tox

    - - Other
      - ::

            PYTEST_ADDOPTS=--cov-append tox
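
The same can be done in a platform-independent way from Python by setting the variable in a copy of the environment before starting ``tox``. This is only a sketch: the helper names ``tox_environment`` and ``run_tox_with_combined_coverage`` are hypothetical and not part of the project, and ``tox`` is assumed to be on the ``PATH``.

.. code-block:: python

    import os
    import subprocess


    def tox_environment(base_env=None):
        """Return a copy of the environment with PYTEST_ADDOPTS=--cov-append set."""
        env = dict(os.environ if base_env is None else base_env)
        env["PYTEST_ADDOPTS"] = "--cov-append"
        return env


    def run_tox_with_combined_coverage():
        """Run tox so that every environment appends to the same coverage data."""
        return subprocess.run(["tox"], env=tox_environment()).returncode


    # only shows the variable that would be set; tox itself is not started here
    print(tox_environment({})["PYTEST_ADDOPTS"])

Calling ``run_tox_with_combined_coverage()`` then starts ``tox`` with that environment and returns its exit code.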


Changelog
=========

0.4.2 (2023-08-23)
------------------

* Fix signature of ``tasks.tokenize_and_postag`` function
* Update ``tox.ini`` to include newer python version, as well as older parameters and flags
* Reformat and lint

0.4.1 (2019-04-08)
------------------

* Fix tokenization / tokenization + POS tagging: return words instead of subwords
* Add ``--escape-special`` and ``--subwords`` parameter to CLI script for tokenization.
  Allows tokenization to further tokenize unknown words (e. g. names)
  as well as escape special characters with angle bracket entities.


0.4.0 (2019-04-08)
------------------

* Add demo webapp with sentence segmentation.
  (NOTE: Running both the webapp and (batch) sentence segmentation at the same time from the same installation is not recommended. It can have unexpected side effects.)
* Some reformat of ``README.rst``


0.3.3 (2019-04-07)
------------------

* Fix duplicate names (class/method for ``sentence_segment``), rename class to ``sentence_segmenter`` (``.py``).


0.3.2 (2019-04-07)
------------------

* Add ``twine`` to extras dependencies.
* Publish module on **PyPI**. (Only ``sdist``, ``bdist_wheel`` can't be built currently.)
* Fix some TravisCI warnings.


0.3.1 (2019-04-07)
------------------

* Add tasks to ``__init__.py`` for easier access.


0.3.0 (2019-04-06)
------------------

* Refactor tasks into ``tasks.py`` to enable better import in case of embedding thai-segmenter into other projects.
* Have it almost release ready. :-)
* Add some more parameters to functions (optional header detection function)
* Flesh out ``README.rst`` with examples and descriptions.
* Add Changelog items.


0.2.1 / 0.2.2 (2019-04-05)
--------------------------

* Many changes, ``bumpversion`` needs to run where ``.bumpversion.cfg`` is located else it silently fails ...
* Strip type hints and add support for Python 3.5 again.
* Add CLI tasks for cleaning, sentseg, tokenize, pos-tagging.
* Add various params, e. g. for selecting columns, skipping headers.
* Fix many bugs for TravisCI (isort, flake8)
* Use iterators / streaming approach for file input/output.


0.2.0 (2019-04-05)
------------------

* Remove support for Python 2.7 and for Python 3.5 and lower because of type hints.
* Added CLI skeleton.
* Add really good ``setup.py``. (with ``black``, ``flake8``)


0.1.0 (2019-04-05)
------------------

* First release version as package.



            

Raw data
========

.. code-block:: json

    {
        "_id": null,
        "home_page": "https://github.com/Querela/thai-segmenter",
        "name": "thai-segmenter",
        "maintainer": "",
        "docs_url": null,
        "requires_python": ">=3.4",
        "maintainer_email": "",
        "keywords": "thai,nlp,sentence segmentation,tokenize,pos-tag,longlexto,orchid",
        "author": "Erik K\u00f6rner",
        "author_email": "koerner@informatik.uni-leipzig.de",
        "download_url": "https://files.pythonhosted.org/packages/10/c0/991b4df414580731e2d62d206cec4d31927d6ed334a33a89567d1b4d4ec9/thai-segmenter-0.4.2.tar.gz",
        "platform": null,
        "bugtrack_url": null,
        "license": "MIT license",
        "summary": "Thai tokenizer, POS-tagger and sentence segmenter.",
        "version": "0.4.2",
        "project_urls": {
            "Changelog": "https://github.com/Querela/thai-segmenter/blob/master/CHANGELOG.rst",
            "Homepage": "https://github.com/Querela/thai-segmenter",
            "Issue Tracker": "https://github.com/Querela/thai-segmenter/issues"
        },
        "split_keywords": [
            "thai",
            "nlp",
            "sentence segmentation",
            "tokenize",
            "pos-tag",
            "longlexto",
            "orchid"
        ],
        "urls": [
            {
                "comment_text": "",
                "digests": {
                    "blake2b_256": "10c0991b4df414580731e2d62d206cec4d31927d6ed334a33a89567d1b4d4ec9",
                    "md5": "1d230ef1f34f7407a552dc6d81a2c849",
                    "sha256": "bb996e520e5e24f7216082aac59f4699a2c5083e38ac2b2950dd517dde47e40e"
                },
                "downloads": -1,
                "filename": "thai-segmenter-0.4.2.tar.gz",
                "has_sig": false,
                "md5_digest": "1d230ef1f34f7407a552dc6d81a2c849",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">=3.4",
                "size": 2402438,
                "upload_time": "2023-08-21T11:13:48",
                "upload_time_iso_8601": "2023-08-21T11:13:48.255072Z",
                "url": "https://files.pythonhosted.org/packages/10/c0/991b4df414580731e2d62d206cec4d31927d6ed334a33a89567d1b4d4ec9/thai-segmenter-0.4.2.tar.gz",
                "yanked": false,
                "yanked_reason": null
            }
        ],
        "upload_time": "2023-08-21 11:13:48",
        "github": true,
        "gitlab": false,
        "bitbucket": false,
        "codeberg": false,
        "github_user": "Querela",
        "github_project": "thai-segmenter",
        "travis_ci": true,
        "coveralls": true,
        "github_actions": false,
        "tox": true,
        "lcname": "thai-segmenter"
    }