segtok

Name: segtok
Version: 1.5.11
Home page: https://github.com/fnl/segtok
Summary: sentence segmentation and word tokenization tools
Upload time: 2021-12-15 21:56:14
Author: Florian Leitner
License: MIT
Keywords: sentence segmenter splitter split word tokenizer token
Requirements: none recorded
======
segtok
======

.. image:: https://img.shields.io/pypi/v/segtok.svg
    :target: https://pypi.python.org/pypi/segtok

.. image:: https://img.shields.io/pypi/l/segtok.svg

.. image:: https://travis-ci.org/fnl/segtok.svg?branch=master
    :target: https://travis-ci.org/fnl/segtok

NB: segtok v2, code-named syntok_, is available and fixes some tricky issues with segtok, in particular splitting sentences with terminals that are not followed by spaces. Like this :-).

-------------------------------------------
Sentence segmentation and word tokenization
-------------------------------------------

The segtok package provides two modules, ``segtok.segmenter`` and ``segtok.tokenizer``.
The segmenter provides functionality for splitting (Indo-European) text into sentences.
The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called *tokens*).
Both modules can also be used from the command-line.
While other Indo-European languages might work, the tools have only been designed with languages such as Spanish, English, and German in mind.
For a more informed introduction to this tool, please read the article on my blog_.
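
As a quick illustration of that pipeline, here is a minimal sketch in Python; the function names ``split_single`` and ``word_tokenizer`` come from the two modules rather than from this paragraph, so treat the exact names as assumptions and see the module sections below::

    from segtok.segmenter import split_single
    from segtok.tokenizer import word_tokenizer

    text = "The U.S. Army wasn't amused. It responded quickly."
    # split the text into sentences, then each sentence into tokens
    for sentence in split_single(text):
        print(word_tokenizer(sentence))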

Install
=======

To use this package, you should have at least the latest release of Python 2.7 or any 3.5+ branch installed.
The package is expected to work with both Python 2.7 and 3.5+ and is tested against the latest releases of those branches as well as Python 3.5.
The easiest way to get ``segtok`` installed is using ``pip`` or any other package manager that works with PyPI::

    pip3 install segtok

*Important*: If you are on a Linux machine and have problems installing the ``regex`` dependency of ``segtok``, make sure you have the ``python-dev`` and/or ``python3-dev`` packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if ``segtok`` meets your needs::

    segmenter README.rst | tokenizer

Test Suite
==========

The testing environment works with ``pytest``, ``tox`` and ``pyenv``.
You first need to install pyenv_ (on OSX with Homebrew: ``brew install pyenv``), and ``tox`` with ``pytest`` (``pip3 install tox pytest``).
Configuring ``pyenv`` depends on the Python versions you have installed.
Here, we assume you have the latest 2.7 and 3 versions installed and only need to provide an environment for testing ``segtok`` against the 3.8 branch::

    pyenv install 3.8.2
    pyenv global system 3.8.2

The second command is essential: it tells ``pyenv`` that your preferred Python binary is the system version first, followed by the 3.8.2 installation.
If you forget the second command, you will see errors like ``ERROR: InvocationError: Failed to get version_info for python3.8: pyenv: python3.8: command not found`` when running ``tox``.
If you only have one Python version installed (say, 2.7), you must also install and globally configure the other version (e.g., the latest 3.x) with ``pyenv`` in order to run the full test matrix.

Finally, to run all of ``segtok``'s unit-test suite, just run ``tox``::

    tox


Usage
=====

For details, please refer to the respective module documentation; this README only provides an overview of the available functionality.

A command-line
--------------

After installing the package, two command-line tools will be available, ``segmenter`` and ``tokenizer``.
Each takes UTF-8 encoded plain-text and transforms it into newline-separated sentences or tokens, respectively.
You can use other encodings in Python 3 simply by reconfiguring your environment encoding, or in any version of Python by forcing a particular encoding with the ``--encoding`` parameter.
The tokenizer assumes that each line contains (at most) one sentence, which is the output format of the segmenter.
To learn more about each tool, please invoke them with their help option (``-h`` or ``--help``).

B ``segtok.segmenter``
----------------------

This module provides several ``split_...`` functions to segment texts into lists of sentences.
In addition, ``to_unix_linebreaks`` *normalizes* linebreaks (including the Unicode linebreak) to newline control characters (``\n``).
The function ``rewrite_line_separators`` can be used to move (rewrite) the newline separators in the input text so that they are placed at the sentence segmentation locations.
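
A minimal sketch of typical segmenter usage, assuming ``split_multi`` is one of the ``split_...`` functions mentioned above (the exact name is an assumption of this example, not spelled out here)::

    from segtok.segmenter import split_multi, to_unix_linebreaks

    raw = "One sentence here.\u2028Another follows on a new line. A third?"
    # normalize the Unicode line separator (and friends) to plain \n first
    text = to_unix_linebreaks(raw)
    # then print each segmented sentence on its own line
    for sentence in split_multi(text):
        print(sentence)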

C ``segtok.tokenizer``
----------------------

This module provides several ``..._tokenizer`` functions to tokenize input sentences into words and symbols.
To get the full functionality, use the ``web_tokenizer``, which will split everything "semantically correctly" except for URLs and e-mail addresses.
In addition, it provides convenience functionality for English texts:
Two compiled patterns (``IS_...``) can be used to detect if a word token contains a possessive-s marker ("Frank's") or is an apostrophe-based contraction ("didn't").
Tokens that match these patterns can then be split using the ``split_possessive_markers`` and ``split_contractions`` functions, respectively.
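
A short sketch combining the pieces named above; the arguments and return values are assumed to be plain lists of token strings::

    from segtok.tokenizer import (
        split_contractions,
        split_possessive_markers,
        web_tokenizer,
    )

    sentence = "Frank's crawler didn't fetch http://example.com/ twice."
    tokens = web_tokenizer(sentence)           # URLs survive as single tokens
    tokens = split_contractions(tokens)        # separates the apostrophe contraction
    tokens = split_possessive_markers(tokens)  # separates the possessive-s marker
    print(tokens)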

Legal
=====

License: `MIT <http://opensource.org/licenses/MIT>`_

Copyright (c) 2014-2021, Florian Leitner. All rights reserved.

Contributors (kudos):

- Mikhail Korobov (@kmike; port to Python2.7 and Travis CI integration)
- Georg Kucsko (@gkucsko; splitting sentences at terminals followed by noise)
- Karthikeyan Singaravelan (@tirkarthi; removing deprecation warnings, #23)
- Primož Godec (@PrimozGodec; fixed LICENSE file in setup.py)

History
=======

- **1.5.11** setup.py: moved the LICENSE.txt file reference from data_files to license_files
- **1.5.10** removed deprecation warning (#23) as well as support for Python 3.3 from tox
- **1.5.9** added the license as a LICENSE.txt file to this repository
- **1.5.7** enhancement: split sentences even if the terminal is followed by invalid characters (contributed by @gkucsko)
- **1.5.6** fixed a bug that would lead to joining lines in single-line mode (#11, reported by @yucongo)
- **1.5.5** support for middle name initials ("Lester P. Pearson") 
- **1.5.4** also support for European-style number-dates with numeric months (24. 12. 2016)
- **1.5.3** added support for European-style number-dates and for months (24. Dez. 2016)
- **1.5.2** fixed a tokenizer bug when parsing URLs ending with root paths (``/``), prevented sentence splitting after U.K., U.S. and E.U. if followed by upper-case ("U.S. Air Force"), added missing Unicode hyphens and apostrophes, and added test suite setup instructions
- **1.5.1** removed ``count_continuations.py`` discussion from README (was only confusing); the segmenter now can preserve tab-separated text IDs before the text itself when reading from STDIN and then inserts a (tab-separated) sentence ID column for each sentence printed to STDOUT: see ``segmenter`` option ``--with-ids``
- **1.5.0** continuation words have been statistically evaluated and some poor choices removed (leading to more [precise] sentence splitting; see issue #9 by @Klim314 on GitHub)
- **1.4.0** the ``word_tokenizer`` no longer splits on colons between digits (time, references, ...)
- **1.3.1** fixed multiple dangling commas and colons (reported by Jim Geovedi)
- **1.3.0** added Python2.7 support and Travis CI test integration (BIG thanks to Mikhail!)
- **1.2.2** made segtok.tokenizer.match protected (renamed to "_match") and fixed UNIX linebreak normalization
- **1.2.1** the length of sentences inside brackets is now parametrized
- **1.2.0** wrote blog_ "documentation" and added chemical formula sub/super-script functionality
- **1.1.2** fixed Unicode list of valid sentence terminals (was missing U+2048)
- **1.1.1** fixed PyPI setup (missing MANIFEST.in for README.rst and "packages" in setup.py)
- **1.1.0** added possessive-s marker and apostrophe contraction splitting of tokens
- **1.0.0** initial release

.. _blog: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
.. _pyenv: https://github.com/yyuu/pyenv
.. _syntok: https://github.com/fnl/syntok



            
