mosestokenizer

Name: mosestokenizer
Version: 1.2.1
Home page: https://github.com/luismsgomes/mosestokenizer
Summary: Wrappers for several pre-processing scripts from the Moses toolkit.
Upload time: 2021-10-22 14:15:07
Author: Luís Gomes
License: LGPLv2
Keywords: text, tokenization, pre-processing

mosestokenizer
==============

This package provides wrappers for some pre-processing Perl scripts from the
Moses toolkit, namely, ``normalize-punctuation.perl``, ``tokenizer.perl``,
``detokenizer.perl`` and ``split-sentences.perl``.

Sample Usage
------------

All provided classes are importable from the package ``mosestokenizer``.

    >>> from mosestokenizer import *

All classes have a constructor that takes a two-letter language code as its
argument (``'en'``, ``'fr'``, ``'de'``, etc.), and the resulting objects
are callable.

When created, these wrapper objects launch the corresponding Perl script as a
background process.  When an object is no longer needed, call its
``.close()`` method to terminate the background process and free system
resources.
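
If you cannot put the calls inside a ``with`` block (introduced below), a
``try``/``finally`` clause gives the same guarantee that the background
process is shut down even if an exception is raised.  A minimal sketch,
using only the API documented here:

    >>> tokenize = MosesTokenizer('en')
    >>> try:
    ...     tokens = tokenize('Hello World!')
    ... finally:
    ...     tokenize.close()  # always terminate the background process
    >>> tokens
    ['Hello', 'World', '!']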

The objects also support the context manager interface.
Thus, if used within a ``with`` block, the ``.close()`` method is invoked
automatically when the block exits.

The following two usages of ``MosesTokenizer`` are equivalent:

    >>> # here we will call .close() explicitly at the end:
    >>> tokenize = MosesTokenizer('en')
    >>> tokenize('Hello World!')
    ['Hello', 'World', '!']
    >>> tokenize.close()

    >>> # here we take advantage of the context manager interface:
    >>> with MosesTokenizer('en') as tokenize:
    ...     tokenize('Hello World!')
    ['Hello', 'World', '!']

As shown above, ``MosesTokenizer`` callable objects take a string and return a
list of tokens (strings).

By contrast, ``MosesDetokenizer`` takes a list of tokens and returns a string:

    >>> with MosesDetokenizer('en') as detokenize:
    ...     detokenize(['Hello', 'World', '!'])
    'Hello World!'
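
Tokenization and detokenization are approximately inverse operations, so
chaining the two wrappers round-trips a simple sentence.  A minimal sketch,
opening both background processes in a single ``with`` statement:

    >>> with MosesTokenizer('en') as tokenize, MosesDetokenizer('en') as detokenize:
    ...     detokenize(tokenize('Hello World!'))
    'Hello World!'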

``MosesSentenceSplitter`` does more than its name says.  Besides splitting
sentences, it also unwraps text, i.e. it tries to guess whether a sentence
continues on the next line.  It takes a list of lines (strings) and returns
a list of sentences (strings):

    >>> with MosesSentenceSplitter('en') as splitsents:
    ...     splitsents([
    ...         'Mr. Smith is away.  Do you want to',
    ...         'leave a message?'
    ...     ])
    ['Mr. Smith is away.', 'Do you want to leave a message?']
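
A common pattern is to combine the two wrappers: split unwrapped text into
sentences, then tokenize each sentence.  A sketch of that pipeline (the
exact tokens shown are indicative; they depend on ``tokenizer.perl``):

    >>> lines = ['Mr. Smith is away.  Do you want to', 'leave a message?']
    >>> with MosesSentenceSplitter('en') as splitsents, MosesTokenizer('en') as tokenize:
    ...     [tokenize(sentence) for sentence in splitsents(lines)]
    [['Mr.', 'Smith', 'is', 'away', '.'], ['Do', 'you', 'want', 'to', 'leave', 'a', 'message', '?']]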


``MosesPunctuationNormalizer`` objects take a string as their argument and
return a string:

    >>> with MosesPunctuationNormalizer('en') as normalize:
    ...     normalize('«Hello World» — she said…')
    '"Hello World" - she said...'
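
Normalization is typically applied before tokenization, so that the
tokenizer only sees plain ASCII punctuation.  A sketch chaining the two
wrappers (again, the exact tokens shown are indicative):

    >>> with MosesPunctuationNormalizer('en') as normalize, MosesTokenizer('en') as tokenize:
    ...     tokenize(normalize('Hello World — she said…'))
    ['Hello', 'World', '-', 'she', 'said', '...']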


License
-------

Copyright © 2016-2021, Luís Gomes <luismsgomes@gmail.com>.

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301 USA
            
