mosestokenizer
==============
This package provides wrappers for some pre-processing Perl scripts from the
Moses toolkit, namely, ``normalize-punctuation.perl``, ``tokenizer.perl``,
``detokenizer.perl`` and ``split-sentences.perl``.
Sample Usage
------------
All provided classes are importable from the package ``mosestokenizer``.
>>> from mosestokenizer import *
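If you prefer explicit imports to the wildcard form above, the four wrapper
classes can also be imported by name (a minimal alternative shown here for
illustration, not a requirement of the package)::

    from mosestokenizer import (
        MosesTokenizer,
        MosesDetokenizer,
        MosesSentenceSplitter,
        MosesPunctuationNormalizer,
    )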
All classes have a constructor that takes a two-letter language code as
argument (``'en'``, ``'fr'``, ``'de'``, etc.), and the resulting objects
are callable.
When instantiated, these wrapper objects launch the corresponding Perl script
as a background process. When an object is no longer needed, call its
``.close()`` method to terminate the background process and free system
resources.
The objects also support the context manager interface.
Thus, if used within a ``with`` block, the ``.close()`` method is invoked
automatically when the block exits.
The following two usages of ``MosesTokenizer`` are equivalent:
>>> # here we will call .close() explicitly at the end:
>>> tokenize = MosesTokenizer('en')
>>> tokenize('Hello World!')
['Hello', 'World', '!']
>>> tokenize.close()
>>> # here we take advantage of the context manager interface:
>>> with MosesTokenizer('en') as tokenize:
...     tokenize('Hello World!')
...
['Hello', 'World', '!']
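When several wrappers are needed at once, their lifetimes can be managed
together with the standard-library ``contextlib.ExitStack``. The following is
a hedged sketch built only on the documented constructor, call and context
manager behaviour; it is not an API of the package itself::

    from contextlib import ExitStack

    from mosestokenizer import MosesTokenizer

    with ExitStack() as stack:
        # one background Perl process per language; all of them are
        # closed automatically when the block exits
        tokenizers = {
            lang: stack.enter_context(MosesTokenizer(lang))
            for lang in ('en', 'de')
        }
        print(tokenizers['en']('Hello World!'))
        print(tokenizers['de']('Hallo Welt!'))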
As shown above, ``MosesTokenizer`` callable objects take a string and return a
list of tokens (strings).
By contrast, ``MosesDetokenizer`` takes a list of tokens and returns a string:
>>> with MosesDetokenizer('en') as detokenize:
...     detokenize(['Hello', 'World', '!'])
...
'Hello World!'
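The two classes compose naturally: tokenizing a string and then detokenizing
the resulting tokens should give back the original text, at least for simple
inputs like this one. A small round-trip sketch, using only the behaviour
documented above::

    from mosestokenizer import MosesTokenizer, MosesDetokenizer

    with MosesTokenizer('en') as tokenize, MosesDetokenizer('en') as detokenize:
        tokens = tokenize('Hello World!')   # ['Hello', 'World', '!']
        restored = detokenize(tokens)       # 'Hello World!'
        assert restored == 'Hello World!'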
``MosesSentenceSplitter`` does more than its name suggests. Besides splitting
sentences, it also unwraps text, i.e. it tries to guess whether a sentence
continues on the next line. It takes a list of lines (strings) and
returns a list of sentences (strings):
>>> with MosesSentenceSplitter('en') as splitsents:
...     splitsents([
...         'Mr. Smith is away. Do you want to',
...         'leave a message?'
...     ])
...
['Mr. Smith is away.', 'Do you want to leave a message?']
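Because the splitter expects a list of lines, a multi-line string can be run
through ``str.splitlines()`` first. A minimal sketch reusing the example
above::

    from mosestokenizer import MosesSentenceSplitter

    text = 'Mr. Smith is away. Do you want to\nleave a message?'

    with MosesSentenceSplitter('en') as splitsents:
        sentences = splitsents(text.splitlines())
        # ['Mr. Smith is away.', 'Do you want to leave a message?']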
``MosesPunctuationNormalizer`` objects take a string as argument and return a
string:
>>> with MosesPunctuationNormalizer('en') as normalize:
...     normalize('«Hello World» — she said…')
...
'"Hello World" - she said...'
License
-------
Copyright © 2016-2021, Luís Gomes <luismsgomes@gmail.com>.
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301 USA
Raw data
--------
{
"_id": null,
"home_page": "https://github.com/luismsgomes/mosestokenizer",
"name": "mosestokenizer",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "text tokenization pre-processing",
"author": "Lu\u00eds Gomes",
"author_email": "luismsgomes@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/8d/84/4f3c1b5b8d796a07e3816cd41f7b1491e2291db4ade5f17b850116fd80e5/mosestokenizer-1.2.1.tar.gz",
"platform": "",
"description": "mosestokenizer\n==============\n\nThis package provides wrappers for some pre-processing Perl scripts from the\nMoses toolkit, namely, ``normalize-punctuation.perl``, ``tokenizer.perl``,\n``detokenizer.perl`` and ``split-sentences.perl``.\n\nSample Usage\n------------\n\nAll provided classes are importable from the package ``mosestokenizer``.\n\n >>> from mosestokenizer import *\n\nAll classes have a constructor that takes a two-letter language code as\nargument (``'en'``, ``'fr'``, ``'de'``, etc) and the resulting objects\nare callable.\n\nWhen created, these wrapper objects launch the corresponding Perl script as a\nbackground process. When the objects are no longer needed, you should call the\n``.close()`` method to close the background process and free system resources.\n\nThe objects also support the context manager interface.\nThus, if used within a ``with`` block, the ``.close()`` method is invoked\nautomatically when the block exits.\n\nThe following two usages of ``MosesTokenizer`` are equivalent:\n\n >>> # here we will call .close() explicitly at the end:\n >>> tokenize = MosesTokenizer('en')\n >>> tokenize('Hello World!')\n ['Hello', 'World', '!']\n >>> tokenize.close()\n\n >>> # here we take advantage of the context manager interface:\n >>> with MosesTokenizer('en') as tokenize:\n >>> tokenize('Hello World!')\n ...\n ['Hello', 'World', '!']\n\nAs shown above, ``MosesTokenizer`` callable objects take a string and return a\nlist of tokens (strings).\n\nBy contrast, ``MosesDetokenizer`` takes a list of tokens and returns a string:\n\n >>> with MosesDetokenizer('en') as detokenize:\n >>> detokenize(['Hello', 'World', '!'])\n ...\n 'Hello World!'\n\n``MosesSentenceSplitter`` does more than the name says. Besides splitting\nsentences, it will also unwrap text, i.e. it will try to guess if a sentence\ncontinues in the next line or not. It takes a list of lines (strings) and\nreturns a list of sentences (strings):\n\n >>> with MosesSentenceSplitter('en') as splitsents:\n >>> splitsents([\n ... 'Mr. Smith is away. Do you want to',\n ... 'leave a message?'\n ... ])\n ...\n ['Mr. Smith is away.', 'Do you want to leave a message?']\n\n\n``MosesPunctuationNormalizer`` objects take a string as argument and return a\nstring:\n\n >>> with MosesPunctuationNormalizer('en') as normalize:\n >>> normalize('\u00abHello World\u00bb \u2014 she said\u2026')\n ...\n '\"Hello World\" - she said...'\n\n\nLicense\n-------\n\nCopyright \u00ae 2016-2021, Lu\u00eds Gomes <luismsgomes@gmail.com>.\n\nThis library is free software; you can redistribute it and/or\nmodify it under the terms of the GNU Lesser General Public\nLicense as published by the Free Software Foundation; either\nversion 2.1 of the License, or (at your option) any later version.\n\nThis library is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU\nLesser General Public License for more details.\n\nYou should have received a copy of the GNU Lesser General Public\nLicense along with this library; if not, write to the Free Software\nFoundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA\n02110-1301 USA",
"bugtrack_url": null,
"license": "LGPLv2",
"summary": "Wrappers for several pre-processing scripts from the Moses toolkit.",
"version": "1.2.1",
"split_keywords": [
"text",
"tokenization",
"pre-processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8d844f3c1b5b8d796a07e3816cd41f7b1491e2291db4ade5f17b850116fd80e5",
"md5": "0004d7cb0200633ac0ce49d25683007a",
"sha256": "438b3e35a221f7930c408e97e3f38af6d0cec74b991eb9edb00a44e3510e836d"
},
"downloads": -1,
"filename": "mosestokenizer-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "0004d7cb0200633ac0ce49d25683007a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 37120,
"upload_time": "2021-10-22T14:15:07",
"upload_time_iso_8601": "2021-10-22T14:15:07.205726Z",
"url": "https://files.pythonhosted.org/packages/8d/84/4f3c1b5b8d796a07e3816cd41f7b1491e2291db4ade5f17b850116fd80e5/mosestokenizer-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2021-10-22 14:15:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "luismsgomes",
"github_project": "mosestokenizer",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "mosestokenizer"
}