# Wiki NLP Tools
Python package to perform language-agnostic tokenization.
## Vision
- Researchers can start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then tokenize those paragraphs into sentences and words for input into models.
- This would be language-agnostic – i.e. the library would work equally well regardless of [Wikipedia language](https://meta.wikimedia.org/wiki/List_of_Wikipedias).
- This would be easily accessible – i.e. each component is an open-source, pip-installable Python library that is configurable but provides good default performance out-of-the-box, so that Wikimedia could use it internally via PySpark UDFs on our cluster and external organizations/researchers could incorporate it into their workflows.
- The connections between stages are transparent – i.e. any text extracted during word tokenization can be connected directly back to the original wikitext or HTML it was derived from.
## Features
- Tokenize text into sentences and words across `300+` languages out-of-the-box
- Abbreviation lists can be used to improve performance
- The word tokenizer takes non-whitespace-delimited languages into account during tokenization
- Input can be exactly reconstructed from the tokenization output
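The last property means tokenization is lossless: whitespace runs are emitted as tokens in their own right, so concatenating the output restores the input byte for byte. A minimal sketch of that invariant (a plain regex splitter for illustration only, not mwtokenizer's actual algorithm):

```python
import re

def lossless_split(text: str) -> list[str]:
    # Emit alternating runs of non-whitespace and whitespace;
    # every character lands in exactly one token, so nothing is dropped.
    return re.findall(r"\S+|\s+", text)

text = "Have Moly and Co. made it?\n\t Yes!"
tokens = lossless_split(text)
assert "".join(tokens) == text  # exact round-trip
```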
### Installation
```bash
$ pip install mwtokenizer
```
### Basic Usage
```python
from mwtokenizer.tokenizer import Tokenizer
# initialize a tokenizer for "en" (English)
tokenizer = Tokenizer(language_code="en")
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
print(list(tokenizer.sentence_tokenize(sample_text, use_abbreviation=True)))
'''
[output] ["Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t ", 'The address is written by Bohr Jr. here!']
'''
print(list(tokenizer.word_tokenize(text=sample_text, use_abbreviation=True)))
'''
[output] ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it', ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ', "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address', ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ', 'here', '!']
'''
```
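Because `word_tokenize` emits the whitespace runs as tokens in their own right, concatenating its output reproduces the input exactly – the reconstruction guarantee listed under Features. Using the token list printed above:

```python
# Token list exactly as printed by word_tokenize() above.
tokens = ['Have', ' ', 'Moly', ' ', 'and', ' ', 'Co.', ' ', 'made', ' ', 'it',
          ' ', 'to', ' ', 'the', ' ', 'shop', ' ', 'near', ' ', 'St.', ' ',
          "Michael's", ' ', 'Church', '??', ' \n\t ', 'The', ' ', 'address',
          ' ', 'is', ' ', 'written', ' ', 'by', ' ', 'Bohr', ' ', 'Jr.', ' ',
          'here', '!']
sample_text = '''Have Moly and Co. made it to the shop near St. Michael's Church?? \n\t The address is written by Bohr Jr. here!'''
# Joining the tokens restores the original string byte for byte.
assert "".join(tokens) == sample_text
```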
## Project Information
- [Licensing](https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/blob/main/LICENSE)
- [Repository](https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools)
- [Issue Tracker](https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues)
- [Contribution Guidelines](CONTRIBUTION.md)
- [Benchmarking](benchmarking/)
- Resource Generation: [Abbreviations & Benchmarking data](notebooks/) + [Sentencepiece Corpus and Training](spc_scripts/)