| Field | Value |
|---|---|
| Name | multi-tokenizer |
| Version | 0.1.4 |
| home_page | None |
| Summary | Python package that provides tokenization of multilingual texts using language-specific tokenizers |
| upload_time | 2024-08-11 03:18:18 |
| maintainer | None |
| docs_url | None |
| author | chandralegend |
| requires_python | <4.0.0,>=3.8.19 |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Multi-Tokenizer
Tokenization of Multilingual Texts using Language-Specific Tokenizers
[PyPI](https://pypi.org/project/multi-tokenizer/)
## Overview
Multi-Tokenizer is a Python package that provides tokenization of multilingual texts using language-specific tokenizers. The package is designed for a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, it uses the `lingua` library to detect the language of each text segment, the `tokenizers` library to create language-specific tokenizers, and then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer introduces additional special tokens that mark language-specific segments; these tokens make it possible to reconstruct the original text after tokenization and allow models to differentiate between the languages present in the text.
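As a rough sketch of the detect-and-route idea described above (not the package's actual internals), the example below uses `lingua` to detect a segment's language and dispatches it to a per-language tokenizer built with the `tokenizers` library. The tokenizer files `english.json` and `hindi.json` and the `<EN>`/`<HI>` marker names are hypothetical placeholders.
```python
# Illustrative sketch only: detect a segment's language with lingua, then
# route it to a language-specific tokenizer. The tokenizer files and the
# <EN>/<HI> marker tokens are hypothetical, not the package's real assets.
from typing import List

from lingua import Language, LanguageDetectorBuilder
from tokenizers import Tokenizer

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.HINDI
).build()

# Map each detected language to a (tokenizer, marker token) pair.
tokenizers_by_lang = {
    Language.ENGLISH: (Tokenizer.from_file("english.json"), "<EN>"),
    Language.HINDI: (Tokenizer.from_file("hindi.json"), "<HI>"),
}


def tokenize_segment(segment: str) -> List[str]:
    """Detect the segment's language and tokenize it with the matching tokenizer."""
    lang = detector.detect_language_of(segment) or Language.ENGLISH  # fall back to English
    tokenizer, marker = tokenizers_by_lang[lang]
    return [marker] + tokenizer.encode(segment).tokens


print(tokenize_segment("The cat is very cute."))
```
The real package additionally splits mixed-language input into segments before detection, which is what the `split_text=True` option in the usage example below controls.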
## Installation
### Using pip
```bash
pip install multi-tokenizer
```
### From Source
```bash
git clone https://github.com/chandralegend/multi-tokenizer.git
cd multi-tokenizer
pip install .
```
## Usage
```python
from multi_tokenizer import MultiTokenizer, PretrainedTokenizers
# specify the language tokenizers to be used
lang_tokenizers = [
PretrainedTokenizers.ENGLISH,
PretrainedTokenizers.CHINESE,
PretrainedTokenizers.HINDI,
]
# create a multi-tokenizer object (split_text=True splits the text into segments for better language detection)
tokenizer = MultiTokenizer(lang_tokenizers, split_text=True)
sentence = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."
# Pretokenize the text
pretokenized_text = tokenizer.pre_tokenize(sentence) # [('<EN>', (0, 1)), ('Translate', (1, 10)), ('Ġthis', (10, 15)), ('Ġhindi', (15, 21)), ...]
# Encode the text
ids, tokens = tokenizer.encode(sentence) # [3, 7235, 6614, 86, 755, 775, 10763, 83, 19412, 276, ...], ['<EN>', 'Tr', 'ans', 'l', 'ate', 'Ġthis', 'Ġhind', ...]
# Decode the tokens
decoded_text = tokenizer.decode(ids) # Translate this hindi sentence to english - बिल्ली बहुत प्यारी है.
```
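Based on the decoded output shown in the comments above, a quick round-trip check (continuing from the example and using only the `encode`/`decode` calls already shown) verifies that the added language markers do not prevent reconstructing the original text:
```python
# Round-trip sanity check, continuing from the example above.
ids, tokens = tokenizer.encode(sentence)
assert tokenizer.decode(ids) == sentence  # decoding reconstructs the original text

# The first token of an English segment is its language marker, e.g. '<EN>'.
print(tokens[0])
```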
## Development Setup
### Prerequisites
- Use the VSCode Dev Containers for easy setup (Recommended)
- Install dev dependencies
```bash
pip install poetry
poetry install
```
### Linting, Formatting and Type Checking
- Add the repository directory to Git's `safe.directory` list
```bash
git config --global --add safe.directory /workspaces/multi-tokenizer
```
- Run the following command to lint and format the code
```bash
pre-commit run --all-files
```
- To install pre-commit hooks, run the following command (Recommended)
```bash
pre-commit install
```
### Running the tests
Run the tests using the following command (the `-n` option requires the `pytest-xdist` plugin):
```bash
pytest -n "auto"
```
## Approaches
1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)
## Evaluation
- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
- [Implementation Plan](support/evaluation.md#9-implementation-plan)
- [Future Expansion](support/evaluation.md#10-future-expansion)
## Contributors
- [Rob Neuhaus](https://github.com/rrenaud) - ⛴👨🏻✈️
- [Chandra Irugalbandara](https://github.com/chandralegend)
- [Alvanli](https://github.com/alvanli)
- [Vishnu Vardhan](https://github.com/VishnuVardhanSaiLanka)
- [Anthony Susevski](https://github.com/asusevski)