multi-tokenizer 0.1.4

- Summary: Python package that provides tokenization of multilingual texts using language-specific tokenizers
- Author: chandralegend
- License: MIT
- Requires Python: <4.0.0,>=3.8.19
- Uploaded: 2024-08-11 03:18:18
# Multi-Tokenizer
Tokenization of Multilingual Texts using Language-Specific Tokenizers

[![PyPI version](https://img.shields.io/pypi/v/multi-tokenizer.svg)](https://pypi.org/project/multi-tokenizer/)

## Overview

Multi-Tokenizer is a Python package that provides tokenization of multilingual texts using language-specific tokenizers. It is designed to be used in a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, the package uses the `lingua` library to detect the language of each text segment, builds language-specific tokenizers with the `tokenizers` library, and then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer also introduces additional special tokens that mark the language-specific tokenization; these tokens make it possible to reconstruct the original text segments after tokenization and let models differentiate between the languages in the text.
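
As a rough mental model, the sketch below shows the idea behind that pipeline. It is not the package's actual internals: the tokenizer file names, the `<HI>` marker, and the `tokenize_segments` helper are illustrative assumptions.

```python
from __future__ import annotations

from lingua import Language, LanguageDetectorBuilder
from tokenizers import Tokenizer

# Detect the language of each text segment with lingua.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.HINDI
).build()

# Hypothetical per-language tokenizer files and language marker tokens.
tokenizers_by_lang = {
    Language.ENGLISH: Tokenizer.from_file("english_tokenizer.json"),
    Language.HINDI: Tokenizer.from_file("hindi_tokenizer.json"),
}
markers = {Language.ENGLISH: "<EN>", Language.HINDI: "<HI>"}


def tokenize_segments(segments: list[str]) -> list[str]:
    """Tokenize each segment with its language's tokenizer, prefixed by a marker token."""
    tokens: list[str] = []
    for segment in segments:
        lang = detector.detect_language_of(segment)
        encoding = tokenizers_by_lang[lang].encode(segment)
        tokens.extend([markers[lang], *encoding.tokens])
    return tokens
```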

## Installation

### Using pip
```bash
pip install multi-tokenizer
```

### From Source
```bash
git clone https://github.com/chandralegend/multi-tokenizer.git
cd multi-tokenizer
pip install .
```

## Usage

```python
from multi_tokenizer import MultiTokenizer, PretrainedTokenizers

# specify the language tokenizers to be used
lang_tokenizers = [
    PretrainedTokenizers.ENGLISH,
    PretrainedTokenizers.CHINESE,
    PretrainedTokenizers.HINDI,
]

# create a multi-tokenizer object (split_text=True splits the text into segments for better language detection)
tokenizer = MultiTokenizer(lang_tokenizers, split_text=True)

sentence = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."

# Pretokenize the text
pretokenized_text = tokenizer.pre_tokenize(sentence) # [('<EN>', (0, 1)), ('Translate', (1, 10)), ('Ġthis', (10, 15)), ('Ġhindi', (15, 21)), ...]

# Encode the text
ids, tokens = tokenizer.encode(sentence) # [3, 7235, 6614, 86, 755, 775, 10763, 83, 19412, 276, ...], ['<EN>', 'Tr', 'ans', 'l', 'ate', 'Ġthis', 'Ġhind', ...]

# Decode the tokens
decoded_text = tokenizer.decode(ids) # Translate this hindi sentence to english - बिल्ली बहुत प्यारी है.
```
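
Because every language span is prefixed with a marker token such as `<EN>`, downstream code can tell the language segments apart after tokenization. Below is a minimal sketch of that idea; the `split_by_language` helper and the two-letter marker pattern are assumptions for illustration, not part of the package API.

```python
from __future__ import annotations

import re


def split_by_language(tokens: list[str]) -> list[tuple[str, list[str]]]:
    """Group tokens under the most recent language marker token."""
    groups: list[tuple[str, list[str]]] = []
    for token in tokens:
        if re.fullmatch(r"<[A-Z]{2}>", token):  # language marker, e.g. <EN>
            groups.append((token, []))
        elif groups:
            groups[-1][1].append(token)
    return groups


# e.g. split_by_language(tokens) -> [('<EN>', ['Tr', 'ans', 'l', 'ate', ...]), ...]
```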


## Development Setup

### Prerequisites
- Use the VS Code Dev Containers setup for an easy start (recommended)
- Install the dev dependencies
    ```bash
    pip install poetry
    poetry install
    ```

### Linting, Formatting and Type Checking
- Mark the repository directory as safe for Git
    ```bash
    git config --global --add safe.directory /workspaces/multi-tokenizer
    ```
- Run the following command to lint and format the code
    ```bash
    pre-commit run --all-files
    ```
- To install pre-commit hooks, run the following command (Recommended)
    ```bash
    pre-commit install
    ```

### Running the tests
Run the tests with the following command:
```bash
pytest -n "auto"
```
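
The `-n "auto"` option comes from the pytest-xdist plugin, which distributes tests across your CPU cores. If it is not already present in your environment, it can be installed separately:

```bash
# pytest -n requires the pytest-xdist plugin
pip install pytest-xdist
```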

## Approaches

1. [Approach 1: Individual tokenizers for each language](support/proposal_1.md)
2. [Approach 2: Unified tokenization approach across languages using UTF-8 encodings](support/proposal_2.md)

## Evaluation

- [Evaluation Methodologies](support/evaluation.md#evaluation-metodologies)
- [Data Collection and Analysis](support/evaluation.md#7-data-collection-and-analysis)
- [Comparative Analysis](support/evaluation.md#8-comparative-analysis)
- [Implementation Plan](support/evaluation.md#9-implementation-plan)
- [Future Expansion](support/evaluation.md#10-future-expansion)

## Contributors

- [Rob Neuhaus](https://github.com/rrenaud) - ⛴👨🏻‍✈️
- [Chandra Irugalbandara](https://github.com/chandralegend)
- [Alvanli](https://github.com/alvanli)
- [Vishnu Vardhan](https://github.com/VishnuVardhanSaiLanka)
- [Anthony Susevski](https://github.com/asusevski)
