TokenizerChanger


Name: TokenizerChanger
Version: 0.3.4
Home page: https://github.com/1kkiRen/Tokenizer-Changer
Summary: Library for manipulating the existing tokenizer.
Upload time: 2024-08-27 23:47:57
Maintainer: None
Docs URL: None
Author: 1kkiren
Requires Python: >=3.9
License: None
Keywords: tokenizer, deletion, tokens
# Tokenizer-Changer

A Python library for manipulating existing tokenizers.

The solution was tested on the Llama3-8B tokenizer.

-----

# Installation

Installation from PyPI:

```bash
pip install tokenizerchanger
```

-----

# Usage

```python
changer = TokenizerChanger(tokenizer, space_sign)
```

Creates a `TokenizerChanger` object. Both arguments are optional: an existing tokenizer and a space sign, which differs from one tokenizer to another. The tokenizer can be an instance of the `PreTrainedTokenizerFast` class from the 🤗 `transformers` library.
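
For example, a minimal sketch of the one-step form (the model name is illustrative, and the import path is assumed from the package name):

```python
from transformers import AutoTokenizer
from tokenizerchanger import TokenizerChanger  # import path assumed from the package name

# Load an existing fast tokenizer (any PreTrainedTokenizerFast should work).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# "Ġ" is the space sign used by Llama3/GPT-2-style BPE tokenizers.
changer = TokenizerChanger(tokenizer, "Ġ")
```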

```python
changer.load_tokenizer(tokenizer)
```

If you did not pass a tokenizer when creating the `TokenizerChanger` object, you can load it with this function.

```python
changer.set_space_sign(space_sign)
```

If you did not set the space sign when creating the `TokenizerChanger` object, you can set it with this function. The default space sign is `Ġ`.
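
Put together, the deferred form might look like this (a sketch; the `▁` space sign shown is the marker typically used by SentencePiece-style tokenizers and is only an illustration):

```python
changer = TokenizerChanger()        # create the changer without a tokenizer
changer.load_tokenizer(tokenizer)   # attach an existing tokenizer later
changer.set_space_sign("▁")         # e.g. SentencePiece-style space marker; default is "Ġ"
```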

## Deletion

```python
changer.delete_tokens(list_of_unwanted_tokens, include_substrings)
```

Deletes the unwanted tokens from the tokenizer. If `include_substrings` is `True` (the default), every occurrence of each token is deleted, including occurrences inside other tokens.
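
For instance (the token strings are illustrative):

```python
unwanted = ["Ġfoo", "Ġbar", "baz"]

# Remove only the listed tokens themselves; pass include_substrings=True
# (the default) to also remove tokens that contain them as substrings.
changer.delete_tokens(unwanted, include_substrings=False)
```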

```python
changer.delete_k_least_frequent_tokens(k=1000)
changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
```

Deletes the k least frequent tokens. The `exclude` argument lists tokens that will be ignored during the deletion of the least frequent tokens.

```python
changer.delete_overlaps(vocab)
```

Finds all intersections between the `tokenizer`'s vocabulary and the `vocab` variable and deletes them from the `tokenizer`. Note that `vocab` must be a `dict`.
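
For example, `vocab` could come from another tokenizer's `get_vocab()`, which returns a token-to-id `dict` (a sketch; the other model name is illustrative):

```python
from transformers import AutoTokenizer

# Token -> id mapping of another tokenizer (a dict, as delete_overlaps expects).
other_vocab = AutoTokenizer.from_pretrained("gpt2").get_vocab()

# Remove every token the two vocabularies have in common.
changer.delete_overlaps(other_vocab)
```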

```python
changer.delete_inappropriate_merges(vocab)
```

Deletes all merges from the `tokenizer` that contradict the `vocab` variable. Note that `vocab` must be a `list[str]`.
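
Here `vocab` is just a plain list of token strings, for example (illustrative values):

```python
# Merges that are inconsistent with this list of tokens are removed.
kept_tokens = ["Ġhello", "Ġworld", "hello", "world"]
changer.delete_inappropriate_merges(kept_tokens)
```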

## Addition

These functions exist because the built-in addition functions do not add tokens/merges properly once some tokens have been deleted: encoding the same text can then produce more tokens, even after the necessary tokens have been added back.

```python
changer.add_tokens(list_of_tokens)
```

Adds the tokens from the list. The indices are assigned automatically.

```python
changer.add_merges(list_of_merges)
```

Adds the merges from the list. If the tokens required for a merge are missing, you will be prompted to add them.
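
A small sketch of both calls; the merge format shown below (space-separated pair strings, as stored in `tokenizer.json`) is an assumption:

```python
# Add the tokens first so the merges below have something to merge into.
changer.add_tokens(["Ġhello", "Ġworld"])

# Each merge is assumed to be a "left right" pair string, as in tokenizer.json.
changer.add_merges(["Ġ hello", "Ġ world"])
```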

## "Get" functions

```python
changer.get_overlapping_tokens(vocab)
```

Returns the intersection between the `tokenizer`'s vocabulary and the `vocab` variable. Note that `vocab` must be a `dict`.

```python
changer.get_overlapping_merges(merges)
```

Returns the intersection between the `tokenizer`'s merges and the `merges` variable. Note that `merges` must be a `list`.
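
Both helpers are read-only, so they can be used to inspect what a deletion would touch before performing it. A sketch, where `other_vocab` is the dict from the earlier example and `other_merges` stands for whatever list of merges you want to compare against:

```python
# Tokens shared by the current tokenizer and another vocabulary (dict input).
shared_tokens = changer.get_overlapping_tokens(other_vocab)

# Merges shared with another merge list (list input).
shared_merges = changer.get_overlapping_merges(other_merges)

print(len(shared_tokens), len(shared_merges))
```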

## Saving

```python
changer.save_tokenizer(path)
```

Saves the current state of the changed tokenizer. The tokenizer configs are saved into the `path` folder (`./updated_tokenizer` by default).

```python
tokenizer = changer.updated_tokenizer()
```

Returns the changed tokenizer.
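
A closing sketch of the save-and-reuse flow; reloading with `PreTrainedTokenizerFast.from_pretrained` assumes that `save_tokenizer` writes standard 🤗 config files, and that `updated_tokenizer()` returns a fast tokenizer as described above:

```python
# Persist the modified tokenizer files.
changer.save_tokenizer("./updated_tokenizer")

# Or keep working with the in-memory object directly.
new_tokenizer = changer.updated_tokenizer()
print(new_tokenizer.encode("Hello world"))

# Reload later, assuming the saved files follow the usual HF layout.
from transformers import PreTrainedTokenizerFast
reloaded = PreTrainedTokenizerFast.from_pretrained("./updated_tokenizer")
```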

            
