# Tokenizer-Changer
Python library for manipulating an existing tokenizer.
The solution was tested on the Llama3-8B tokenizer.
-----
# Installation
Installation from PyPI:
```bash
pip install tokenizerchanger
```
-----
# Usage
```python
changer = TokenizerChanger(tokenizer, space_sign)
```
Create a `TokenizerChanger` object. It optionally takes an existing tokenizer and a space sign, which differs from one tokenizer to another. The tokenizer can be a `PreTrainedTokenizerFast` instance from the 🤗 `transformers` library (backed by the `tokenizers` library).
```python
changer.load_tokenizer(tokenizer)
```
If you did not pass a tokenizer when creating the `TokenizerChanger` object, you can load it with this function.
```python
changer.set_space_sign(space_sign)
```
If you did not set the space sign when creating the `TokenizerChanger` object, you can set it with this function. The default space sign is `Ġ`.
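A complete setup might look like the following sketch. The import path and model name are assumptions (they are not specified above); any fast 🤗 tokenizer should work:
```python
from transformers import AutoTokenizer
from TokenizerChanger import TokenizerChanger  # import path assumed from the package name

# Placeholder model; the package was tested on the Llama3-8B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# One-step setup: pass the tokenizer and space sign directly...
changer = TokenizerChanger(tokenizer, space_sign="Ġ")

# ...or the equivalent two-step setup:
changer = TokenizerChanger()
changer.load_tokenizer(tokenizer)
changer.set_space_sign("Ġ")
```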
## Deletion
```python
changer.delete_tokens(list_of_unwanted_tokens, include_substrings)
```
Deletes the unwanted tokens from the tokenizer. If `include_substrings` is `True`, occurrences of the listed tokens are also removed from inside other tokens. Defaults to `True`.
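For example (the token strings are illustrative; `Ġ` marks a leading space):
```python
# Delete only these exact tokens.
changer.delete_tokens(["ĠKazakhstan", "ĠAstana"], include_substrings=False)

# Also delete any other token that contains one of the listed strings
# (the default behaviour, include_substrings=True).
changer.delete_tokens(["Kazakh"], include_substrings=True)
```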
```python
changer.delete_k_least_frequent_tokens(k=1000)
changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
```
Deletes the k least frequent tokens. The `exclude` argument lists tokens that will be ignored (and therefore kept) during the deletion of the least frequent tokens.
```python
changer.delete_overlaps(vocab)
```
Finds all tokens that appear in both the `tokenizer`'s vocabulary and the `vocab` variable, and deletes them from the `tokenizer`. Notice that `vocab` should be a `dict` variable.
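Here `vocab` is a `{token: id}` mapping, for example another tokenizer's `get_vocab()` output (the entries below are illustrative):
```python
# A {token: id} dict, e.g. from another tokenizer's get_vocab().
other_vocab = {"Ġhello": 15339, "Ġworld": 1917}

# Delete every token that appears in both vocabularies from the changed tokenizer.
changer.delete_overlaps(other_vocab)
```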
```python
changer.delete_inappropriate_merges(vocab)
```
Deletes all merges from `tokenizer` which contradict the `vocab` variable. Notice that `vocab` should be a `list[str]` variable.
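A minimal call sketch with an illustrative token list (the exact notion of a contradicting merge follows the description above):
```python
# Tokens that merges must be consistent with (illustrative).
allowed_tokens = ["Ġhe", "llo", "Ġhello"]

# Delete merges that contradict this token list.
changer.delete_inappropriate_merges(allowed_tokens)
```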
## Addition
These functions exist because the built-in addition functions do not add tokens/merges correctly once some tokens have been deleted: encoding the same text can then produce more tokens, even after the necessary tokens have been added back.
```python
changer.add_tokens(list_of_tokens)
```
Adds the tokens from the list. Their ids are assigned automatically.
```python
changer.add_merges(list_of_merges)
```
Adds the merges from the list. If tokens required for a merge are missing, adding them will be suggested.
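A short sketch of both calls; the merge format (a space-separated pair of existing tokens) is an assumption based on how merges are stored in the tokenizer's JSON:
```python
# Add new tokens; their ids are assigned automatically.
changer.add_tokens(["ĠKazakhstan", "ĠAstana"])

# Add merges as space-separated token pairs (format assumed).
changer.add_merges(["ĠKazakh stan", "ĠAst ana"])
```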
## "Get" functions
```python
changer.get_overlapping_tokens(vocab)
```
Returns the intersection between the `tokenizer`'s vocabulary and the `vocab` variable. Notice that `vocab` should be a `dict` variable.
```python
changer.get_overlapping_merges(merges)
```
Returns the intersection between the `tokenizer`'s merges and the `merges` variable. Notice that `merges` should be a `list` variable.
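For example, to inspect overlaps before deleting anything (the vocabulary and merges below are illustrative, and the merge string format is assumed):
```python
other_vocab = {"Ġhello": 15339, "Ġworld": 1917}  # {token: id} dict
other_merges = ["Ġhe llo", "Ġwor ld"]            # merge strings (format assumed)

shared_tokens = changer.get_overlapping_tokens(other_vocab)
shared_merges = changer.get_overlapping_merges(other_merges)
print(len(shared_tokens), len(shared_merges))
```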
## Saving
```python
changer.save_tokenizer(path)
```
Saves the current state of the changed tokenizer, together with the tokenizer configs, into the `path` folder (`./updated_tokenizer` by default).
```python
tokenizer = changer.updated_tokenizer()
```
Returns the changed tokenizer.
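Putting the two together (the path and sample text are placeholders, and the returned object is assumed to expose the usual 🤗 tokenizer interface):
```python
# Write the updated tokenizer files into the given folder.
changer.save_tokenizer("./updated_tokenizer")

# Or keep working with the modified tokenizer in memory.
new_tokenizer = changer.updated_tokenizer()
print(new_tokenizer.encode("Hello world"))  # interface assumed from the wrapped tokenizer
```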