# ICE Tokenizer
- Token ids `[0, 20000)` are image tokens.
- Token ids `[20000, 20100)` are common tokens, mainly punctuation. E.g., `icetk[20000] == '<unk>'`, `icetk[20003] == '<pad>'`, `icetk[20006] == ','`.
- Token ids `[20100, 83823)` are English tokens.
- Token ids `[83823, 145653)` are Chinese tokens.
- Token ids `[145653, 150000)` are rare tokens. E.g., `icetk[145803] == 'α'`. (See the sketch after this list.)
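As an illustration, here is a minimal sketch that maps a global token id to its region according to the ranges above (`token_region` is a hypothetical helper for this README, not part of the icetk API):

```python
def token_region(token_id: int) -> str:
    """Return which vocabulary region a global icetk id falls into."""
    if 0 <= token_id < 20000:
        return 'image'
    if token_id < 20100:
        return 'common'   # <unk>, <pad>, punctuation, ...
    if token_id < 83823:
        return 'english'
    if token_id < 145653:
        return 'chinese'
    if token_id < 150000:
        return 'rare'
    raise ValueError(f'{token_id} is outside the 150000-token vocabulary')

assert token_region(20006) == 'common'  # ','
assert token_region(145803) == 'rare'   # 'α'
```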
You can install the package via:
```
pip install icetk
```
## Tokenization
```python
from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # decoding always recovers the text exactly (as long as there is no <unk>)
ids = icetk.encode('你好世界!这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]
ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398, ..., 7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
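# note: image_size=256 with compress_rate=8 gives a (256/8) x (256/8) = 32 x 32
# grid of image tokens, hence the 1024 ids; 'cuda:0' shows the image tokenizer
# ran on GPU in this example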
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
# add special tokens
icetk.add_special_tokens(['<start_of_image>', '<start_of_english>', '<start_of_chinese>'])
# handling linebreaks ('\n')
icetk.decode(icetk.encode('abc\nhi', ignore_linebreak=False))
# 'abc\nhi'
icetk.decode(icetk.encode('abc\nhi'))
# 'abc hi'
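# (by default ignore_linebreak=True, so '\n' round-trips to a plain space, as above)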
# discourage rare composed tokens
icetk.tokenize('//--------')
# ['▁//', '--------']
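# note: the text tokenizer's internal ids appear to be offset by the 20000 image
# tokens, so range(125653, 130000) covers the global rare-token range [145653, 150000)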
icetk.text_tokenizer.discourage_ids(range(125653,130000)) # or use icetk.text_tokenizer.discourage_tokens
icetk.tokenize('//--------')
# ['▁//', '-', '-', '-', '-', '-', '-', '-', '-']
```