| Field | Value |
| --- | --- |
| Name | ai21-tokenizer |
| Version | 0.12.0 |
| Summary | None |
| Home page | None |
| Author | AI21 Labs |
| Maintainer | None |
| Requires Python | <4.0,>=3.8 |
| License | None |
| Upload time | 2024-08-21 11:42:05 |
| Requirements | none recorded |
<h1 align="center">
<a href="https://github.com/AI21Labs/ai21-tokenizer">AI21 Labs Tokenizer</a>
</h1>
<p align="center">
<em>A SentencePiece-based tokenizer for production use with AI21's models</em>
</p>
<p align="center">
<a href="https://github.com/AI21Labs/ai21-tokenizer/actions?query=workflow%3ATest+event%3Apush+branch%3Amain"><img src="https://github.com/AI21Labs/ai21-tokenizer/actions/workflows/test.yaml/badge.svg" alt="Test"></a>
<a href="https://pypi.org/project/ai21-tokenizer" target="_blank"><img src="https://img.shields.io/pypi/v/ai21-tokenizer?color=%2334D058&label=pypi%20package" alt="Package version"></a>
<a href="https://pypi.org/project/ai21-tokenizer" target="_blank"><img src="https://img.shields.io/pypi/pyversions/ai21-tokenizer?color=%2334D058" alt="Supported Python versions"></a>
<a href="https://python-poetry.org/" target="_blank"><img src="https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json" alt="Poetry"></a>
<a href="https://github.com/semantic-release/semantic-release" target="_blank"><img src="https://img.shields.io/badge/semantic--release-python-e10079?logo=semantic-release" alt="Supported Python versions"></a>
<a href="https://opensource.org/licenses/Apache-2.0" target="_blank"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a>
</p>
---
## Installation
### pip
```bash
pip install ai21-tokenizer
```
### poetry
```bash
poetry add ai21-tokenizer
```
## Usage
### Tokenizer Creation
### Jamba 1.5 Mini Tokenizer
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
Alternatively, you can instantiate the Jamba 1.5 Mini tokenizer directly:
```python
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
#### Async usage
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
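To make the `# Your code here` placeholders concrete, here is a minimal sketch that loads the pretrained Jamba 1.5 Mini tokenizer and round-trips a sentence, using only the `encode`/`decode` functions described under Functions below:
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

# Load the pretrained Jamba 1.5 Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)

# Encode a sentence into token ids and decode it back to text
ids = tokenizer.encode("Hello from AI21!")
print(ids)
print(tokenizer.decode(ids))
```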
### Jamba 1.5 Large Tokenizer
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
Alternatively, you can instantiate the Jamba 1.5 Large tokenizer directly:
```python
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```
#### Async usage
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```
### Jamba Instruct Tokenizer
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Alternatively, you can instantiate the Jamba Instruct tokenizer directly:
```python
from ai21_tokenizer import JambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
```
#### Async usage
```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```
Alternatively, you can use the async Jamba Instruct tokenizer's `create` class method:
```python
from ai21_tokenizer import AsyncJambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
```
### J2 Tokenizer
```python
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
```
Alternatively, you can instantiate the Jurassic tokenizer directly:
```python
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that end with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```
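The `config` argument mirrors the tokenizer's `config.json` file. A minimal sketch of loading it from disk with the standard library (the paths here are placeholders for your own files):
```python
import json
from pathlib import Path

from ai21_tokenizer import JurassicTokenizer

# Placeholder paths; point these at your own SentencePiece model and config
model_path = "j2-tokenizer/jurassic.model"
config = json.loads(Path("j2-tokenizer/config.json").read_text())

tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```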
#### Async usage
```python
from ai21_tokenizer import Tokenizer
tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
```
Alternatively, you can use the async Jurassic tokenizer's `create` class method:
```python
from ai21_tokenizer import AsyncJurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that end with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
```
### Functions
#### Encode and Decode
These functions let you encode text into a list of token ids and decode it back to plaintext:
```python
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
#### Async
```python
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
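The bare `await` expressions above must run inside a coroutine. A minimal complete script, assuming the pretrained Jamba 1.5 Mini tokenizer and the standard-library `asyncio` runner:
```python
import asyncio

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers


async def main() -> None:
    # Create an async tokenizer and round-trip a sentence
    tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)

    encoded = await tokenizer.encode("apple orange banana")
    print(f"Encoded text: {encoded}")

    decoded = await tokenizer.decode(encoded)
    print(f"Decoded text: {decoded}")


asyncio.run(main())
```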
#### What if you want to convert tokens to ids, or vice versa?
```python
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
```
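Putting the two conversions together, a short round-trip sketch (assuming a synchronous tokenizer created as above); converting ids to tokens and back should reproduce the original ids:
```python
# Assuming you have created a synchronous tokenizer
ids = tokenizer.encode("apple orange banana")
tokens = tokenizer.convert_ids_to_tokens(ids)

# id -> token -> id conversion should be lossless for ids produced by encode
assert tokenizer.convert_tokens_to_ids(tokens) == ids
```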
#### Async
```python
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
```
**For more examples, please see our [examples](examples) folder.**