ai21-tokenizer

Name: ai21-tokenizer
Version: 0.12.0
Upload time: 2024-08-21 11:42:05
Author: AI21 Labs
Requires Python: <4.0,>=3.8
Home page: None
Summary: None
Maintainer: None
Docs URL: None
License: None
Keywords: None
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
            <h1 align="center">
    <a href="https://github.com/AI21Labs/ai21-tokenizer">AI21 Labs Tokenizer</a>
</h1>

<p align="center">
    <em>A SentencePiece-based tokenizer for production use with AI21's models</em>
</p>

<p align="center">
<a href="https://github.com/AI21Labs/ai21-tokenizer/actions?query=workflow%3ATest+event%3Apush+branch%3Amain"><img src="https://github.com/AI21Labs/ai21-tokenizer/actions/workflows/test.yaml/badge.svg" alt="Test"></a>
<a href="https://pypi.org/project/ai21-tokenizer" target="_blank"><img src="https://img.shields.io/pypi/v/ai21-tokenizer?color=%2334D058&label=pypi%20package" alt="Package version"></a>
<a href="https://pypi.org/project/ai21-tokenizer" target="_blank"><img src="https://img.shields.io/pypi/pyversions/ai21-tokenizer?color=%2334D058" alt="Supported Python versions"></a>
<a href="https://python-poetry.org/" target="_blank"><img src="https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json" alt="Poetry"></a>
<a href="https://github.com/semantic-release/semantic-release" target="_blank"><img src="https://img.shields.io/badge/semantic--release-python-e10079?logo=semantic-release" alt="Semantic Release"></a>
<a href="https://opensource.org/licenses/Apache-2.0" target="_blank"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a>
</p>

---

## Installation

### pip

```bash
pip install ai21-tokenizer
```

### poetry

```bash
poetry add ai21-tokenizer
```

## Usage

### Tokenizer Creation

### Jamba 1.5 Mini Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```

Another way would be to use our Jamba 1.5 Mini tokenizer directly:

```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```

#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
```
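The `await` above has to run inside an event loop. A minimal sketch of the driver pattern, using a stand-in coroutine in place of `Tokenizer.get_async_tokenizer` (the stub below is not part of the library):

```python
import asyncio

# Stand-in for Tokenizer.get_async_tokenizer: the real factory loads the
# tokenizer's model files asynchronously before returning the instance.
async def load_tokenizer_stub():
    return "tokenizer-instance"  # placeholder for the real tokenizer object

async def main():
    tokenizer = await load_tokenizer_stub()
    # Your code here
    return tokenizer

tokenizer = asyncio.run(main())
```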

### Jamba 1.5 Large Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```

Another way would be to use our Jamba 1.5 Large tokenizer directly:

```python
from ai21_tokenizer import Jamba1_5Tokenizer

model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
```

#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
```

### Jamba Instruct Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```

Another way would be to use our Jamba tokenizer directly:

```python
from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
```

#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```

Another way would be to use our async Jamba tokenizer's `create` class method:

```python
from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
```

### J2 Tokenizer

```python
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here
```

Another way would be to use our Jurassic tokenizer directly:

```python
from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary of your config.json file's contents
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```
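The `config` argument is a plain dictionary. A sketch of building it from a `config.json` file, assuming hypothetical keys for illustration (the real file ships alongside the `.model` file and its keys may differ):

```python
import json
from pathlib import Path

# Hypothetical config.json contents; the actual keys depend on the model.
raw_config = '{"add_bos_token": true, "add_eos_token": false}'
config = json.loads(raw_config)

# With the file on disk you would read it instead:
# config = json.loads(Path("config.json").read_text())
```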

#### Async usage

```python
from ai21_tokenizer import Tokenizer

tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
```

Another way would be to use our async Jurassic tokenizer's `create` class method:

```python
from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary of your config.json file's contents
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
```

### Functions

#### Encode and Decode

These functions let you encode your text into a list of token ids and decode it back to plaintext:

```python
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```
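Encode and decode form a round trip: decoding the ids recovers the original text. A toy whitespace tokenizer with the same two-method interface illustrates the property (this is a stand-in for illustration, not the library's implementation):

```python
class ToyTokenizer:
    """Whitespace tokenizer mapping each distinct word to an integer id."""

    def __init__(self):
        self._vocab = {}    # word -> id
        self._inverse = []  # id -> word

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self._vocab:
                self._vocab[word] = len(self._inverse)
                self._inverse.append(word)
            ids.append(self._vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self._inverse[i] for i in ids)

tokenizer = ToyTokenizer()
encoded = tokenizer.encode("apple orange banana")
decoded = tokenizer.decode(encoded)
```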

#### Async

```python
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```

#### What if you want to convert your tokens to ids, or vice versa?

```python
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"Tokens corresponding to the IDs: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
```
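These conversions are inverse lookups over the vocabulary. A sketch with a toy vocab dict (the entries below are made up; real vocabularies come from the SentencePiece model file):

```python
# Toy vocabulary mapping SentencePiece-style tokens to ids.
vocab = {"\u2581apple": 0, "\u2581orange": 1, "\u2581banana": 2}
inverse = {i: tok for tok, i in vocab.items()}

def convert_ids_to_tokens(ids):
    return [inverse[i] for i in ids]

def convert_tokens_to_ids(tokens):
    return [vocab[t] for t in tokens]

tokens = convert_ids_to_tokens([0, 2])
ids = convert_tokens_to_ids(tokens)
```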

#### Async

```python
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"Tokens corresponding to the IDs: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)
```

**For more examples, please see our [examples](examples) folder.**


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ai21-tokenizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "AI21 Labs",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/39/80/183f0bcdcb707a7e6593ff048b60d7e127d241ef8bef58c0a4dc7d1b63c7/ai21_tokenizer-0.12.0.tar.gz",
    "platform": null,
    "description": "(verbatim duplicate of the README shown above; omitted)",
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "0.12.0",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "18956ea741600ed38100a7d01f58b3e61608b753f7ed75ff0dc45b4397443c75",
                "md5": "c616bba971bd67d9480cc8cdeba7d21f",
                "sha256": "7fd37b9093894b30b0f200e5f44fc8fb8772e2b272ef71b6d73722b4696e63c4"
            },
            "downloads": -1,
            "filename": "ai21_tokenizer-0.12.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c616bba971bd67d9480cc8cdeba7d21f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8",
            "size": 2675582,
            "upload_time": "2024-08-21T11:42:03",
            "upload_time_iso_8601": "2024-08-21T11:42:03.628285Z",
            "url": "https://files.pythonhosted.org/packages/18/95/6ea741600ed38100a7d01f58b3e61608b753f7ed75ff0dc45b4397443c75/ai21_tokenizer-0.12.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3980183f0bcdcb707a7e6593ff048b60d7e127d241ef8bef58c0a4dc7d1b63c7",
                "md5": "4dfe06f027d3b761108684ea2d536b5c",
                "sha256": "d2a5b17789d21572504b7693148bf66e692bdb3ab563023dbcbee340bcbd11c6"
            },
            "downloads": -1,
            "filename": "ai21_tokenizer-0.12.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4dfe06f027d3b761108684ea2d536b5c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8",
            "size": 2622526,
            "upload_time": "2024-08-21T11:42:05",
            "upload_time_iso_8601": "2024-08-21T11:42:05.066564Z",
            "url": "https://files.pythonhosted.org/packages/39/80/183f0bcdcb707a7e6593ff048b60d7e127d241ef8bef58c0a4dc7d1b63c7/ai21_tokenizer-0.12.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-21 11:42:05",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "ai21-tokenizer"
}
        